> You can set the FP rate arbitrarily low, but they will eventually happen. "Bas...

omginternets · on Dec 8, 2022

> “Eventually” needs clarification here.

We're discussing mathematics, so it has a precise meaning: as the number of samples approaches infinity, the probability of observing a false-positive approaches 1.

As I mentioned in another comment, your claim is both formally and practically incorrect. It is formally incorrect for the reason above. Given enough samples, a false-positive must occur.
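The formal claim above can be made concrete with a minimal sketch. It assumes every sample is an independent trial with the same false-positive rate `p`, in which case the probability of at least one false positive in `n` samples is 1 - (1 - p)^n. The function name and the rate `1e-9` are hypothetical, chosen only for illustration; they are not figures from any real system.

```python
import math

def prob_at_least_one_fp(p: float, n: int) -> float:
    """P(at least one false positive) over n independent samples.

    Assumes each sample has the same false-positive rate p, with 0 <= p < 1.
    """
    # 1 - (1 - p)**n, computed via log1p/expm1 so a tiny p is not rounded away.
    return -math.expm1(n * math.log1p(-p))

# Hypothetical per-sample rate, for illustration only.
for n in (10**3, 10**6, 10**9, 10**12):
    print(f"n={n:>17,}  P(at least one FP)={prob_at_least_one_fp(1e-9, n):.6f}")
```

Under these assumptions the probability rises toward 1 as `n` grows for any fixed `p > 0`. How fast it rises depends entirely on how `n` compares with `1/p`.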

It is practically incorrect because there exists no image-classification system whose FP rate is small enough that multiple FPs won't be observed daily, given the number of samples at play [0].

[0] Except, of course, for the trivial case in which you allow an exorbitant number of misses, but surely this isn't what you're arguing...

JimDabell · on Dec 8, 2022

> We're discussing mathematics

No, we are discussing a real-world system.

> as the number of samples approaches infinity

There are not an infinite number of people with Apple accounts.

omginternets · on Dec 8, 2022

Actually, we're talking about both, and you are wrong on both counts.

If you weren't, you would be able to point to an existing system that has demonstrated the capability of classifying an image set on the order of iCloud's, without producing a false-positive. You can't, because such a system doesn't exist.

Even the "solved" problem of OCR isn't capable of such a feat: https://youtu.be/XxCha4Kez9c

JimDabell · on Dec 8, 2022

> you would be able to point to an existing system that has demonstrated the capability of classifying an image set on the order of iCloud's, without producing a false-positive.

This is not relevant to what I was saying; it sounds like you might be mixing up collisions with false positives.

As I said before:

> Remember – I said “basically nil when you use them in the way Apple was proposing (two separate hashes; several matches required)”, and not that a single hash function wouldn’t ever produce a single collision for a single image. Do you really disagree with that?

The relevant false positive is when an account gets flagged by Apple. That’s when it actually matters; that is what we don’t want to happen; that is what Apple are describing as a one in a trillion chance. The rate of hash collisions is only important as an input to that larger system.

That “number of samples”? It’s not the number of images in the whole of iCloud. It’s the number of Apple accounts. It doesn’t matter what happens “approaching infinity”; it matters what happens in the range zero to eight billion. We have a known upper bound, and it is substantially less than infinity. It doesn’t matter if a false positive is guaranteed “approaching infinity”; what matters is the likelihood of a false positive for any Apple account. That is not guaranteed, and is actually extremely unlikely.
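The distinction between a single hash collision and a flagged account can be sketched numerically. This is a simplified model, not Apple's actual design: it assumes false matches are independent across images, that an account is flagged once it has at least `threshold` false matches, and it does not model the second, separate hash at all. The function name and every number below (images per account, per-image match rate, threshold, account count) are hypothetical and chosen only for illustration.

```python
import math

def account_flag_probability(n_images: int, p_match: float, threshold: int) -> float:
    """P(at least `threshold` false matches among n_images), binomial model.

    Assumes independent false matches with 0 < p_match < 1.
    """
    log_p = math.log(p_match)
    log_q = math.log1p(-p_match)
    total = 0.0
    for k in range(threshold, n_images + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k); log space avoids overflow
        # in the binomial coefficient and underflow in p^k.
        log_term = (math.lgamma(n_images + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_images - k + 1)
                    + k * log_p + (n_images - k) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

# All values are hypothetical, for illustration only.
N_ACCOUNTS = 8_000_000_000
per_account = account_flag_probability(n_images=10_000, p_match=1e-6, threshold=30)
# Probability that at least one of N_ACCOUNTS independent accounts is falsely flagged.
p_any_account = -math.expm1(N_ACCOUNTS * math.log1p(-per_account))
print(f"per-account flag probability: {per_account:.3e}")
print(f"P(any account flagged):       {p_any_account:.3e}")
```

The sketch only shows how a match threshold and a finite number of accounts change the arithmetic. The result is driven by the assumed inputs, especially `p_match` and the independence assumption, so it does not by itself settle what the real-world rate would be.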

Also, just checking – you do understand that iCloud is using perceptual hashes and that it’s iMessage that uses an image classifier, right? Perceptual hashes aren’t normally referred to as image classifiers and aren’t really doing the same sort of thing as OCR; with OCR you need to output a token even if the shape is uncertain, but that failure case doesn’t exist for perceptual hashes because the result is just that there isn’t a match. What would be a false positive in the OCR case would be a false negative in the perceptual hash case, which we don’t care about for the purpose of this discussion.