> You can set the FP rate arbitrarily low, but they will eventually happen. "Basically nil" is not the same as "actually nil".
I know “basically nil” is not the same as “actually nil”; that’s why I said “basically nil” and not just “nil”. It is “nil for all practical purposes”. This is a good enough standard for software the whole world uses all day every day to depend upon – otherwise hashes would be pretty useless.
“Eventually” needs clarification here. Of course, if you spend from now until the heat death of the universe iterating through the search space you could get there. But that doesn’t mean that in practical use, such a collision is guaranteed as you claim. This system could have run its entire lifetime without a collision. You are mistaking “if you cover the entire search space” for what happens in the real world.
Do you understand that there is not one, but two hashes, and that they both need to collide simultaneously for each image? And do you understand that this has to happen not just a single time, but for several images on your system? That’s why I specifically said “basically nil when you use them in the way Apple was proposing” and described how it worked.
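To make the "two hashes, several matches" point concrete, here is a rough sketch of how the account-level probability falls as the threshold rises. Everything in it is an assumption for illustration: the per-hash collision rates are made up, the two hashes are treated as colliding independently (which may not hold for perceptual hashes), each image is treated as an independent trial, and the function names are mine, not anything from Apple.

```python
import math

def log_binom_pmf(n, k, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def account_flag_probability(n_images, p_image, threshold):
    """P(at least `threshold` of `n_images` innocent images falsely match)."""
    if threshold <= 0:
        return 1.0
    if threshold > n_images or p_image <= 0.0:
        return 0.0
    if p_image >= 1.0:
        return 1.0
    total = 0.0
    # Sum the upper tail directly; 1 - CDF would lose all precision here.
    for k in range(threshold, n_images + 1):
        term = math.exp(log_binom_pmf(n_images, k, p_image))
        total += term
        if term < total * 1e-18:  # remaining terms are negligible
            break
    return min(total, 1.0)

# Hypothetical numbers, chosen only to show the shape of the effect.
p_hash_a = 1e-3                  # assumed collision rate of the first hash
p_hash_b = 1e-3                  # assumed collision rate of the second hash
p_image = p_hash_a * p_hash_b    # both must collide; assumes independence
n_images = 10_000                # assumed size of one photo library

for threshold in (1, 5, 30):
    print(threshold, account_flag_probability(n_images, p_image, threshold))
```

The exact figures depend entirely on the assumed inputs; the point is only that each extra required match multiplies the account-level probability down by a large factor.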
Everybody with even a basic grasp of what hashes are understands that collisions are certain when considering the entire search space. But that doesn’t correspond with what happens in practical terms, which is why hashing is actually useful in general, and I think people are getting too fixated on “hashes can have collisions” to notice the other properties of the system.
Apple can tune the false positive rate by varying the number of matches necessary to flag an account. They say they chose a threshold that would result in a false positive rate of one in a trillion. Are you saying they got their maths wrong or were lying, and that’s actually impossible? Because there are only eight billion people on the planet and most of them don’t have Apple accounts, so if one in a trillion is accurate, then it seems entirely possible for this system to have run indefinitely without a single false positive.
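As a back-of-the-envelope check, here is the arithmetic under two assumptions of mine: that the one-in-a-trillion figure is an independent per-account probability, and that every person on the planet has an account (a deliberately generous upper bound). If the figure is really a rate per account per year, the same calculation applies to each year of operation.

```python
import math

def prob_no_false_flag(n_accounts, p_per_account):
    """P(no account is ever falsely flagged), assuming independent accounts."""
    # log1p keeps precision when p_per_account is tiny.
    return math.exp(n_accounts * math.log1p(-p_per_account))

def expected_false_flags(n_accounts, p_per_account):
    """Expected number of falsely flagged accounts."""
    return n_accounts * p_per_account

n_accounts = 8_000_000_000   # upper bound: everyone on the planet
p_per_account = 1e-12        # the claimed one-in-a-trillion rate

print("expected false flags:", expected_false_flags(n_accounts, p_per_account))
print("P(no false flag at all):", prob_no_false_flag(n_accounts, p_per_account))
```

With these inputs the expected number of false flags is 0.008, so the chance of seeing none at all is a little over 99%.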
Remember – I said “basically nil when you use them in the way Apple was proposing (two separate hashes; several matches required)”, and not that a single hash function wouldn’t ever produce a single collision for a single image. Do you really disagree with that?
We're discussing mathematics, so it has a precise meaning: as the number of samples approaches infinity, the probability of observing a false-positive approaches 1.
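Written out, assuming each sample is an independent trial with some fixed false-positive probability $p > 0$ (the notation is mine, for illustration):

```latex
P(\text{at least one FP in } n \text{ samples}) = 1 - (1 - p)^{n},
\qquad
\lim_{n \to \infty} \bigl[ 1 - (1 - p)^{n} \bigr] = 1 .
```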
As I mentioned in another comment, your claim is both formally and practically incorrect. It is formally incorrect for the reason above. Given enough samples, a false-positive must occur.
It is practically incorrect because there exists no image-classification system whose FP rate is small enough that multiple FPs won't be observed daily, given the number of samples at play [0].
[0] Except, of course, for the trivial case in which you allow an exorbitant number of misses, but surely this isn't what you're arguing...
Actually, we're talking about both, and you are wrong on both counts.
If you weren't, you would be able to point to an existing system that has demonstrated the capability of classifying an image set on the order of iCloud's, without producing a false-positive. You can't, because such a system doesn't exist.
> you would be able to point to an existing system that has demonstrated the capability of classifying an image set on the order of iCloud's, without producing a false-positive.
This is not relevant to what I was saying; it sounds like you might be mixing up collisions with false positives.
As I said before:
> Remember – I said “basically nil when you use them in the way Apple was proposing (two separate hashes; several matches required)”, and not that a single hash function wouldn’t ever produce a single collision for a single image. Do you really disagree with that?
The relevant false positive is when an account gets flagged by Apple. That’s when it actually matters; that is what we don’t want to happen; that is what Apple are describing as a one in a trillion chance. The rate of hash collisions is only important as an input to that larger system.
That “number of samples”? It’s not the number of images in the whole of iCloud. It’s the number of Apple accounts. What happens “approaching infinity” doesn’t matter; what matters is what happens in the range zero to eight billion. We have a known upper bound, and it is substantially less than infinity. Even if a false positive is guaranteed “approaching infinity”, what matters is the likelihood of a false positive for any Apple account. That is not guaranteed, and is actually extremely unlikely.
Also, just checking – you do understand that iCloud is using perceptual hashes and that it’s iMessage that uses an image classifier, right? Perceptual hashes aren’t normally referred to as image classifiers and aren’t really doing the same sort of thing as OCR; with OCR you need to output a token even if the shape is uncertain, but that failure case doesn’t exist for perceptual hashes because the result is just that there isn’t a match. What would be a false positive in the OCR case would be a false negative in the perceptual hash case, which we don’t care about for the purpose of this discussion.