next up previous
Next: Fingerprint Storage Up: Fingerprinting Previous: Selection Strategy

Limitations of Fixed Length Fingerprints

An important measure of the reliability of a fingerprinting scheme is how closely its results correlate with those from full fingerprinting. For the fixed length selective fingerprinting approach we have chosen, the correlation is good for documents of similar size, but can become problematic for documents of significantly different size. Specifically, we can show that for documents of identical size, the expected match ratios for fixed length selective fingerprinting are identical to those for full fingerprinting. However for documents of different size, the results can vary by a ratio as high as the ratio of sizes of the two documents. To illustrate the problem, consider the extreme case of matching a document of 1000 words against a document of 100,000 words, and suppose that the smaller document appears once in the larger document. Now if the stored fingerprint of the larger document has size 100, then on average there will be one substring for every 1000 word piece of the larger document. In other words, the stored fingerprint of the larger document will have about one substring in common with the smaller document (i.e. about a 1% match ratio), compared with the 100% match given by full fingerprinting. We are currently investigating ways to address this issue, including using variable sized fingerprints and flagging low match ratios as significant if document sizes vary significantly.



Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996