next up previous
Next: Document Prologs Up: Reducing False Positives Previous: Fingerprint Generation

Fingerprint Matching

 

Some substrings are very common in certain collections of documents. For example, in a technical report series, substrings generated from addresses, funding agencies acknowledgements and strings such as ``This work is the opinion of the author and does not necessarily represent the view of his employer or ...'' appear in many documents. Such substrings are difficult to recognize within the context of a single document (and so the technique described in the previous section does not detect them), but require the context of a collection of documents.

We use a frequency check during search to isolate these strings. When a search is performed and a particular (hashed) substring is looked up in the database, we check to see if this particular string appears in 4 or more documents. If it does, then we ignore it during the search. However, this raises a security issue: if one could repeatedly add copies of the one document to the database, then eventually all of the substrings of the document would be ignored, and the document would not generate a match. To address this situation, we cap the number of ignored substrings at 10. In Section 7, we show that this technique can reduce false negatives by between 15% and 85%, depending on which of the other checks in this section are deployed. We remark that the values of 4 and 10 deserve further evaluation using different sizes and classes of document sets. We also remark that this technique is an application of a very standard information processing idea: discount the significance of common features.



Nevin Heintze
Thu Oct 3 20:48:58 EDT 1996