Efficient Near-duplicate Detection and Sub-image Retrieval
by Yan Ke, Rahul Sukthankar,
and Larry Huston
Abstract:
We introduce a system for near-duplicate detection and sub-image retrieval. Such a system
is
useful for finding copyright violations and detecting forged images.
We define near-duplicates as images altered with common transformations such
as changing contrast, saturation, scaling, cropping, framing, etc.
Our system builds a parts-based representation of images using
distinctive local descriptors which give high quality
matches even under severe transformations. To cope with the
large number of features extracted from the images, we employ
locality-sensitive hashing to index the local descriptors.
This allows us to make approximate similarity queries
that only examine a small fraction of the database. Although
locality-sensitive hashing has excellent theoretical performance
properties, a standard implementation would still be unacceptably
slow for this application. We show that, by optimizing layout and
access to the index data on disk, we can efficiently query indices
containing millions of keypoints. Our system achieves near-perfect
accuracy (100% precision at 99.85% recall) on the tests used by
Meng et al., and consistently strong results
on our own, significantly more challenging experiments. Query
times are interactive even for collections of thousands of images.
Y. Ke, R. Sukthankar, and L. Huston. ACM Multimedia,
2004. [PDF 560KB]
Datasets used in the experiments are available
here.
Links:
PCA-SIFT
Page maintained by Yan Ke < y k e @ c m u . e d u >
July 2004