Abstract
Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional techniques relying on direct inter-document similarity computation are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, while very attractive computationally, can be unstable even to small perturbations of document content, which causes signature fragmentation. We focus on I-Match and present a randomization-based technique for increasing its signature stability, with the proposed method consistently outperforming traditional I-Match by as much as 40–60% in terms of the relative improvement in near-duplicate recall. Importantly, the large gains in detection accuracy come at the cost of only small increases in computational requirements. We also address the complementary problem of spurious matches, which is particularly important for I-Match when fingerprinting long documents. Our discussion is supported by experiments involving large web-page and email datasets.
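The abstract does not spell out the mechanics, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that an I-Match signature is a hash of the set of document terms that survive filtering by a lexicon, and that randomization means deriving several extra lexicons by randomly dropping a fraction of the original lexicon's terms, with two documents treated as near duplicates when any pair of corresponding signatures agrees. The function names, the whitespace tokenizer, the `drop_fraction` parameter, and the toy lexicon are all hypothetical.

```python
import hashlib
import random


def imatch_signature(text, lexicon):
    """Hash the sorted set of document terms that also appear in the lexicon."""
    terms = sorted(set(text.lower().split()) & set(lexicon))
    if not terms:
        return None  # no lexicon terms survive, so no usable signature
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, k, drop_fraction, seed=0):
    """Return the original lexicon plus k copies with a random fraction of terms removed."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)
    keep = len(ordered) - int(drop_fraction * len(ordered))
    extra = [frozenset(rng.sample(ordered, keep)) for _ in range(k)]
    return [frozenset(lexicon)] + extra


def is_near_duplicate(doc_a, doc_b, lexicons):
    """Assumed matching rule: any pair of signatures from the same lexicon agrees."""
    for lex in lexicons:
        sig_a = imatch_signature(doc_a, lex)
        if sig_a is not None and sig_a == imatch_signature(doc_b, lex):
            return True
    return False


# Toy illustration of signature fragmentation (hypothetical data).
lexicon = {"cheap", "offer", "pills", "click", "meds", "now"}
doc_a = "cheap pills offer click now"
doc_b = "cheap pills offer click now meds"  # one inserted word

# With the single original lexicon, the inserted word changes the signature.
single_match = is_near_duplicate(doc_a, doc_b, [frozenset(lexicon)])

# A perturbed lexicon that happens to omit "meds" makes the signatures agree.
perturbed = [frozenset(lexicon), frozenset(lexicon - {"meds"})]
multi_match = is_near_duplicate(doc_a, doc_b, perturbed)

lexs = randomized_lexicons(lexicon, k=4, drop_fraction=0.33)
```

In this toy case a single inserted word is enough to break the one-signature comparison, while a lexicon that omits that word restores the match. Each extra lexicon adds one more hash per document, which is consistent with the abstract's remark that the added computational cost is small.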
Cite this article
Kołcz, A., Chowdhury, A. Lexicon randomization for near-duplicate detection with I-Match. J Supercomput 45, 255–276 (2008). https://doi.org/10.1007/s11227-007-0171-z
DOI: https://doi.org/10.1007/s11227-007-0171-z