Lexicon randomization for near-duplicate detection with I-Match

Kołcz, A.; Chowdhury, A.

doi:10.1007/s11227-007-0171-z

Published: 26 January 2008

Volume 45, pages 255–276, (2008)

The Journal of Supercomputing

A. Kołcz¹ &
A. Chowdhury²

Abstract

Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional techniques relying on direct inter-document similarity computation are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, while very attractive computationally, can be unstable even to small perturbations of document content, which causes signature fragmentation. We focus on I-Match and present a randomization-based technique for increasing its signature stability, with the proposed method consistently outperforming traditional I-Match by as much as 40–60% in terms of the relative improvement in near-duplicate recall. Importantly, the large gains in detection accuracy are offset by only small increases in computational requirements. We also address the complementary problem of spurious matches, which is particularly important for I-Match when fingerprinting long documents. Our discussion is supported by experiments involving large web-page and email datasets.
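
The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that an I-Match signature is a hash of the sorted set of document terms that pass a lexicon filter, and that lexicon randomization means computing extra signatures under randomly thinned copies of the lexicon and accepting a match under any of them. The function names, the whitespace tokenizer, the use of SHA-1, and the parameters `k`, `drop_fraction`, and `seed` are all hypothetical choices made for illustration, not details taken from the article.

```python
import hashlib
import random


def imatch_signature(text, lexicon):
    """Hash the sorted set of document terms that also appear in the lexicon."""
    # Assumption: lowercase whitespace tokenization stands in for real parsing.
    terms = sorted(set(text.lower().split()) & lexicon)
    if not terms:
        return None  # no lexicon terms survive, so there is no usable fingerprint
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, k=3, drop_fraction=0.3, seed=0):
    """Return the original lexicon plus k copies with a random share of terms removed."""
    rng = random.Random(seed)  # fixed seed so every document sees the same lexicons
    ordered = sorted(lexicon)
    keep = int(len(ordered) * (1 - drop_fraction))
    return [set(lexicon)] + [set(rng.sample(ordered, keep)) for _ in range(k)]


def signatures(text, lexicons):
    """Compute one signature per lexicon, in lexicon order."""
    return [imatch_signature(text, lex) for lex in lexicons]


def near_duplicates(text_a, text_b, lexicons):
    """Declare a match if the two signatures agree under at least one lexicon."""
    pairs = zip(signatures(text_a, lexicons), signatures(text_b, lexicons))
    return any(a is not None and a == b for a, b in pairs)
```

In this sketch, a single inserted word that happens to be in the full lexicon changes the one signature plain I-Match would compute, but it leaves unchanged any signature whose thinned lexicon omits that word. That is the sense in which extra lexicons can reduce signature fragmentation, at the cost of computing `k` additional hashes per document.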

Author information

Authors and Affiliations

Microsoft Live Labs, 1 Microsoft Way, Redmond, WA, 98052, USA
A. Kołcz
University of Bremen TZI, Bremen, Germany
A. Chowdhury

Corresponding author

Correspondence to A. Chowdhury.

About this article

Cite this article

Kołcz, A., Chowdhury, A. Lexicon randomization for near-duplicate detection with I-Match. J Supercomput 45, 255–276 (2008). https://doi.org/10.1007/s11227-007-0171-z

Received: 10 September 2007
Accepted: 27 December 2007
Published: 26 January 2008
Issue Date: September 2008
DOI: https://doi.org/10.1007/s11227-007-0171-z
