Abstract
Measuring text similarity is a common problem in information retrieval, text mining, Web mining, text classification and clustering, and document copy detection. The most popular approach is the word-frequency-based scheme, which represents a document by a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures for such vectors. However, they are symmetric measures and therefore cannot detect subset copies, where one document is contained in another. In this paper we present the concepts of an asymmetric similarity model and the heavy frequency vector (HFV). The former detects subset copies well, and the latter saves considerable resources and CPU time. We have developed two new asymmetric measures: the heavy frequency vector model (HFM) and the heavy inclusion proportion model (HIPM). HFM and HIPM are derived from the cosine function and the proportion function by combining the asymmetric similarity concept with the HFV. The HFV truncates the original full frequency vector to a short vector. The HFV parameter can be adjusted to balance the model’s performance. Several experiments illustrate the behaviour of the asymmetric similarity and HFV models.
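The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's HFM or HIPM definition, which the abstract does not give. The function names, the whitespace tokenizer, the "keep the k most frequent words" truncation rule, and the inclusion formula (the share of one document's word occurrences that also appear in the other) are all assumptions made for illustration only.

```python
import math
from collections import Counter


def frequency_vector(text):
    """Map each lowercase whitespace-separated token to its count."""
    return Counter(text.lower().split())


def heavy_frequency_vector(freq, k):
    """Keep only the k most frequent words (hypothetical truncation rule)."""
    return Counter(dict(freq.most_common(k)))


def cosine(a, b):
    """Symmetric cosine similarity of two frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inclusion(a, b):
    """Asymmetric score (assumed form): share of a's word occurrences found in b."""
    total = sum(a.values())
    covered = sum(min(count, b[word]) for word, count in a.items())
    return covered / total if total else 0.0


# A short document that is fully contained in a longer one (a subset copy).
part = frequency_vector("the quick brown fox")
whole = frequency_vector(
    "the quick brown fox jumps over the lazy dog near the river bank"
)

print(cosine(part, whole))     # same value in both directions, below 1
print(inclusion(part, whole))  # 1.0: every word of `part` occurs in `whole`
print(inclusion(whole, part))  # much lower: most of `whole` is not in `part`
print(heavy_frequency_vector(whole, 1))  # truncated vector with one heavy word
```

In this sketch the cosine score cannot tell which document contains the other, whereas the two inclusion scores differ, which is the property an asymmetric measure needs for subset copies. Truncating each vector before scoring reduces the number of terms compared, at the cost of ignoring rare words.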
Acknowledgements
Our study is supported by the National Natural Science Foundation of China (NSFC, ID: 60173058) and the Xi’an Jiaotong University Science Research Fund (ID: 573031).
Cite this article
Bao, JP., Shen, JY., Liu, HY. et al. A fast document copy detection model. Soft Comput 10, 41–46 (2006). https://doi.org/10.1007/s00500-005-0463-2
DOI: https://doi.org/10.1007/s00500-005-0463-2