A fast document copy detection model

Bao, Jun-Peng; Shen, Jun-Yi; Liu, Hai-Yan; Liu, Xiao-Dong

doi:10.1007/s00500-005-0463-2

Published: 06 April 2005

Volume 10, pages 41–46, (2006)

Soft Computing

Jun-Peng Bao¹, Jun-Yi Shen¹, Hai-Yan Liu¹ & Xiao-Dong Liu¹


Abstract

Text similarity measurement is a common problem in Information Retrieval, Text Mining, Web Mining, Text Classification/Clustering, Document Copy Detection, etc. The most popular approach is the word-frequency-based scheme, which represents a document by a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures on such vectors. However, they are symmetric measures, so they cannot detect subset copies, in which one document is contained in another. In this paper we present the concepts of the asymmetric similarity model and the heavy frequency vector (HFV). The former detects subset copies well, and the latter saves considerable resources and CPU time. We have developed two new asymmetric measures: the heavy frequency vector model (HFM) and the heavy inclusion proportion model (HIPM). HFM and HIPM are derived from the cosine function and the proportion function by combining the asymmetric similarity concept with the HFV. The HFV truncates the original full frequency vector to a short vector. We can adjust the parameter of the HFV to balance the model’s performance. The paper illustrates the aspects of the asymmetric similarity and HFV models through several experiments.
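The contrast drawn in the abstract between symmetric and asymmetric measures can be illustrated with a minimal Python sketch. It is not the paper's HFM or HIPM, whose exact formulas are not given in this preview. The function names, the whitespace tokenizer, the choice of "keep the k most frequent words" as the truncation rule, and the inclusion formula are all assumptions made for illustration only.

```python
import math
from collections import Counter


def freq_vector(text):
    """Word frequency vector of a document (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def heavy_vector(vec, k):
    """Assumed truncation rule: keep only the k most frequent words."""
    return Counter(dict(vec.most_common(k)))


def cosine(a, b):
    """Symmetric measure: cosine(a, b) equals cosine(b, a)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inclusion(a, b):
    """Assumed asymmetric measure: share of a's word occurrences covered by b."""
    total = sum(a.values())
    shared = sum(min(count, b[w]) for w, count in a.items())
    return shared / total if total else 0.0


# A short passage and a longer document that contains it verbatim.
part = freq_vector("copy detection finds copied text")
whole = freq_vector(
    "copy detection finds copied text while clustering and retrieval "
    "and mining and classification use other similarity measures"
)

print("cosine(part, whole)   :", cosine(part, whole))
print("inclusion(part, whole):", inclusion(part, whole))
print("inclusion(whole, part):", inclusion(whole, part))
print("heavy vector (k=3)    :", heavy_vector(whole, 3))
```

In this sketch the cosine score is lowered by the extra words in the longer document, while the inclusion score of the short passage against the long one stays at its maximum, which is the behaviour needed to flag a subset copy. Truncating to a few heavy words shrinks the vectors being compared, with `k` playing the role of the adjustable parameter mentioned in the abstract.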

Acknowledgements

Our study is supported by the National Natural Science Foundation of China (NSFC, ID: 60173058) and the Xi’an Jiaotong University Science Research Fund (ID: 573031).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Xi’an Jiaotong University, Xi’an, 710049, People’s Republic of China
Jun-Peng Bao, Jun-Yi Shen, Hai-Yan Liu & Xiao-Dong Liu

Corresponding author

Correspondence to Jun-Peng Bao.

About this article

Cite this article

Bao, JP., Shen, JY., Liu, HY. et al. A fast document copy detection model. Soft Comput 10, 41–46 (2006). https://doi.org/10.1007/s00500-005-0463-2

Issue Date: January 2006
DOI: https://doi.org/10.1007/s00500-005-0463-2
