A fast document copy detection model

  • Focus
  • Soft Computing

Abstract

Measuring text similarity is a common problem in Information Retrieval, Text Mining, Web Mining, Text Classification/Clustering, and Document Copy Detection. The most popular approach is the word-frequency-based scheme, which represents a document as a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures on such vectors, but they are symmetric and therefore cannot detect subset copies. In this paper we present the concepts of the asymmetric similarity model and the heavy frequency vector (HFV). The former detects subset copies well, and the latter saves considerable resources and CPU time. We have developed two new asymmetric measures: the heavy frequency vector model (HFM) and the heavy inclusion proportion model (HIPM). HFM and HIPM are derived from the cosine function and the proportion function by combining the asymmetric similarity concept with the HFV. The HFV truncates the original full frequency vector to a short vector, and its parameter can be adjusted to balance the model’s performance. Several experiments illustrate the behavior of the asymmetric similarity and HFV models.
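
The abstract does not give the formulas for HFM or HIPM, so the following is only a minimal sketch of the two underlying ideas, not the authors' definitions. It assumes a simple inclusion ratio as the asymmetric measure and a keep-the-top-k rule as the truncation; the function names `frequency_vector`, `heavy_frequency_vector`, `cosine`, and `inclusion`, and the parameter `k`, are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def frequency_vector(text):
    """Word-frequency vector of a document, stored as a Counter."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def heavy_frequency_vector(text, k):
    """Keep only the k most frequent words (assumed truncation rule)."""
    return Counter(dict(frequency_vector(text).most_common(k)))


def cosine(a, b):
    """Symmetric measure: cosine of the angle between two frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inclusion(a, b):
    """Asymmetric measure (assumed form): share of a's word mass found in b."""
    total = sum(a.values())
    covered = sum(min(a[w], b[w]) for w in a)
    return covered / total if total else 0.0


part = frequency_vector("copy detection finds copied text")
whole = frequency_vector(
    "copy detection finds copied text among many other unrelated words"
)

print(cosine(part, whole), cosine(whole, part))  # identical in both directions
print(inclusion(part, whole))  # 1.0: part is fully contained in whole
print(inclusion(whole, part))  # 0.5: only half of whole appears in part
```

In this sketch the cosine score is the same whichever document comes first, so it cannot say which document contains the other, while the inclusion ratio differs by direction. Passing vectors built with `heavy_frequency_vector` instead of `frequency_vector` shortens the comparison at the cost of ignoring rare words.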

Acknowledgements

Our study is supported by the National Natural Science Foundation of China (NSFC, ID: 60173058) and the Xi’an Jiaotong University Science Research Fund (ID: 573031).

Author information

Corresponding author

Correspondence to Jun-Peng Bao.

About this article

Cite this article

Bao, JP., Shen, JY., Liu, HY. et al. A fast document copy detection model. Soft Comput 10, 41–46 (2006). https://doi.org/10.1007/s00500-005-0463-2

  • DOI: https://doi.org/10.1007/s00500-005-0463-2
