ABSTRACT
The Web has made a vast amount of useful information available in recent years. However, the same web technologies that let sources share their information have also made it easy for sources to copy from each other, often publishing without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow.
The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, with techniques differing by the type of information considered: text, images, videos, software code, or structured data. This tutorial explores the similarities and differences between the techniques proposed for copy detection across these types of information. We also examine the computational challenges associated with large-scale copy detection, indicate how copies could be detected efficiently, and identify a range of open problems for the community.
Index Terms
- Large-scale copy detection