tutorial

Large-scale copy detection

Published: 12 June 2011

ABSTRACT

In recent years, the Web has made a vast amount of useful information available. However, the same web technologies that let sources share their information also make it easy for sources to copy from each other, often publishing without proper attribution. Understanding the copying relationships between sources has many benefits: it helps data providers protect their rights, improves various aspects of data integration, and facilitates in-depth analysis of information flow.

The importance of copy detection has led to a substantial body of research across many disciplines of Computer Science, each focused on a particular type of information: text, images, videos, software code, or structured data. This tutorial explores the similarities and differences between the copy detection techniques proposed for these different types of information. We also examine the computational challenges of large-scale copy detection, indicate how copies could be detected efficiently, and identify a range of open problems for the community.

          • Published in

            SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
            June 2011
            1364 pages
            ISBN: 9781450306614
            DOI: 10.1145/1989323

            Copyright © 2011 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            Overall Acceptance Rate: 785 of 4,003 submissions, 20%
