tutorial

Large-scale copy detection

Authors:
Xin Luna Dong, AT&T Labs-Research, Florham Park, NJ, USA
Divesh Srivastava, AT&T Labs-Research, Florham Park, NJ, USA

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, June 2011, Pages 1205–1208. https://doi.org/10.1145/1989323.1989454

Published: 12 June 2011

ABSTRACT

The Web has made a vast amount of useful information available in recent years. However, the same web technologies that let sources share their information also make it easy for sources to copy from each other, often publishing without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow.

The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, organized by the type of information considered: text, images, videos, software code, and structured data. This tutorial explores the similarities and differences between the techniques proposed for copy detection across these types of information. We also examine the computational challenges associated with large-scale copy detection, indicating how copying could be detected efficiently, and identify a range of open problems for the community.

Published in
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
June 2011
1364 pages
ISBN:9781450306614
DOI:10.1145/1989323
General Chair: Timos Sellis, IMIS/RC Athena
Program Chair: Renée J. Miller, University of Toronto
Publications Chairs: Anastasios Kementsietsidis, IBM T.J. Watson Research Center; Yannis Velegrakis, University of Trento
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
copy detection
data integration
Qualifiers
- tutorial
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%