research-article

Detecting Source Code Re-Use with Ensemble Models

Authors:
Enrique Flores, Universitat Politècnica de València, Spain
Lidia Moreno, Universitat Politècnica de València, Spain
Paolo Rosso, Universitat Politècnica de València, Spain

CERI '16: Proceedings of the 4th Spanish Conference on Information Retrieval, June 2016, Article No. 16, Pages 1–7. https://doi.org/10.1145/2934732.2934738

Published: 14 June 2016

ABSTRACT

Source code re-use has usually been approached from a compiler perspective. By treating source code as a piece of text, we can instead apply natural language processing techniques to detect re-use. This paper describes the use of ensemble models for source code re-use detection. Ensembles of Information Retrieval (IR) models are built using common classifiers, and the individual IR-inspired models are compared with the ensembles on C and Java programs. The ensemble classifiers show promising results for detecting source code re-use.
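
As a rough illustration of the idea in the abstract (source code treated as text, with several text-similarity scores combined into one decision), the following minimal Python sketch may help. It is not the authors' system: the three similarity measures, the function names, the 0.5 thresholds, and the fixed majority vote are all assumptions chosen for brevity. In particular, the vote stands in for the trained classifiers mentioned in the abstract.

```python
from collections import Counter
import math
import re

# Hypothetical tokenizer: identifiers, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def char_ngrams(code: str, n: int = 3) -> Counter:
    """Frequency profile of character n-grams, ignoring whitespace and case."""
    text = re.sub(r"\s+", "", code.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def token_counts(code: str) -> Counter:
    """Frequency profile of lexical tokens."""
    return Counter(TOKEN_RE.findall(code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two frequency profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: Counter, b: Counter) -> float:
    """Jaccard overlap between the key sets of two profiles."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def similarity_features(x: str, y: str) -> list:
    """One score per text-similarity model for a pair of programs."""
    tx, ty = token_counts(x), token_counts(y)
    return [
        cosine(char_ngrams(x), char_ngrams(y)),  # character trigram model
        cosine(tx, ty),                          # bag-of-tokens model
        jaccard(tx, ty),                         # token-set overlap
    ]


def vote_reused(x: str, y: str, thresholds=(0.5, 0.5, 0.5)) -> bool:
    """Flag the pair as re-used when a strict majority of models exceed their threshold.

    Assumption: a fixed vote replaces a classifier trained on labelled pairs.
    """
    feats = similarity_features(x, y)
    votes = sum(f >= t for f, t in zip(feats, thresholds))
    return votes * 2 > len(feats)


if __name__ == "__main__":
    original = "int sum(int a, int b) { return a + b; }"
    renamed = "int add(int x, int y) { return x + y; }"
    print(similarity_features(original, renamed))
    print(vote_reused(original, renamed))
```

In a learned ensemble, the list returned by `similarity_features` would typically serve as the feature vector for a classifier trained on pairs labelled as re-used or not, rather than being compared against hand-picked thresholds.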

Published in

CERI '16: Proceedings of the 4th Spanish Conference on Information Retrieval
June 2016
146 pages
ISBN:9781450341417
DOI:10.1145/2934732

Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ensemble classifiers
re-use detection
source code re-use
source code re-use retrieval
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
CERI '16 Paper Acceptance Rate: 18 of 27 submissions, 67%. Overall Acceptance Rate: 36 of 51 submissions, 71%.