DOI: 10.1145/2934732.2934738
research-article

Detecting Source Code Re-Use with Ensemble Models

Published: 14 June 2016

ABSTRACT

Source code re-use has usually been approached from a compiler perspective. By treating source code as a piece of text, we can apply natural language techniques to detect source code re-use. This paper describes the use of ensemble models for the task of source code re-use detection. Ensembles of Information Retrieval (IR) models are constructed using common classifiers. The IR-inspired models are compared with the ensembles on the C and Java programming languages. The ensemble classifiers show promising results for detecting source code re-use.
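As a rough illustration of the idea in the abstract, the sketch below treats two code fragments as plain text, computes a few IR-style similarity scores, and combines them. It is not the authors' method: the three scorers, the function names, and the threshold values are all assumptions chosen for the example, and a fixed-threshold majority vote stands in for the trained classifiers the paper uses to build its ensembles.

```python
import math
import re
from collections import Counter


def char_ngrams(code: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and removing whitespace."""
    text = re.sub(r"\s+", "", code.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tokens(code: str) -> list[str]:
    """Extract identifier-like tokens (keywords and names)."""
    return re.findall(r"[A-Za-z_]\w*", code)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_scores(x: str, y: str) -> list[float]:
    """Three hypothetical base models, each scoring the pair in [0, 1]."""
    return [
        cosine(char_ngrams(x), char_ngrams(y)),            # character 3-grams
        cosine(Counter(tokens(x)), Counter(tokens(y))),    # token frequencies
        jaccard(set(tokens(x)), set(tokens(y))),           # token sets
    ]


def vote_reused(scores, thresholds=(0.8, 0.8, 0.6)) -> bool:
    """Majority vote over per-model thresholds (illustrative values only)."""
    votes = sum(score >= t for score, t in zip(scores, thresholds))
    return votes * 2 > len(scores)


if __name__ == "__main__":
    original = "int sum(int a, int b) { return a + b; }"
    renamed = "int sum(int x, int y) { return x + y; }"
    print(similarity_scores(original, renamed))
    print(vote_reused(similarity_scores(original, renamed)))
```

In a setup closer to the one the abstract describes, the score vector for each pair would be fed as features to a trained classifier rather than compared against hand-picked thresholds.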


  • Published in

    CERI '16: Proceedings of the 4th Spanish Conference on Information Retrieval
    June 2016
    146 pages
    ISBN: 9781450341417
    DOI: 10.1145/2934732

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    CERI '16 Paper Acceptance Rate: 18 of 27 submissions, 67%
    Overall Acceptance Rate: 36 of 51 submissions, 71%
