ABSTRACT
Source code re-use has usually been approached from a compiler perspective. By treating source code as a piece of text, we can apply natural language processing techniques to detect re-use. This paper describes the use of ensemble models for the task of source code re-use detection. Ensembles of Information Retrieval (IR) models are constructed using common classifiers. The IR-inspired models are compared with the ensembles on source code written in the C and Java programming languages. The ensemble classifiers show promising results for detecting source code re-use.
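To illustrate the general idea of combining text-based similarity models into an ensemble, the following is a minimal sketch in Python. It is not the authors' implementation: the three similarity functions, the midpoint-threshold "classifier" fitted per model, the majority vote, and the toy training pairs in TRAIN are all assumptions chosen only to keep the example small and self-contained.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-features Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def char_trigram_sim(x, y):
    """Text-style model 1: cosine over character trigram counts."""
    def grams(s):
        return Counter(s[i:i + 3] for i in range(len(s) - 2))
    return cosine(grams(x), grams(y))

def token_cosine_sim(x, y):
    """Text-style model 2: cosine over whitespace-token counts."""
    return cosine(Counter(x.split()), Counter(y.split()))

def token_jaccard_sim(x, y):
    """Text-style model 3: Jaccard overlap of whitespace-token sets."""
    a, b = set(x.split()), set(y.split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical ensemble members; the paper's actual models may differ.
MODELS = [char_trigram_sim, token_cosine_sim, token_jaccard_sim]

def fit_thresholds(pairs):
    """Fit one threshold per model: the midpoint between the mean score
    of re-used pairs and the mean score of unrelated pairs."""
    thresholds = []
    for model in MODELS:
        pos = [model(a, b) for a, b, label in pairs if label]
        neg = [model(a, b) for a, b, label in pairs if not label]
        thresholds.append((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)
    return thresholds

def predict(thresholds, a, b):
    """Majority vote: each model votes 're-use' if its score exceeds
    its fitted threshold."""
    votes = sum(model(a, b) > t for model, t in zip(MODELS, thresholds))
    return votes * 2 > len(MODELS)

# Toy labelled pairs (1 = re-use, 0 = unrelated), invented for illustration.
TRAIN = [
    ("int sum(int a, int b) { return a + b; }",
     "int add(int x, int y) { return x + y; }", 1),
    ("for (int i = 0; i < n; i++) total += v[i];",
     "for (int j = 0; j < n; j++) acc += v[j];", 1),
    ("int sum(int a, int b) { return a + b; }",
     "while (p != NULL) { p = p->next; }", 0),
    ('printf("hello");',
     "for (int i = 0; i < n; i++) total += v[i];", 0),
]

thresholds = fit_thresholds(TRAIN)
print(predict(thresholds,
              "int mul(int a, int b) { return a * b; }",
              "int times(int p, int q) { return p * q; }"))
```

In this sketch a pair with renamed identifiers is still flagged because the token-based models agree, even though no single score is close to 1. A trained classifier over the same scores would be the natural next step, but that choice is left open here.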