ABSTRACT
Code clone detection aims to find functionally similar code fragments, a task of growing importance in software engineering. Many code clone detection methods have been proposed, and among them tree-based methods can handle semantic code clones. However, the complexity of tree structures makes these methods difficult to scale to big code. In this paper, we design Amain, a scalable tree-based semantic code clone detector that builds Markov chain models. Specifically, we propose a novel method that transforms the original complex tree into simple Markov chains and measures the distances between all states in these chains. We then feed the resulting distance values into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets, Google Code Jam and BigCloneBench. Experimental results show that Amain outperforms nine state-of-the-art code clone detection tools (i.e., SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, FCCA, DeepSim, and SCDetector).
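The pipeline outlined in the abstract (tree to Markov chain, then per-state distances, then a classifier) can be illustrated with a minimal sketch. This is not Amain's implementation. It makes several assumptions that the abstract does not state: Python's built-in `ast` module stands in for a Java or C parser, each parent-to-child edge is treated as one transition between node types, and cosine distance is the only metric used. The names `transition_probs`, `cosine_distance`, and `state_distances` are hypothetical.

```python
import ast
import math
from collections import defaultdict


def transition_probs(source):
    """Build a first-order Markov chain over AST node types.

    Assumption: every parent -> child edge counts as one transition.
    Each row is normalized so its outgoing probabilities sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: n / total for nxt, n in row.items()}
    return probs


def cosine_distance(row_a, row_b):
    """Cosine distance between two sparse transition rows (illustrative metric)."""
    if not row_a and not row_b:
        return 0.0  # state absent from both fragments: no difference
    if not row_a or not row_b:
        return 1.0  # state present in only one fragment
    keys = set(row_a) | set(row_b)
    dot = sum(row_a.get(k, 0.0) * row_b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def state_distances(source_a, source_b, states):
    """Return one distance per state: a fixed-length feature vector for a code pair."""
    chain_a = transition_probs(source_a)
    chain_b = transition_probs(source_b)
    return [
        cosine_distance(chain_a.get(s, {}), chain_b.get(s, {}))
        for s in states
    ]


if __name__ == "__main__":
    loop_sum = "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
    builtin_sum = "def g(xs):\n    return sum(xs)\n"
    # The state list is a hypothetical fixed vocabulary of node types.
    print(state_distances(loop_sum, builtin_sum, ["FunctionDef", "For", "Return"]))
```

Because the state list is fixed, every code pair yields a vector of the same length, so the vectors could be passed with clone/non-clone labels to an off-the-shelf classifier such as one from scikit-learn. Which classifier and which distance measures Amain actually uses is not stated in the abstract.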
Index Terms
- Detecting Semantic Code Clones by Building AST-based Markov Chains Model