Detecting Semantic Code Clones by Building AST-based Markov Chains Model

Authors:
Yueming Wu, Huazhong University of Science and Technology, China
Siyue Feng, Huazhong University of Science and Technology, China
Deqing Zou, Huazhong University of Science and Technology, China
Hai Jin, Huazhong University of Science and Technology, China

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, October 2022, Article No.: 34, Pages 1–13
https://doi.org/10.1145/3551349.3560426

Published: 05 January 2023

ABSTRACT

Code clone detection aims to find functionally similar code fragments and is becoming increasingly important in software engineering. Many code clone detection methods have been proposed, among which tree-based methods can handle semantic code clones. However, these methods are difficult to scale to big code because of the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector built on Markov chain models. Specifically, we propose a novel method that transforms the original complex tree into simple Markov chains and measures the distances of all states in these chains. After obtaining all distance values, we feed them into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets, Google Code Jam and BigCloneBench. Experimental results show that Amain is superior to nine state-of-the-art code clone detection tools (i.e., SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, FCCA, DeepSim, and SCDetector).
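
The pipeline outlined in the abstract (tree, then Markov chain, then per-state distances, then classifier features) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions that the abstract does not state: Python's built-in ast module stands in for a Java or C parser, states are AST node types, transition probabilities are estimated from parent-to-child edges, and the per-state distance is Euclidean. The names transition_matrix and state_distances are hypothetical.

```python
import ast
import math
from collections import Counter, defaultdict


def transition_matrix(source):
    """Estimate parent-type -> child-type transition probabilities from an AST.

    Assumption: each AST node type is one Markov state.
    """
    counts = defaultdict(Counter)
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1
    # Normalise each row so the outgoing probabilities of a state sum to 1.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


def state_distances(m1, m2):
    """Return one Euclidean distance per state between two transition matrices.

    Assumption: a state missing from one matrix is treated as an all-zero row.
    """
    features = {}
    for state in sorted(set(m1) | set(m2)):
        row1, row2 = m1.get(state, {}), m2.get(state, {})
        keys = set(row1) | set(row2)
        features[state] = math.sqrt(
            sum((row1.get(k, 0.0) - row2.get(k, 0.0)) ** 2 for k in keys)
        )
    return features


# Two fragments that differ only in identifier names.
clone_a = "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
clone_b = "def g(ys):\n    s = 0\n    for y in ys:\n        s += y\n    return s\n"

distances = state_distances(transition_matrix(clone_a), transition_matrix(clone_b))
print(distances)
```

Because the two fragments differ only in identifier names, their node-type transitions coincide and every per-state distance is zero. In a setup like the one the abstract describes, such distance values for labelled clone and non-clone pairs would form the feature vectors passed to a machine learning classifier; that training step is omitted here.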

Index Terms

Detecting Semantic Code Clones by Building AST-based Markov Chains Model
1. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Published in

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN:9781450394758
DOI:10.1145/3551349

Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Abstract Syntax Tree
Markov Chain
Semantic Code Clones
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 82 of 337 submissions, 24%