DOI: 10.1145/3551349.3560426
research-article

Detecting Semantic Code Clones by Building AST-based Markov Chains Model

Published: 05 January 2023

ABSTRACT

Code clone detection aims to find functionally similar code fragments, a task that is becoming increasingly important in software engineering. Many code clone detection methods have been proposed, among which tree-based methods are able to handle semantic code clones. However, these methods are difficult to scale to big code because of the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector that builds Markov chain models. Specifically, we propose a novel method that transforms the original complex tree into simple Markov chains and measures the distances of all states in these chains. After obtaining all distance values, we feed them into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets, namely Google Code Jam and BigCloneBench. Experimental results show that Amain is superior to nine state-of-the-art code clone detection tools (i.e., SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, FCCA, DeepSim, and SCDetector).
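
The pipeline described in the abstract (tree, then Markov chain, then per-state distances) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that states are AST node types, that transitions run from parent to child, and that Euclidean distance is the measure; the abstract specifies none of these. It also parses Python with the standard `ast` module purely for self-containment, and the function names `transition_matrix` and `state_distances` are hypothetical.

```python
import ast
import math
from collections import Counter, defaultdict


def transition_matrix(source: str) -> dict:
    """Row-normalized parent -> child node-type transition probabilities.

    Assumption: each AST node type is one Markov state.
    """
    counts = defaultdict(Counter)
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


def state_distances(a: dict, b: dict) -> dict:
    """Euclidean distance between the two chains' rows, one value per state.

    Assumption: a state missing from one chain is treated as an all-zero row.
    """
    distances = {}
    for state in sorted(set(a) | set(b)):
        row_a, row_b = a.get(state, {}), b.get(state, {})
        targets = set(row_a) | set(row_b)
        distances[state] = math.sqrt(
            sum((row_a.get(t, 0.0) - row_b.get(t, 0.0)) ** 2 for t in targets)
        )
    return distances


# Two fragments that differ only in identifier names.
loop_sum = "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
renamed = "def g(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

features = state_distances(transition_matrix(loop_sum), transition_matrix(renamed))
```

In this sketch, `features` holds one distance per state, and every value is zero for the two fragments above because renaming identifiers does not change node types. A vector of such distances, computed for labeled clone and non-clone pairs, is the kind of input that could then be passed to a machine learning classifier.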


Published in

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN: 9781450394758
DOI: 10.1145/3551349

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 82 of 337 submissions, 24%
