DOI: 10.1145/3524610.3527911
Research Article

C4: Contrastive Cross-Language Code Clone Detection

Published: 20 October 2022

ABSTRACT

During software development, developers introduce code clones by reusing existing code to improve programming productivity. Given the detrimental effects of clones on software maintenance and evolution, many techniques have been proposed to detect them. Existing approaches, however, mainly target clones written in the same programming language, while it is common to implement the same functionality in different languages to support multiple platforms. In this paper, we propose a new approach named C4, a Contrastive Cross-language Code Clone detection model, which detects cross-language clones effectively through learned representations. C4 exploits the pre-trained model CodeBERT to convert programs in different languages into high-dimensional vector representations. In addition, we fine-tune C4 with a contrastive learning objective that enables the model to distinguish clone pairs from non-clone pairs. To evaluate the effectiveness of our approach, we conduct extensive experiments on the dataset proposed by CLCDSA. Experimental results show that C4 achieves 0.94 precision, 0.90 recall, and 0.92 F-measure, substantially outperforming the state-of-the-art baselines.
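The abstract describes a two-part pipeline: encode each program with CodeBERT, then fine-tune the encoder with a contrastive objective so that clone pairs sit close together in the embedding space while non-clones are pushed apart. The paper's own implementation is not reproduced on this page; the sketch below only illustrates the general idea, assuming the microsoft/codebert-base checkpoint from HuggingFace and a standard in-batch NT-Xent-style contrastive loss. The pooling strategy, loss formulation, and hyperparameters here are illustrative and may differ from C4's actual design.

```python
# Minimal sketch (not the authors' code): embed programs with CodeBERT and
# fine-tune with a contrastive objective over cross-language clone pairs.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(snippets):
    """Encode a batch of source snippets into L2-normalized vectors."""
    batch = tokenizer(snippets, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return F.normalize(hidden, dim=-1)

def contrastive_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive (NT-Xent-style) loss: row i of `anchors` and
    row i of `positives` form a clone pair; every other row in the batch
    serves as a negative."""
    logits = anchors @ positives.T / temperature  # scaled cosine similarities
    labels = torch.arange(anchors.size(0))        # diagonal entries are clones
    return F.cross_entropy(logits, labels)

# Illustrative training step on (Java, Python) clone pairs; a real run
# would iterate over the CLCDSA dataset with an optimizer and scheduler.
java_batch = ["int add(int a, int b) { return a + b; }"]
python_batch = ["def add(a, b):\n    return a + b"]
loss = contrastive_loss(embed(java_batch), embed(python_batch))
loss.backward()  # gradients flow back into the CodeBERT encoder
```

Treating every other sample in the batch as a negative avoids mining explicit non-clone pairs, which is the usual in-batch negative trick in contrastive representation learning; whether C4 uses exactly this scheme, mined hard negatives, or a supervised contrastive variant is detailed in the paper itself.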

References

  1. B. S. Baker. 1995. On finding duplication and near-duplication in large software systems. In Proceedings of the 2nd Working Conference on Reverse Engineering. 86–95.
  2. Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo. 2007. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33, 9 (2007), 577–591.
  3. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709 [cs.LG].
  4. Xiao Cheng, Zhiming Peng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao. 2016. Mining revision histories to detect cross-language clones without intermediates. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 696–701.
  5. Xiao Cheng, Zhiming Peng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao. 2017. CLCMiner: Detecting Cross-Language Clones without Intermediates. IEICE Transactions on Information and Systems E100.D, 2 (2017), 273–284.
  6. Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs.CL].
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL].
  8. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv:2002.08155 [cs.CL].
  9. Nicholas Frosst, Nicolas Papernot, and Geoffrey Hinton. 2019. Analyzing and Improving Representations with the Soft Nearest Neighbor Loss. arXiv:1902.01889 [stat.ML].
  10. Zhipeng Gao, Xin Xia, David Lo, John Grundy, and Thomas Zimmermann. 2021. Automating the removal of obsolete TODO comments. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 218–229.
  11. Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2. IEEE, 1735–1742.
  12. Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones. In 29th International Conference on Software Engineering (ICSE'07). 96–105.
  13. T. Kamiya, S. Kusumoto, and K. Inoue. 2002. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654–670.
  14. Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2021. Supervised Contrastive Learning. arXiv:2004.11362 [cs.LG].
  15. Raghavan Komondoor and Susan Horwitz. 2001. Using slicing to identify duplication in source code. In International Static Analysis Symposium. Springer, 40–56.
  16. Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara G. Ryder. 2017. CCLearner: A Deep Learning-Based Clone Detection Approach. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 249–260.
  17. Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 473–485.
  18. Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. arXiv:2001.08210.
  19. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  20. Nikita Mehrotra, Navdha Agarwal, Piyush Gupta, Saket Anand, David Lo, and Rahul Purandare. 2020. Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks. arXiv:2011.11228 [cs.SE].
  21. Kawser Wazed Nafi, Tonny Shekha Kar, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider. 2019. CLCDSA: Cross Language Code Clone Detection using Syntactical Features and API Documentation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1026–1037.
  22. Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, and Jie Zhou. 2020. ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning. arXiv:2012.15022.
  23. XiPeng Qiu, TianXiang Sun, YiGe Xu, YunFan Shao, Ning Dai, and XuanJing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences 63, 10 (2020), 1872–1897.
  24. Gordana Rakić. 2015. Extendable and adaptable framework for input language independent static analysis. Ph.D. Dissertation. University of Novi Sad (Serbia).
  25. Chanchal K. Roy, Minhaz F. Zibran, and Rainer Koschke. 2014. The vision of software clone management: Past, present, and future (keynote paper). In 2014 Software Evolution Week: IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 18–33.
  26. Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC: Scaling Code Clone Detection to Big-Code. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16).
  27. Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  28. Kihyuk Sohn. 2016. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Advances in Neural Information Processing Systems, Vol. 29. Curran Associates, Inc.
  29. Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy, and Mohammad Mamun Mia. 2014. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 476–480.
  30. Jeffrey Svajlenko and Chanchal K. Roy. 2015. Evaluating clone detection tools with BigCloneBench. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). 131–140.
  31. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 [cs.LG].
  32. Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
  33. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  34. Tijana Vislavski, Gordana Rakić, Nicolás Cardozo, and Zoran Budimac. 2018. LICCA: A tool for cross-language clone detection. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). 512–516.
  35. Christian Wagner. 2014. Model-Driven Software Migration: A Methodology. Reengineering, Recovery and Modernization of Legacy Systems. Springer Science & Business Media.
  36. Andrew Walker, Tom Černý, and Eunjee Song. 2020. Open-source tools and benchmarks for code-clone detection: past, present, and future trends. ACM SIGAPP Applied Computing Review 19 (2020), 28–39.
  37. Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian Wu, Wei Zeng, Ge Li, Wen Gao, and Haifeng Wang. 2021. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. arXiv:2112.12731.
  38. Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. arXiv:2002.08653 [cs.SE].
  39. Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In IJCAI.
  40. Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, and Songlin Hu. 2021. ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding. arXiv:2109.04380.
  41. Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. 2018. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. arXiv:1805.01978 [cs.CV].
  42. Dejiao Zhang, Feng Nan, Xiaokai Wei, Shang-Wen Li, Henghui Zhu, Kathleen R. McKeown, Ramesh Nallapati, Andrew O. Arnold, and Bing Xiang. 2021. Supporting Clustering with Contrastive Learning. arXiv:2103.12953.
  43. Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A Novel Neural Source Code Representation Based on Abstract Syntax Tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 783–794.

Published in

ICPC '22: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension
May 2022, 698 pages
ISBN: 9781450392983
DOI: 10.1145/3524610
Conference Chairs: Ayushi Rastogi, Rosalia Tufano. General Chair: Gabriele Bavota. Program Chairs: Venera Arnaoudova, Sonia Haiduc.

Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 20 October 2022
