ABSTRACT
During software development, developers often reuse existing code, introducing code clones, to improve programming productivity. Because clones can hinder software maintenance and evolution, many techniques have been proposed to detect them. Existing approaches mainly detect clones written in the same programming language. However, it is common to implement the same functionality in different programming languages to support various platforms. In this paper, we propose a new approach named C4, a <u>C</u>ontrastive <u>C</u>ross-language <u>C</u>ode <u>C</u>lone detection model, which detects cross-language clones effectively using learned representations. C4 exploits the pre-trained model CodeBERT to convert programs in different languages into high-dimensional vector representations. In addition, we fine-tune C4 with a contrastive learning objective that effectively distinguishes clone pairs from non-clone pairs. To evaluate the effectiveness of our approach, we conduct extensive experiments on the dataset proposed by CLCDSA. Experimental results show that C4 achieves scores of 0.94, 0.90, and 0.92 in terms of precision, recall, and F-measure, respectively, and substantially outperforms the state-of-the-art baselines.
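The contrastive objective described above pulls clone-pair embeddings together while pushing non-clone pairs apart. A minimal sketch of such an objective is shown below; this is an illustrative InfoNCE-style loss over pre-computed embedding vectors, not the paper's actual implementation, and the function names and temperature value are assumptions for demonstration only:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: make the clone pair (anchor, positive)
    more similar than the anchor is to any non-clone negative."""
    sims = [cosine_sim(anchor, positive)] + [cosine_sim(anchor, n) for n in negatives]
    logits = np.array(sims) / temperature
    # softmax over (positive + negatives), numerically stabilized
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # loss is -log p(positive); lower when the clone pair is closer
    return -np.log(probs[0])
```

In practice the anchor, positive, and negative vectors would be CodeBERT embeddings of code fragments in different languages; here any fixed-length vectors suffice to show that the loss decreases as the clone pair moves closer in embedding space.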
- B.S. Baker. 1995. On finding duplication and near-duplication in large software systems. In Proceedings of 2nd Working Conference on Reverse Engineering. 86--95.
- Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo. 2007. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33, 9 (2007), 577--591.
- Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709 [cs.LG]
- Xiao Cheng, Zhiming Peng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao. 2016. Mining revision histories to detect cross-language clones without intermediates. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 696--701.
- Xiao Cheng, Zhiming Peng, Lingxiao Jiang, Hao Zhong, Haibo Yu, and Jianjun Zhao. 2017. CLCMiner: Detecting Cross-Language Clones without Intermediates. IEICE Transactions on Information and Systems E100.D, 2 (2017), 273--284.
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs.CL]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv:2002.08155 [cs.CL]
- Nicholas Frosst, Nicolas Papernot, and Geoffrey Hinton. 2019. Analyzing and Improving Representations with the Soft Nearest Neighbor Loss. arXiv:1902.01889 [stat.ML]
- Zhipeng Gao, Xin Xia, David Lo, John Grundy, and Thomas Zimmermann. 2021. Automating the removal of obsolete TODO comments. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 218--229.
- Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2. IEEE, 1735--1742.
- Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones. In 29th International Conference on Software Engineering (ICSE'07). 96--105.
- T. Kamiya, S. Kusumoto, and K. Inoue. 2002. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654--670.
- Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2021. Supervised Contrastive Learning. arXiv:2004.11362 [cs.LG]
- Raghavan Komondoor and Susan Horwitz. 2001. Using slicing to identify duplication in source code. In International Static Analysis Symposium. Springer, 40--56.
- Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara G. Ryder. 2017. CCLearner: A Deep Learning-Based Clone Detection Approach. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 249--260.
- Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 473--485.
- Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. CoRR abs/2001.08210 (2020). arXiv:2001.08210
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692
- Nikita Mehrotra, Navdha Agarwal, Piyush Gupta, Saket Anand, David Lo, and Rahul Purandare. 2020. Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks. arXiv:2011.11228 [cs.SE]
- Kawser Wazed Nafi, Tonny Shekha Kar, Banani Roy, Chanchal K Roy, and Kevin A Schneider. 2019. CLCDSA: Cross Language Code Clone Detection using Syntactical Features and API Documentation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1026--1037.
- Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, and Jie Zhou. 2020. ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning. CoRR abs/2012.15022 (2020). arXiv:2012.15022
- XiPeng Qiu, TianXiang Sun, YiGe Xu, YunFan Shao, Ning Dai, and XuanJing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences 63, 10 (Sep 2020), 1872--1897.
- Gordana Rakić. 2015. Extendable and adaptable framework for input language independent static analysis. Ph.D. Dissertation. University of Novi Sad (Serbia).
- Chanchal K Roy, Minhaz F Zibran, and Rainer Koschke. 2014. The vision of software clone management: Past, present, and future (keynote paper). In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 18--33.
- Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC. In Proceedings of the 38th International Conference on Software Engineering (May 2016).
- Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2015).
- Kihyuk Sohn. 2016. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc.
- Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. 2014. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 476--480.
- Jeffrey Svajlenko and Chanchal K. Roy. 2015. Evaluating clone detection tools with BigCloneBench. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). 131--140.
- Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 [cs.LG]
- Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.
- Tijana Vislavski, Gordana Rakić, Nicolás Cardozo, and Zoran Budimac. 2018. LICCA: A tool for cross-language clone detection. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). 512--516.
- Christian Wagner. 2014. Model-Driven Software Migration: A Methodology: Reengineering, Recovery and Modernization of Legacy Systems. Springer Science & Business Media.
- Andrew Walker, Tom Černý, and Eunjee Song. 2020. Open-source tools and benchmarks for code-clone detection: past, present, and future trends. ACM SIGAPP Applied Computing Review 19 (01 2020), 28--39.
- Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian Wu, Wei Zeng, Ge Li, Wen Gao, and Haifeng Wang. 2021. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. CoRR abs/2112.12731 (2021). arXiv:2112.12731
- Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. arXiv:2002.08653 [cs.SE]
- Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In IJCAI.
- Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, and Songlin Hu. 2021. ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding. CoRR abs/2109.04380 (2021). arXiv:2109.04380
- Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. 2018. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. arXiv:1805.01978 [cs.CV]
- Dejiao Zhang, Feng Nan, Xiaokai Wei, Shang-Wen Li, Henghui Zhu, Kathleen R. McKeown, Ramesh Nallapati, Andrew O. Arnold, and Bing Xiang. 2021. Supporting Clustering with Contrastive Learning. CoRR abs/2103.12953 (2021). arXiv:2103.12953
- Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A Novel Neural Source Code Representation Based on Abstract Syntax Tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 783--794.