ABSTRACT
Code clone detection aims to find code fragments with similar functionality, a task that has become increasingly important in software engineering. Many approaches have been proposed to detect code clones. Among them, token-based methods are the most scalable, but they cannot handle semantic clones because they do not consider program semantics. To address this issue, researchers apply program analysis to distill program semantics into a graph representation and detect clones by matching the graphs. However, such approaches scale poorly because graph matching is typically time-consuming.
In this paper, we propose SCDetector, which combines the scalability of token-based methods with the accuracy of graph-based methods for software functional clone detection. Given the source code of a function, we first extract its control flow graph by static analysis. Instead of using traditional heavyweight graph matching, we treat the graph as a social network and apply social-network centrality analysis to compute the centrality of each basic block. We then assign that centrality to each token in the basic block and sum the centrality of the same token across different basic blocks. In this way, a graph is turned into a set of tokens annotated with graph details (i.e., centrality), which we call semantic tokens. Finally, these semantic tokens are fed into a Siamese neural network to train a code clone detector. We evaluate SCDetector on two large datasets of functionally similar code. Experimental results indicate that our system is superior to four state-of-the-art methods (i.e., SourcererCC, Deckard, RtvNN, and ASTNN) and that the time cost of SCDetector is 14 times less than that of a traditional graph-based method (i.e., CCSharp) when detecting semantic clones.
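The step from control flow graph to semantic tokens can be illustrated with a minimal sketch. It is not the SCDetector implementation: the abstract does not say which centrality measure is used, so degree centrality is assumed here, and the function names `degree_centrality` and `semantic_tokens`, the toy graph, and the choice to add a block's centrality once per token occurrence are all hypothetical.

```python
from collections import defaultdict


def degree_centrality(cfg):
    """Degree centrality of each basic block (assumed measure).

    cfg maps a basic-block id to the list of its successor block ids.
    The score is (in-degree + out-degree) / (number of blocks - 1).
    """
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)
    degree = {n: 0 for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            degree[src] += 1  # outgoing edge of src
            degree[dst] += 1  # incoming edge of dst
    scale = len(nodes) - 1
    return {n: (d / scale if scale > 0 else 0.0) for n, d in degree.items()}


def semantic_tokens(cfg, block_tokens):
    """Sum, per token, the centrality of every block the token occurs in.

    Assumption: a token that occurs several times in one block receives
    that block's centrality once per occurrence.
    """
    centrality = degree_centrality(cfg)
    weights = defaultdict(float)
    for block, tokens in block_tokens.items():
        for tok in tokens:
            weights[tok] += centrality.get(block, 0.0)
    return dict(weights)


# Toy CFG of a counting loop: entry -> cond, cond -> body | exit, body -> cond
cfg = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"], "exit": []}
block_tokens = {
    "entry": ["int", "i", "=", "0"],
    "cond": ["i", "<", "n"],
    "body": ["i", "+=", "1"],
    "exit": ["return", "i"],
}
tokens = semantic_tokens(cfg, block_tokens)
```

In this toy example the token `i` appears in all four blocks and so accumulates the largest weight, while `return` appears only in the low-centrality exit block. The resulting token-to-weight map is the kind of input a Siamese network could then compare for two functions.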
Index Terms
- SCDetector: software functional clone detection based on semantic tokens analysis