Ranking code clones to support maintenance activities

Ehsan, Osama; Khomh, Foutse; Zou, Ying; Qiu, Dong

doi:10.1007/s10664-023-10292-0

Published: 25 April 2023

Volume 28, article number 70 (2023)

Empirical Software Engineering

Osama Ehsan ORCID: orcid.org/0000-0001-5478-059X¹,
Foutse Khomh¹,
Ying Zou² &
…
Dong Qiu³

Abstract

Developers often reuse code fragments through copy-and-paste to speed up code delivery. This process creates duplicated code, also known as code clones. As a software system evolves, the number of clones can increase substantially and degrade code quality. Prior studies have shown that inconsistent changes to code clones can introduce bugs in a software system, and that clones that have experienced certain evolutionary patterns are more at risk than others. As the number of clone copies in a software system increases, it becomes tedious and time-consuming for developers to track and maintain all code clones. Recent studies have proposed approaches that analyze the clone evolution history for better clone maintenance. However, these approaches do not provide a specific list of code clones at a granular level (i.e., commits) that can help developers prioritize their clone maintenance activities. Tracking code clone changes at the commit level is important because it allows developers to fix or refactor code clones early. In this paper, we leverage machine learning to develop clone ranking models that can help developers identify the riskiest clones early on. Specifically, we detect clones from 52 projects (34 Java and 18 C) that have 534,672 commits and build 469,239 clone genealogies. We extract 28 features capturing the characteristics of code clones at the commit level. We then train learning-to-rank (LtR), classification, and regression machine learning models to rank the code clones based on fault occurrence during their evolutionary history. Our comparison of machine learning approaches indicates that classification (for the probability of being faulty) and regression (for the proportion of faulty changes) perform well in ranking code clones. Changes by multiple unique developers and the age of a code clone (in terms of the number of cloned code changes) have a significant effect on the risk of faults in the code clones. Our results can help developers identify the riskiest code clones first and prioritize them for refactoring to prevent future faults.
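
To make the ranking idea concrete, the following is a minimal sketch of how clones could be ordered by the proportion of faulty changes in their genealogy, which is the quantity the abstract names as the regression target. It is not the authors' pipeline: the `CloneGenealogy` record, its fields, the `rank_by_risk` and `precision_at_k` helpers, and the sample data are all hypothetical, and the observed proportion stands in for the score that a trained model would predict from the 28 commit-level features.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CloneGenealogy:
    """Hypothetical record: the change history of one code clone."""
    clone_id: str
    faulty_flags: List[bool]  # one entry per change; True = change linked to a fault

    def faulty_proportion(self) -> float:
        """Proportion of changes flagged as faulty (0.0 if there are no changes)."""
        if not self.faulty_flags:
            return 0.0
        return sum(self.faulty_flags) / len(self.faulty_flags)


def rank_by_risk(genealogies: List[CloneGenealogy]) -> List[CloneGenealogy]:
    """Order clones from most to least risky; ties are broken by clone_id."""
    return sorted(genealogies, key=lambda g: (-g.faulty_proportion(), g.clone_id))


def precision_at_k(ranked_ids: List[str], faulty_ids: Set[str], k: int) -> float:
    """Fraction of the top-k ranked clones that are actually faulty."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in faulty_ids) / len(top)


# Made-up sample data, for illustration only.
clones = [
    CloneGenealogy("A", [True, False, False, True]),  # 2 of 4 changes faulty
    CloneGenealogy("B", [False, False]),              # no faulty changes
    CloneGenealogy("C", [True]),                      # 1 of 1 changes faulty
    CloneGenealogy("D", []),                          # never changed
]

ranked_ids = [g.clone_id for g in rank_by_risk(clones)]
print(ranked_ids)  # ['C', 'A', 'B', 'D']
print(precision_at_k(ranked_ids, {"A", "C"}, k=2))  # 1.0
```

In a real setting the sort key would be a model's predicted score (a fault probability from a classifier, or a predicted faulty-change proportion from a regressor), and a measure such as precision at k would be computed against held-out fault labels.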

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada
Osama Ehsan & Foutse Khomh
Polytechnique Montréal, Montréal, QC, Canada
Ying Zou
Huawei Technologies Co., Foshan, Guangdong, China
Dong Qiu

Corresponding author

Correspondence to Osama Ehsan.

Additional information

Communicated by: Miryung Kim

Data availability statement

The datasets generated during and/or analysed during the current study are available in the Zenodo repository (Footnote 7).

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 6 Details of the selected projects

Table 7 Results of the mixed-effect model for Model_all, sorted by χ² in descending order

Table 8 Results of the mixed-effect model for Model_mature, sorted by χ² in descending order

Table 9 Results of the mixed-effect model for Model_early, sorted by χ² in descending order

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ehsan, O., Khomh, F., Zou, Y. et al. Ranking code clones to support maintenance activities. Empir Software Eng 28, 70 (2023). https://doi.org/10.1007/s10664-023-10292-0

Accepted: 05 January 2023
Published: 25 April 2023
DOI: https://doi.org/10.1007/s10664-023-10292-0
