Abstract
The use of empirical methods has become common in software engineering. This trend has spawned hundreds of publications, whose results are helping to understand and improve the software development process. Due to the data-driven nature of this avenue of investigation, we identified several problems within the current state of the art that pose a threat to the replicability and validity of approaches. The heavy re-use of data sets in many studies may invalidate the results if problems with the data itself are identified. Moreover, for many studies the data and/or the implementations are not available, which hinders replication of the results and thereby decreases the comparability between studies. Furthermore, many studies use small data sets that comprise fewer than 10 projects. This poses a threat especially to the external validity of these studies. Even if all information about a study is available, the diversity of the tooling used can still make replication very hard. Within this paper, we discuss a potential solution to these problems through a cloud-based platform that integrates data collection and analytics. We created SmartSHARK, which implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. Within this article, we present SmartSHARK and discuss our experiences regarding its use and the mentioned problems. Additionally, we show how we have addressed the issues that we identified during our work with SmartSHARK.
Notes
The complete source code as well as deployment scripts are available in our public SVN: http://trex.informatik.uni-goettingen.de/svn/smartshark/. A running instance is located at the following URL: http://smartshark.informatik.uni-goettingen.de.
The developing company Intooitus does not exist anymore and the tool is also not available anymore.
The default set of regular expressions includes: "defect(s)?", "patch(ing|es|ed)?", "bug(s|fix(es)?)?", "(re)?fix(es|ed|ing|age|\s?up(s)?)?", "debug(ged)?", "\#\d+", "back\s?out", "revert(ing|ed)?"
The default assumption may be overridden by applying different strategies based on the size of the changes or other information.
There is no ksudoku-specific list. Instead, we collected the whole kde-games-devel mailing list: https://mail.kde.org/pipermail/kde-games-devel
The project is not available anymore.
Note that all messages were always additionally sent to the mailing list.
See: http://visjs.org/
This problem does not occur anymore with the current version of CVSAnalY.
We use the official git library for analysis: https://libgit2.github.com/
Currently, mecoSHARK is able to detect Type-2 clones, which are clones that are syntactically identical except for variations in layout, comments, whitespace, type references, identifier names, and literals. Details can be found in the SourceMeter documentation: FrontEndART Ltd (2016a).
https://github.com/smartshark/issueSHARK
According to Google Scholar on 2017-07-06.
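The default regular expressions listed in the notes above can be illustrated with a minimal sketch. It assumes that the patterns are applied case-insensitively and may match anywhere in a commit message; neither detail is stated in the note. The names `DEFAULT_PATTERNS`, `BUGFIX_RE`, and `looks_like_bugfix` are hypothetical and not part of SmartSHARK.

```python
import re

# Patterns copied from the note on the default set of regular expressions.
DEFAULT_PATTERNS = [
    r"defect(s)?",
    r"patch(ing|es|ed)?",
    r"bug(s|fix(es)?)?",
    r"(re)?fix(es|ed|ing|age|\s?up(s)?)?",
    r"debug(ged)?",
    r"\#\d+",
    r"back\s?out",
    r"revert(ing|ed)?",
]

# Assumption of this sketch: join all patterns into one alternation and
# ignore case. The note does not say how the tool combines or applies them.
BUGFIX_RE = re.compile(
    "|".join(f"(?:{pattern})" for pattern in DEFAULT_PATTERNS),
    re.IGNORECASE,
)


def looks_like_bugfix(message: str) -> bool:
    """Return True if any default pattern occurs anywhere in the message."""
    return BUGFIX_RE.search(message) is not None
```

Because the sketch searches for substrings, a message such as "Add prefix handling" would also match via "fix". Whether the actual tooling anchors the patterns at word boundaries is not stated in the note.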
References
Alexandru CV, Gall HC (2015) Rapid multi-purpose, multi-commit code analysis. In: Proceedings of the IEEE/ACM 37th international conference on software engineering (ICSE). IEEE/ACM, pp 635–638
Alliance O (2007) Osgi service platform, core specification, release 4, version 4.3. https://www.osgi.org/release-4-version-4-3/
Arcuri A, Fraser G, Galeotti JP (2015) Generating tcp/udp network data for automated unit test generation. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 155–165
Avdiienko V, Kuznetsov K, Gorla A, Zeller A, Arzt S, Rasthofer S, Bodden E (2015) Mining apps for abnormal usage of sensitive data. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp 426–436
Bang L, Aydin A, Bultan T (2015) Automatically computing path complexity of programs. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 61–72
Benelallam A, Gómez A, Sunyé G, Tisi M, Launay D (2014) Neo4emf, a scalable persistence layer for emf models. In: Proceedings of the 10th European conference on modelling foundations and applications - volume 8569. Springer-Verlag New York, Inc., New York, NY, USA, pp 230–241. doi:10.1007/978-3-319-09195-2_15
Bevan J, Whitehead EJ, Kim S, Godfrey M (2005) Facilitating software evolution research with kenyon. In: ACM SIGSOFT software engineering notes, vol 30. ACM, pp 177–186
Beyer D, Dangl M, Dietsch D, Heizmann M, Stahlbauer A (2015) Witness validation and stepwise testification across software verifiers. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 721–733
Bird C, Gourley A, Devanbu P, Gertz M, Swaminathan A (2006) Mining email social networks. In: Proceedings of the 2006 international workshop on mining software repositories. ACM, pp 137–143
Catal C, Diri B (2009) A systematic review of software fault prediction studies. Expert Syst Appl 36(4):7346–7354
Cavalcanti G, Accioly P, Borba P (2015) Assessing semistructured merge in version control systems: a replicated experiment. In: ACM/IEEE international symposium on empirical software engineering and measurement 2015 (ESEM). IEEE, pp 1–10
Claes M, Mens T, Di Cosmo R, Vouillon J (2015) A historical analysis of debian package incompatibilities. In: IEEE/ACM 12th working conference on mining software repositories 2015 (MSR). IEEE, pp 212–223
Coelho R, Almeida L, Gousios G, van Deursen A (2015) Unveiling exception handling bug hazards in android based on github and google code issues. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 134–145
Ċubranić D, Murphy GC, Singer J, Booth KS (2005) Hipikat: a project memory for software development. IEEE Trans Softw Eng 31(6):446–465
Czerwonka J, Nagappan N, Schulte W (2013) CODEMINE: building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71
Devanbu P, Zimmermann T, Bird C (2016) Belief & evidence in empirical software engineering. In: Proceedings of the 38th international conference on software engineering. ACM, pp 108–119
Dhar A, Purandare R, Dhawan M, Rangaswamy S (2015) Clotho: saving programs from malformed strings and incorrect string-handling. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 555–566
Di Ruscio D, Kolovos DS, Korkontzelos I, Matragkas N, Vinju J (2015) Ossmeter: a software measurement platform for automatically analysing open source software projects. In: ESEC/FSE 2015 tool demonstrations track
Di Sorbo A, Panichella S, Visaggio C, Di Penta M, Canfora G, Gall H (2015) Development emails content analyzer: Intention mining in developer discussions. In: Proceedings of the IEEE/ACM 30th international conference on automated software engineering (ASE)
Draisbach U, Naumann F (2010) Dude: The duplicate detection toolkit. In: Proceedings of the international workshop on quality in databases (QDB)
Dyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: Proceedings of the IEEE/ACM 35th international conference on software engineering (ICSE)
Dyer R, Nguyen HA, Rajan H, Nguyen T (2015) Boa: ultra-large-scale software repository and source code mining. ACM Transactions on Software Engineering and Methodology forthcoming
Eichberg M, Hermann B, Mezini M, Glanz L (2015) Hidden truths in dead software paths. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 474–484
Fernandez-Ramil J, Izquierdo-Cortazar D, Mens T (2009) What does it take to develop a million lines of open source code? In: Open source ecosystems: diverse communities interacting. Springer, pp 170–184
Foucault M, Palyart M, Blanc X, Murphy GC, Falleri JR (2015) Impact of developer turnover on quality in open-source software. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 829–841
FrontEndART Ltd (2016a) SourceMeter Documentation for Java. https://www.sourcemeter.com/resources/java, [accessed 02-August-2016]
FrontEndART Ltd (2016b) SourceMeter Documentation for Python. https://www.sourcemeter.com/resources/python, [accessed 02-August-2016]
FrontEndART Ltd (2016c) SourceMeter Webpage. https://www.sourcemeter.com/, [accessed 02-August-2016]
Gallaba K, Mesbah A, Beschastnikh I (2015) Don’t call us, we’ll call you: characterizing callbacks in javascript. In: ACM/IEEE international symposium on empirical software engineering and measurement 2015 (ESEM). IEEE, pp 1–10
German DM (2004) Mining CVS repositories, the softChange experience. Evolution 245(5,402):92–688
Giger E, Pinzger M, Gall H (2010) Predicting the fix time of bugs. In: Proceedings of the 2nd international workshop on recommendation systems for software engineering (RSSE). ACM, pp 52–56
Gong L, Pradel M, Sen K (2015) Jitprof: Pinpointing jit-unfriendly javascript code. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 357–368
González-Barahona JM, Robles G (2012) On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir Softw Eng 17(1-2):75–89
Gousios G, Spinellis D (2009) Alitheia core: An extensible software quality monitoring platform. In: Proceedings of the IEEE/ACM 31st international conference on software engineering (ICSE)
Gousios G, Spinellis D (2012) Ghtorrent: Github’s data from a firehose. In: Proceedings of the 9th IEEE working conference on mining software repositories (MSR). IEEE, pp 12–21
Gousios G, Kalliamvakou E, Spinellis D (2008) Measuring developer contribution from software repository data. In: Proceedings of the 2008 international working conference on mining software repositories. ACM, New York, NY, USA, MSR ’08, pp 129–132. doi:10.1145/1370750.1370781
Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean ghtorrent: Github data on demand. In: Proceedings of the 11th IEEE working conference on mining software repositories (MSR). ACM, pp 384–387
Gupta M, Sureka A, Padmanabhuni S, Asadullah AM (2015) Identifying software process management challenges: Survey of practitioners in a large global it company. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 346–356
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18. doi:10.1145/1656274.1656278
Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. doi:10.1109/TSE.2011.103
He P, Li B, Liu X, Chen J, Ma Y (2015) An empirical study on software defect prediction with a simplified metric set. Inf Softw Technol 59:170–190
Herbold S (2017) A systematic mapping study on cross-project defect prediction. CoRR abs/1705.06429. https://arxiv.org/abs/1705.06429. arXiv:1705.06429
Hermann B, Reif M, Eichberg M, Mezini M (2015) Getting to know you: towards a capability model for java. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 758–769
Herraiz I, Robles G, Amor JJ, Romera T, González Barahona JM (2006) The processes of joining in global distributed software projects. In: Proceedings of the 2006 international workshop on global software development for the practitioner. ACM, pp 27–33
Herraiz I, Gonzalez-Barahona JM, Robles G (2007) Forecasting the number of changes in eclipse using time series analysis. In: Proceedings of the 4th IEEE working conference on mining software repositories (MSR)
Honsel V, Honsel D, Herbold S, Grabowski J, Waack S (2015) Mining software dependency networks for agent-based simulation of software evolution. In: Proceedings of the 4th international workshop on software mining (SoftMine)
Howison J, Conklin MS, Crowston K (2005) Ossmole: A collaborative repository for floss research data and analyses. In: Proceedings of the 1st international conference on open source software
ISO/IEC (1998) 9241-11 Ergonomic requirements for office work with visual display terminals (VDTs). ISO/IEC 9241-14
Jermakovics A, Sillitti A, Succi G (2011) Mining and visualizing developer networks from version control systems. In: Proceedings of the 4th international workshop on cooperative and human aspects of software engineering (CHASE). ACM, New York, NY, USA, CHASE ’11, pp 24–31. doi:10.1145/1984642.1984647
Joblin M, Mauerer W, Apel S, Siegmund J, Riehle D (2015) From developer networks to verified communities: a fine-grained approach. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp 563–573
Jorgensen M, Shepperd M (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53. doi:10.1109/TSE.2007.256943
Kouroshfar E, Mirakhorli M, Bagheri H, Xiao L, Malek S, Cai Y (2015) A study on the role of software architecture in the evolution and quality of software. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 246–257
Le DM, Behnamghader P, Garcia J, Link D, Shahbazian A, Medvidovic N (2015) An empirical study of architectural change in open-source software systems. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 235–245
Lin Z, Whitehead J (2015) Why power laws? An explanation from fine-grained code changes. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 68–75
Linares-Vásquez M, Bavota G, Cárdenas CEB, Oliveto R, Di Penta M, Poshyvanyk D (2015a) Optimizing energy consumption of guis in android apps: A multi-objective approach. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 143–154
Linares-Vásquez M, White M, Bernal-Cárdenas C, Moran K, Poshyvanyk D (2015b) Mining android app usages for generating actionable gui-based execution scenarios. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 111–122
Long F, Rinard M (2015) Staged program repair with condition synthesis. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 166–178
Makedonski P, Grabowski J (2016) Weighted Multi-factor multi-layer identification of potential causes for events of interest in software repositories. In: Proceedings of the seminar series on advanced techniques and tools for software evolution (SATToSE) 2015, forthcoming 2016
Makedonski P, Sudau F, Grabowski J (2015) Towards a model-based software mining infrastructure. ACM SIGSOFT Software Engineering Notes 40(1):1–8
Menzies T, Rees-Jones M, Krishna R, Pape C (2015) The promise repository of empirical software engineering data. http://openscience.us/repo. North Carolina State University, Department of Computer Science [accessed 22-January-2015]
Nanz S, Furia CA (2015) A comparative study of programming languages in rosetta code. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 778–788
Nguyen HV, Kästner C, Nguyen TN (2015a) Cross-language program slicing for dynamic web applications. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 369–380
Nguyen TH, Grundy J, Almorsy M (2015b) Rule-based extraction of goal-use case models from text. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 591–601
Park J, Esmaeilzadeh H, Zhang X, Naik M, Harris W (2015) Flexjava: language support for safe and modular approximate programming. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 745–757
Robles G (2010) Replicating msr: a study of the potential replicability of papers published in the mining software repositories proceedings. In: 2010 7th IEEE working conference on mining software repositories (MSR). IEEE, pp 171–180
Robles G, González-Barahona JM, Cervigón C, Capiluppi A, Izquierdo-Cortázar D (2014) Estimating development effort in free/open source software projects by mining software repositories: a case study of openstack. In: Proceedings of the 11th working conference on mining software repositories. ACM, New York, NY, USA, MSR 2014, pp 222–231. doi:10.1145/2597073.2597107
Safi G, Shahbazian A, Halfond WG, Medvidovic N (2015) Detecting event anomalies in event-based systems. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 25–37
Samak M, Ramanathan MK (2015) Synthesizing tests for detecting atomicity violations. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 131–142
Scheidgen M, Zubow A, Fischer J, Kolbe TH (2012) Automated and transparent model fragmentation for persisting large models. Springer
Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. IEEE Trans Softw Eng 39(9):1208–1215
Shi A, Yung T, Gyori A, Marinov D (2015) Comparing and combining test-suite reduction and regression test selection. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 237–247
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Softw Eng 13(2):211–218
Siegmund J, Siegmund N, Apel S (2015a) Views on internal and external validity in empirical software engineering. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering (ICSE), vol 1. IEEE, pp 9–19
Siegmund N, Grebhahn A, Apel S, Kästner C (2015b) Performance-influence models for highly configurable systems. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 284–294
Sjøberg DI, Hannay JE, Hansen O, Kampenes VB, Karahasanovic A, Liborg NK, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753
Smith EK, Barr ET, Le Goues C, Brun Y (2015) Is the cure worse than the disease? Overfitting in automated program repair. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 532–543
Smith EK, Bird C, Zimmermann T (2016) Beliefs, practices, and personalities of software engineers: a survey in a large software company. In: Proceedings of the 9th international workshop on cooperative and human aspects of software engineering. ACM, pp 15–18
Sun X, Liu X, Li B, Duan Y, Yang H, Hu J (2016) Exploring topic models in software engineering data analysis: A survey. In: 2016 17th IEEE/ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (SNPD), pp 357–362. doi:10.1109/SNPD.2016.7515925
Tan M, Tan L, Dara S, Mayeux C (2015) Online defect prediction for imbalanced data. In: Proceedings of the IEEE/ACM 37th international conference on software engineering (ICSE)
Tao Y, Kim S (2015) Partitioning composite code changes to facilitate code review. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 180–190
Thomas JJ, Cook KA (2006) A visual analytics agenda. IEEE Comput Graph Appl 26(1):10–13
Trautsch F, Herbold S, Makedonski P, Grabowski J (2016) Adressing problems with external validity of repository mining studies through a smart data platform. In: Proceedings of the 13th international workshop on mining software repositories. ACM, pp 97–108
Van Rysselberghe F, Demeyer S (2004) Studying software evolution information by visualizing the change history. In: Proceedings of the 20th IEEE international conference on software maintenance, 2004. IEEE, pp 328–337
Walden J, Stuckman J, Scandariato R (2014) Predicting vulnerable components: Software metrics vs text mining. In: Proceedings of the IEEE 25th international symposium on software reliability engineering (ISSRE). IEEE, pp 23–33
Wheeler DA (2004) Sloccount Documentation. http://www.dwheeler.com/sloccount/sloccount.html, [accessed 02-August-2016]
Xu T, Jin L, Fan X, Zhou Y, Pasupathy S, Talwadker R (2015) Hey, you have given me too many knobs!: understanding and dealing with over-designed configuration in system software. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 307–319
Yang W, Xiao X, Andow B, Li S, Xie T, Enck W (2015) Appcontext: Differentiating malicious and benign mobile app behaviors using context. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 303–313
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud)
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on network system design and implementation (NSDI)
Zhu J, He P, Fu Q, Zhang H, Lyu MR, Zhang D (2015) Learning to log: helping developers make informed logging decisions. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 415–425
Acknowledgements
The authors would like to thank Fabian Glaser, Michael Göttsche, and Gunnar Krull for their support regarding the cloud technologies and the deployment as well as the GWDG for allowing us to use their HPC resources.
Additional information
Communicated by: Romain Robbes, Christian Bird and Emily Hill
About this article
Cite this article
Trautsch, F., Herbold, S., Makedonski, P. et al. Addressing problems with replicability and validity of repository mining studies through a smart data platform. Empir Software Eng 23, 1036–1083 (2018). https://doi.org/10.1007/s10664-017-9537-x
DOI: https://doi.org/10.1007/s10664-017-9537-x