Addressing problems with replicability and validity of repository mining studies through a smart data platform

Trautsch, Fabian; Herbold, Steffen; Makedonski, Philip; Grabowski, Jens

doi:10.1007/s10664-017-9537-x

Published: 08 August 2017

Empirical Software Engineering, volume 23, pages 1036–1083 (2018)

Fabian Trautsch (ORCID: orcid.org/0000-0002-8374-9142)¹, Steffen Herbold¹, Philip Makedonski¹ & Jens Grabowski¹

Abstract

Empirical methods have become common in software engineering. This trend has spawned hundreds of publications whose results help to understand and improve the software development process. Because this avenue of investigation is data-driven, we identified several problems in the current state of the art that threaten the replicability and validity of approaches. First, many studies re-use the same data sets, so their results may be invalidated if problems with the data itself are identified. Second, for many studies the data and/or the implementations are not available, which hinders replication of the results and thereby decreases the comparability between studies. Third, many studies use small data sets comprising fewer than 10 projects, which especially threatens the external validity of these studies. Finally, even if all information about a study is available, the diversity of the tooling used can make replication very hard. In this article, we discuss a potential solution to these problems: a cloud-based platform that integrates data collection and analytics. We created SmartSHARK, which implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. We present SmartSHARK and discuss our experiences with it in relation to the problems mentioned above. Additionally, we show how we addressed the issues that we identified during our work with SmartSHARK.

Notes

https://www.r-project.org/
https://www.githubarchive.org/
https://hadoop.apache.org/
http://spark.apache.org/docs/latest/mllib-guide.html
http://spark.apache.org/graphx/
http://bitergia.com/
https://www.openhub.net/
https://www.topcoder.com/
The complete source code as well as deployment scripts are available in our public SVN: http://trex.informatik.uni-goettingen.de/svn/smartshark/. A running instance is located at the following URL: http://smartshark.informatik.uni-goettingen.de.
The developing company Intooitus does not exist anymore and the tool is also not available anymore.
http://github.com/MetricsGrimoire/CVSAnalY
http://smartshark.informatik.uni-goettingen.de/index.php?r=site%2Fmongodesign
http://www.gnu.org/software/diffutils/
The default set of regular expressions includes: "defect(s)?", "patch(ing|es|ed)?", "bug(s|fix(es)?)?", "(re)?fix(es|ed|ing|age|\s?up(s)?)?", "debug(ged)?", "\#\d+", "back\s?out", "revert(ing|ed)?"
The default assumption may be overridden by applying different strategies based on the size of the changes or other information.
http://www.yiiframework.com/
https://www.vagrantup.com/
http://www.ansible.com/
http://smartshark.informatik.uni-goettingen.de/
http://github.com/
https://github.com/apache/mahout
There is no ksudoku-specific list. Instead, we collected the whole kde-games-devel mailing list: https://mail.kde.org/pipermail/kde-games-devel
The project is not available anymore.
Note that all messages were always additionally sent to the mailing list.
See: http://www.highcharts.com/
See: http://visjs.org/
https://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
http://www.yiiframework.com/doc-2.0/guide-structure-widgets.html
http://www.neoemf.com
This problem does not occur anymore with the current version of CVSAnalY.
https://github.com/smartshark/vcsSHARK
https://github.com/smartshark/mecoSHARK
We use the official git library for analysis: https://libgit2.github.com/
Currently, mecoSHARK is able to detect Type-2 clones, which are clones that are syntactically identical except for variations in layout, comments, whitespace, type references, identifier names, and literals. Details can be found in the SourceMeter documentation: FrontEndART Ltd (2016a).
http://gwdg.de
https://github.com/smartshark/issueSHARK
https://github.com/smartshark/coastSHARK
http://smartshark2.informatik.uni-goettingen.de/documentation/
According to Google Scholar on 2017-07-06.
https://d3js.org/
https://github.com/smartshark
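
A minimal sketch of how the default regular expressions quoted in the notes above could be applied to commit messages. The Python snippet combines them into one pattern. The function name `looks_like_bugfix`, the case-insensitive matching, and the whole-word matching via lookarounds are assumptions made for illustration; they are not taken from the SmartSHARK implementation.

```python
import re

# Default keyword patterns as quoted in the notes (backslashes restored).
DEFAULT_PATTERNS = [
    r"defect(s)?",
    r"patch(ing|es|ed)?",
    r"bug(s|fix(es)?)?",
    r"(re)?fix(es|ed|ing|age|\s?up(s)?)?",
    r"debug(ged)?",
    r"\#\d+",
    r"back\s?out",
    r"revert(ing|ed)?",
]

# Assumption: a keyword only counts when it is not embedded in a longer
# word (so "prefix" does not match "fix"), and case is ignored.
BUGFIX_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(DEFAULT_PATTERNS) + r")(?!\w)",
    re.IGNORECASE,
)


def looks_like_bugfix(message: str) -> bool:
    """Return True if the commit message contains one of the keywords."""
    return BUGFIX_RE.search(message) is not None


if __name__ == "__main__":
    for msg in ["Fixed NPE in parser", "Closes #42", "Add new export feature"]:
        print(msg, "->", looks_like_bugfix(msg))
```

A keyword match of this kind is only a heuristic. As the following note states, the default assumption may be overridden by other strategies, for example ones based on the size of the changes.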

References

Alexandru CV, Gall HC (2015) Rapid multi-purpose, multi-commit code analysis. In: Proceedings of the IEEE/ACM 37th international conference on software engineering (ICSE). IEEE/ACM, pp 635–638
Alliance O (2007) Osgi service platform, core specification, release 4, version 4.3. https://www.osgi.org/release-4-version-4-3/
Arcuri A, Fraser G, Galeotti JP (2015) Generating tcp/udp network data for automated unit test generation. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 155–165
Avdiienko V, Kuznetsov K, Gorla A, Zeller A, Arzt S, Rasthofer S, Bodden E (2015) Mining apps for abnormal usage of sensitive data. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp 426–436
Bang L, Aydin A, Bultan T (2015) Automatically computing path complexity of programs. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 61–72
Benelallam A, Gómez A, Sunyé G, Tisi M, Launay D (2014) Neo4emf, a scalable persistence layer for emf models. In: Proceedings of the 10th European conference on modelling foundations and applications - volume 8569. Springer-Verlag New York, Inc., New York, NY, USA, pp 230–241. doi:10.1007/978-3-319-09195-2_15
Bevan J, Whitehead EJ, Kim S, Godfrey M (2005) Facilitating software evolution research with kenyon. In: ACM SIGSOFT software engineering notes, vol 30. ACM, pp 177–186
Beyer D, Dangl M, Dietsch D, Heizmann M, Stahlbauer A (2015) Witness validation and stepwise testification across software verifiers. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 721–733
Bird C, Gourley A, Devanbu P, Gertz M, Swaminathan A (2006) Mining email social networks. In: Proceedings of the 2006 international workshop on mining software repositories. ACM, pp 137–143
Catal C, Diri B (2009) A systematic review of software fault prediction studies. Expert Syst Appl 36(4):7346–7354
Cavalcanti G, Accioly P, Borba P (2015) Assessing semistructured merge in version control systems: a replicated experiment. In: ACM/IEEE international symposium on empirical software engineering and measurement 2015 (ESEM). IEEE, pp 1–10
Claes M, Mens T, Di Cosmo R, Vouillon J (2015) A historical analysis of debian package incompatibilities. In: IEEE/ACM 12th working conference on mining software repositories 2015 (MSR). IEEE, pp 212–223
Coelho R, Almeida L, Gousios G, van Deursen A (2015) Unveiling exception handling bug hazards in android based on github and google code issues. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 134–145
Ċubranić D, Murphy GC, Singer J, Booth KS (2005) Hipikat: a project memory for software development. IEEE Trans Softw Eng 31(6):446–465
Czerwonka J, Nagappan N, Schulte W (2013) CODEMINE: building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71
Devanbu P, Zimmermann T, Bird C (2016) Belief & evidence in empirical software engineering. In: Proceedings of the 38th international conference on software engineering. ACM, pp 108–119
Dhar A, Purandare R, Dhawan M, Rangaswamy S (2015) Clotho: saving programs from malformed strings and incorrect string-handling. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 555–566
Di Ruscio D, Kolovos DS, Korkontzelos I, Matragkas N, Vinju J (2015) Ossmeter: a software measurement platform for automatically analysing open source software projects. In: ESEC/FSE 2015 tool demonstrations track
Di Sorbo A, Panichella S, Visaggio C, Di Penta M, Canfora G, Gall H (2015) Development emails content analyzer: Intention mining in developer discussions. In: Proceedings of the IEEE/ACM 30th international conference on automated software engineering (ASE)
Draisbach U, Naumann F (2010) Dude: The duplicate detection toolkit. In: Proceedings of the international workshop on quality in databases (QDB)
Dyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: Proceedings of the IEEE/ACM 35th international conference on software engineering (ICSE)
Dyer R, Nguyen HA, Rajan H, Nguyen T (2015) Boa: ultra-large-scale software repository and source code mining. ACM Transactions on Software Engineering and Methodology forthcoming
Eichberg M, Hermann B, Mezini M, Glanz L (2015) Hidden truths in dead software paths. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 474–484
Fernandez-Ramil J, Izquierdo-Cortazar D, Mens T (2009) What does it take to develop a million lines of open source code?. In: Open source ecosystems: diverse communities interacting. Springer, pp 170–184
Foucault M, Palyart M, Blanc X, Murphy GC, Falleri JR (2015) Impact of developer turnover on quality in open-source software. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 829–841
FrontEndART Ltd (2016a) SourceMeter Documentation for Java. https://www.sourcemeter.com/resources/java, [accessed 02-August-2016]
FrontEndART Ltd (2016b) SourceMeter Documentation for Python. https://www.sourcemeter.com/resources/python, [accessed 02-August-2016]
FrontEndART Ltd (2016c) SourceMeter Webpage. https://www.sourcemeter.com/, [accessed 02-August-2016]
Gallaba K, Mesbah A, Beschastnikh I (2015) Don’t call us, we’ll call you: characterizing callbacks in javascript. In: ACM/IEEE international symposium on empirical software engineering and measurement 2015 (ESEM). IEEE, pp 1–10
German DM (2004) Mining CVS repositories, the softChange experience. Evolution 245(5,402):92–688
Giger E, Pinzger M, Gall H (2010) Predicting the fix time of bugs. In: Proceedings of the 2nd International workshop on recommendation systems for software engineering (RSSE). ACM, pp. 52–56
Gong L, Pradel M, Sen K (2015) Jitprof: Pinpointing jit-unfriendly javascript code. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 357–368
González-Barahona JM, Robles G (2012) On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir Softw Eng 17(1-2):75–89
Gousios G, Spinellis D (2009) Alitheia core: An extensible software quality monitoring platform. In: Proceedings of the IEEE/ACM 31st international conference on software engineering (ICSE)
Gousios G, Spinellis D (2012) Ghtorrent: Github’s data from a firehose. In: Proceedings of the 9th IEEE working conference on mining software repositories (MSR). IEEE, pp 12–21
Gousios G, Kalliamvakou E, Spinellis D (2008) Measuring developer contribution from software repository data. In: Proceedings of the 2008 international working conference on mining software repositories. ACM, New York, NY, USA, MSR ’08, pp 129–132. doi:10.1145/1370750.1370781
Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean ghtorrent: Github data on demand. In: Proceedings of the 11th IEEE working conference on mining software repositories (MSR). ACM, pp 384–387
Gupta M, Sureka A, Padmanabhuni S, Asadullah AM (2015) Identifying software process management challenges: Survey of practitioners in a large global it company. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 346–356
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18. doi:10.1145/1656274.1656278
Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. doi:10.1109/TSE.2011.103
He P, Li B, Liu X, Chen J, Ma Y (2015) An empirical study on software defect prediction with a simplified metric set. Inf Softw Technol 59:170–190
Herbold S (2017) A systematic mapping study on cross-project defect prediction. CoRR abs/1705.06429. https://arxiv.org/abs/1705.06429. arXiv:1705.06429
Hermann B, Reif M, Eichberg M, Mezini M (2015) Getting to know you: towards a capability model for java. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 758–769
Herraiz I, Robles G, Amor JJ, Romera T, González Barahona JM (2006) The processes of joining in global distributed software projects. In: Proceedings of the 2006 international workshop on global software development for the practitioner. ACM, pp 27–33
Herraiz I, Gonzalez-Barahona JM, Robles G (2007) Forecasting the number of changes in eclipse using time series analysis. In: Proceedings of the 4th IEEE working conference on mining software repositories (MSR)
Honsel V, Honsel D, Herbold S, Grabowski J, Waack S (2015) Mining software dependency networks for agent-based simulation of software evolution. In: Proceedings of the 4th international workshop on software mining (SoftMine)
Howison J, Conklin MS, Crowston K (2005) Ossmole: A collaborative repository for floss research data and analyses. In: Proceedings of the 1st international conference on open source software
ISO/IEC (1998) 9241-11 Ergonomic requirements for office work with visual display terminals (VDTs). ISO/IEC 9241-14
Jermakovics A, Sillitti A, Succi G (2011) Mining and visualizing developer networks from version control systems. In: Proceedings of the 4th international workshop on cooperative and human aspects of software engineering (CHASE). ACM, New York, NY, USA, CHASE ’11, pp 24–31. doi:10.1145/1984642.1984647
Joblin M, Mauerer W, Apel S, Siegmund J, Riehle D (2015) From developer networks to verified communities: a fine-grained approach. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp 563–573
Jorgensen M, Shepperd M (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53. doi:10.1109/TSE.2007.256943
Kouroshfar E, Mirakhorli M, Bagheri H, Xiao L, Malek S, Cai Y (2015) A study on the role of software architecture in the evolution and quality of software. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 246–257
Le DM, Behnamghader P, Garcia J, Link D, Shahbazian A, Medvidovic N (2015) An empirical study of architectural change in open-source software systems. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 235–245
Lin Z, Whitehead J (2015) Why power laws? An explanation from fine-grained code changes. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 68–75
Linares-Vásquez M, Bavota G, Cárdenas CEB, Oliveto R, Di Penta M, Poshyvanyk D (2015a) Optimizing energy consumption of guis in android apps: A multi-objective approach. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 143–154
Linares-Vásquez M, White M, Bernal-Cárdenas C, Moran K, Poshyvanyk D (2015b) Mining android app usages for generating actionable gui-based execution scenarios. In: Proceedings of the 12th working conference on mining software repositories. IEEE Press, pp 111–122
Long F, Rinard M (2015) Staged program repair with condition synthesis. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 166–178
Makedonski P, Grabowski J (2016) Weighted Multi-factor multi-layer identification of potential causes for events of interest in software repositories. In: Proceedings of the seminar series on advanced techniques and tools for software evolution (SATToSE) 2015, forthcoming 2016
Makedonski P, Sudau F, Grabowski J (2015) Towards a model-based software mining infrastructure. ACM SIGSOFT Software Engineering Notes 40(1):1–8
Menzies T, Rees-Jones M, Krishna R, Pape C (2015) The promise repository of empirical software engineering data. http://openscience.us/repo. North Carolina State University, Department of Computer Science [accessed 22-January-2015]
Nanz S, Furia CA (2015) A comparative study of programming languages in rosetta code. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 778–788
Nguyen HV, Kästner C, Nguyen TN (2015a) Cross-language program slicing for dynamic web applications. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 369–380
Nguyen TH, Grundy J, Almorsy M (2015b) Rule-based extraction of goal-use case models from text. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 591–601
Park J, Esmaeilzadeh H, Zhang X, Naik M, Harris W (2015) Flexjava: language support for safe and modular approximate programming. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 745–757
Robles G (2010) Replicating msr: a study of the potential replicability of papers published in the mining software repositories proceedings. In: 2010 7th IEEE working conference on mining software repositories (MSR). IEEE, pp 171–180
Robles G, González-Barahona JM, Cervigón C, Capiluppi A, Izquierdo-Cortázar D (2014) Estimating development effort in free/open source software projects by mining software repositories: a case study of openstack. In: Proceedings of the 11th working conference on mining software repositories. ACM, New York, NY, USA, MSR 2014, pp 222–231. doi:10.1145/2597073.2597107
Safi G, Shahbazian A, Halfond WG, Medvidovic N (2015) Detecting event anomalies in event-based systems. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 25–37
Samak M, Ramanathan MK (2015) Synthesizing tests for detecting atomicity violations. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 131–142
Scheidgen M, Zubow A, Fischer J, Kolbe TH (2012) Automated and transparent model fragmentation for persisting large models. Springer
Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. IEEE Trans Softw Eng 39(9):1208–1215
Shi A, Yung T, Gyori A, Marinov D (2015) Comparing and combining test-suite reduction and regression test selection. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 237–247
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in empirical software engineering. Empir Softw Eng 13(2):211–218
Siegmund J, Siegmund N, Apel S (2015a) Views on internal and external validity in empirical software engineering. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering (ICSE), vol 1. IEEE, pp 9–19
Siegmund N, Grebhahn A, Apel S, Kästner C (2015b) Performance-influence models for highly configurable systems. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 284–294
Sjøberg DI, Hannay JE, Hansen O, Kampenes VB, Karahasanovic A, Liborg NK, Rekdal AC (2005) A survey of controlled experiments in software engineering. IEEE Trans Softw Eng 31(9):733–753
Smith EK, Barr ET, Le Goues C, Brun Y (2015) Is the cure worse than the disease? Overfitting in automated program repair. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 532–543
Smith EK, Bird C, Zimmermann T (2016) Beliefs, practices, and personalities of software engineers: a survey in a large software company. In: Proceedings of the 9th international workshop on cooperative and human aspects of software engineering. ACM, pp 15–18
Sun X, Liu X, Li B, Duan Y, Yang H, Hu J (2016) Exploring topic models in software engineering data analysis: A survey. In: 2016 17th IEEE/ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (SNPD), pp 357–362. doi:10.1109/SNPD.2016.7515925
Tan M, Tan L, Dara S, Mayeux C (2015) Online defect prediction for imbalanced data. In: Proceedings of the IEEE/ACM 37th international conference on software engineering (ICSE)
Tao Y, Kim S (2015) Partitioning composite code changes to facilitate code review. In: 2015 IEEE/ACM 12th working conference on mining software repositories (MSR). IEEE, pp 180–190
Thomas JJ, Cook KA (2006) A visual analytics agenda. IEEE Comput Graph Appl 26(1):10–13
Trautsch F, Herbold S, Makedonski P, Grabowski J (2016) Adressing problems with external validity of repository mining studies through a smart data platform. In: Proceedings of the 13th international workshop on mining software repositories. ACM, pp 97–108
Van Rysselberghe F, Demeyer S (2004) Studying software evolution information by visualizing the change history. In: Proceedings of the 20th IEEE international conference on software maintenance, 2004. IEEE, pp 328–337
Walden J, Stuckman J, Scandariato R (2014) Predicting vulnerable components: Software metrics vs text mining. In: Proceedings of the IEEE 25th international symposium on software reliability engineering (ISSRE). IEEE, pp 23–33
Wheeler DA (2004) Sloccount Documentation. http://www.dwheeler.com/sloccount/sloccount.html, [accessed 02-August-2016]
Xu T, Jin L, Fan X, Zhou Y, Pasupathy S, Talwadker R (2015) Hey, you have given me too many knobs!: understanding and dealing with over-designed configuration in system software. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, pp 307–319
Yang W, Xiao X, Andow B, Li S, Xie T, Enck W (2015) Appcontext: Differentiating malicious and benign mobile app behaviors using context. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 303–313
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud)
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on network system design and implementation (NSDI)
Zhu J, He P, Fu Q, Zhang H, Lyu MR, Zhang D (2015) Learning to log: helping developers make informed logging decisions. In: IEEE/ACM 37th IEEE international conference on software engineering 2015 (ICSE), vol 1. IEEE, pp 415–425

Acknowledgements

The authors would like to thank Fabian Glaser, Michael Göttsche, and Gunnar Krull for their support regarding the cloud technologies and the deployment as well as the GWDG for allowing us to use their HPC resources.

Author information

Authors and Affiliations

Institute of Computer Science, Georg-August-Universität Göttingen, Göttingen, Germany
Fabian Trautsch, Steffen Herbold, Philip Makedonski & Jens Grabowski

Corresponding author

Correspondence to Fabian Trautsch.

Additional information

Communicated by: Romain Robbes, Christian Bird and Emily Hill

About this article

Cite this article

Trautsch, F., Herbold, S., Makedonski, P. et al. Addressing problems with replicability and validity of repository mining studies through a smart data platform. Empir Software Eng 23, 1036–1083 (2018). https://doi.org/10.1007/s10664-017-9537-x

Issue Date: April 2018
DOI: https://doi.org/10.1007/s10664-017-9537-x
