ABSTRACT
The growing interest in collaborative software development on platforms like GitHub has made large amounts of data about development activities available. The GHTorrent project has recorded a significant proportion of GitHub’s public event stream and hosts what is currently the largest public dataset of metadata about open-source development. We describe our infrastructure, which makes this data locally available to researchers and students, give examples of research activities carried out on it, and report what we learned from building the system. We identify a need for domain-specific tools, especially databases, that can handle large-scale code repositories and their associated metadata, and we outline open challenges in using such tools more effectively in research and machine learning settings.
Index Terms
- Three trillion lines: infrastructure for mining GitHub in the classroom