ABSTRACT
Modern code review (MCR) is practiced in most organizations that develop software because of its benefits for quality assurance and knowledge transfer. With the rise of collaborative software development platforms such as GitHub and Bitbucket, millions of projects now share not only their code but also their review data. Although researchers have tried to exploit this data for more than a decade, most of the knowledge it holds remains a buried treasure. Many advances in deep learning, however, were catalyzed by the accessibility of large-scale standard datasets for different learning tasks. This paper presents the ETCR (Exploit Those Code Reviews!) infrastructure for mining MCR datasets from any GitHub project that practices pull-request-based development. We demonstrate its effectiveness with ETCR-Elasticsearch, a dataset of more than 231k review comments on more than 47k Java file revisions in more than 40k pull requests from the Elasticsearch project. ETCR is designed with the challenges of deep learning in mind. In contrast to previous datasets, ETCR datasets include all information needed to link review comments to nodes in the respective program’s abstract syntax tree (AST).