Exploit those code reviews! bigger data for deeper learning

Authors:
Robert Heumüller, University of Magdeburg, Germany (ORCID 0000-0002-9906-0323)
Sebastian Nielebock, University of Magdeburg, Germany (ORCID 0000-0002-0147-3526)
Frank Ortmeier, University of Magdeburg, Germany (ORCID 0000-0001-6186-4142)

ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August 2021, Pages 1505–1509. https://doi.org/10.1145/3468264.3473110

Published: 18 August 2021

ABSTRACT

Modern code review (MCR) processes are prevalent in most organizations that develop software due to benefits in quality assurance and knowledge transfer. With the rise of collaborative software development platforms like GitHub and Bitbucket, today, millions of projects share not only their code but also their review data. Although researchers have tried to exploit this data for more than a decade, most of that knowledge remains a buried treasure. A crucial catalyst for many advances in deep learning, however, is the accessibility of large-scale standard datasets for different learning tasks. This paper presents the ETCR (Exploit Those Code Reviews!) infrastructure for mining MCR datasets from any GitHub project practicing pull-request-based development. We demonstrate its effectiveness with ETCR-Elasticsearch, a dataset of >231k review comments for >47k Java file revisions in >40k pull requests from the Elasticsearch project. ETCR is designed with the challenge of deep learning in mind. Compared to previous datasets, ETCR datasets include all information for linking review comments to nodes in the respective program’s Abstract Syntax Tree.
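To make the idea of linking a review comment to an Abstract Syntax Tree node concrete, here is a minimal sketch. It is not the ETCR implementation: the abstract does not describe the linking procedure, so the approach below (pick the node with the smallest line span that covers the commented line) is an assumption. It also uses Python's built-in `ast` module as a stand-in for a Java parser, and the `ReviewComment` record and `innermost_node_at` helper are hypothetical names introduced only for illustration.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewComment:
    """Hypothetical record: a comment attached to one line of a file revision."""
    path: str
    line: int  # 1-based line number in the reviewed revision
    body: str


def innermost_node_at(source: str, line: int) -> Optional[ast.AST]:
    """Return the AST node with the smallest line span covering `line`.

    Ties are resolved in favor of the outermost node, because ast.walk
    visits nodes breadth-first and only a strictly smaller span wins.
    """
    best: Optional[ast.AST] = None
    best_span: Optional[int] = None
    for node in ast.walk(ast.parse(source)):
        start = getattr(node, "lineno", None)
        end = getattr(node, "end_lineno", None)
        if start is None or end is None:
            continue  # nodes such as Module carry no position information
        if start <= line <= end:
            span = end - start
            if best_span is None or span < best_span:
                best, best_span = node, span
    return best


# Assumed example input: a two-line function and a comment on its second line.
SOURCE = "def add(a, b):\n    return a + b\n"
comment = ReviewComment(path="calc.py", line=2, body="Consider overflow handling.")
node = innermost_node_at(SOURCE, comment.line)
print(type(node).__name__)  # prints "Return"
```

Line-based matching of this kind is only as precise as the position data attached to each comment; a dataset that records the exact file revision alongside the line makes the lookup reproducible.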


Published in
ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2021
1690 pages
ISBN: 9781450385626
DOI: 10.1145/3468264
General Chairs: Diomidis Spinellis (Athens University of Economics and Business, Greece), Georgios Gousios (Facebook, Netherlands / Delft University of Technology, Netherlands)
Program Chairs: Marsha Chechik (University of Toronto, Canada), Massimiliano Di Penta (University of Sannio, Italy)
Copyright © 2021 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
code review
datasets
deep learning
source code
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%