DOI: 10.1145/3468264.3473110
research-article
Open Access

Exploit those code reviews! bigger data for deeper learning

Published: 18 August 2021

ABSTRACT

Modern code review (MCR) processes are prevalent in most organizations that develop software due to benefits in quality assurance and knowledge transfer. With the rise of collaborative software development platforms like GitHub and Bitbucket, today, millions of projects share not only their code but also their review data. Although researchers have tried to exploit this data for more than a decade, most of that knowledge remains a buried treasure. A crucial catalyst for many advances in deep learning, however, is the accessibility of large-scale standard datasets for different learning tasks. This paper presents the ETCR (Exploit Those Code Reviews!) infrastructure for mining MCR datasets from any GitHub project practicing pull-request-based development. We demonstrate its effectiveness with ETCR-Elasticsearch, a dataset of >231k review comments for >47k Java file revisions in >40k pull-requests from the Elasticsearch project. ETCR is designed with the challenge of deep learning in mind. Compared to previous datasets, ETCR datasets include all information for linking review comments to nodes in the respective program’s Abstract Syntax Tree.
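The idea of linking a review comment to an Abstract Syntax Tree node can be illustrated with a minimal sketch. This is not the ETCR implementation: ETCR targets Java sources, whereas the sketch below uses Python's standard `ast` module as a stand-in, and the function name `innermost_node_at_line` is hypothetical. It assumes a review comment is anchored to a single line number and picks the smallest node whose line span covers that line.

```python
import ast


def innermost_node_at_line(source: str, line: int):
    """Return the AST node with the smallest line span covering `line`.

    Illustrative only: a stand-in for mapping a line-anchored review
    comment to a syntax node. Returns None if no node covers the line.
    """
    tree = ast.parse(source)
    best = None
    best_span = None
    for node in ast.walk(tree):
        start = getattr(node, "lineno", None)
        end = getattr(node, "end_lineno", None)  # available since Python 3.8
        if start is None or end is None:
            continue  # nodes such as Module carry no position information
        if start <= line <= end:
            span = end - start
            # Strict comparison keeps the first (outermost) node among
            # nodes with an equally small span, e.g. the whole statement.
            if best is None or span < best_span:
                best, best_span = node, span
    return best


if __name__ == "__main__":
    src = "def add(a, b):\n    total = a + b\n    return total\n"
    node = innermost_node_at_line(src, 2)
    print(type(node).__name__)
```

With this tie-breaking rule, a comment on a single-line statement resolves to the statement node rather than to one of its sub-expressions; a real pipeline would likely also use column offsets and diff hunks to narrow the match.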


          • Published in

            ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
            August 2021
            1690 pages
            ISBN: 9781450385626
            DOI: 10.1145/3468264

            Copyright © 2021 Owner/Author

            This work is licensed under a Creative Commons Attribution International 4.0 License.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            Overall Acceptance Rate: 112 of 543 submissions, 21%
