Boa: Ultra-Large-Scale Software Repository and Source-Code Mining

Published: 02 December 2015

Abstract

In today's software-centric world, ultra-large-scale software repositories, such as SourceForge, GitHub, and Google Code, are the new library of Alexandria. They contain an enormous corpus of software and related information. Scientists and engineers alike are interested in analyzing this wealth of information. However, systematic extraction and analysis of relevant data from these repositories for testing hypotheses is hard, and best left for mining software repository (MSR) experts! Specifically, mining source code yields significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse grained, or sacrifice studying the history of the code. In this article we address mining source code: (a) at a very large scale; (b) at a fine-grained level of detail; and (c) with full history information. To address these challenges, we present domain-specific language features for source-code mining in our language and infrastructure called Boa. The goal of Boa is to ease testing MSR-related hypotheses. Our evaluation demonstrates that Boa substantially reduces programming efforts, thus lowering the barrier to entry. We also show drastic improvements in scalability.

      • Published in

        ACM Transactions on Software Engineering and Methodology, Volume 25, Issue 1
        December 2015
        339 pages
        ISSN: 1049-331X
        EISSN: 1557-7392
        DOI: 10.1145/2852270

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 December 2015
        • Accepted: 1 July 2015
        • Revised: 1 June 2014
        • Received: 1 January 2014

        Qualifiers

        • research-article
        • Research
        • Refereed
