Abstract
In today's software-centric world, ultra-large-scale software repositories, such as SourceForge, GitHub, and Google Code, are the new Library of Alexandria. They contain an enormous corpus of software and related information. Scientists and engineers alike are interested in analyzing this wealth of information. However, systematic extraction and analysis of relevant data from these repositories for testing hypotheses is hard, and best left to mining software repository (MSR) experts! Specifically, mining source code yields significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse-grained, or sacrifice studying the history of the code. In this article we address mining source code: (a) at a very large scale; (b) at a fine-grained level of detail; and (c) with full history information. To address these challenges, we present domain-specific language features for source-code mining in our language and infrastructure called Boa. The goal of Boa is to ease testing MSR-related hypotheses. Our evaluation demonstrates that Boa substantially reduces programming effort, thus lowering the barrier to entry. We also show drastic improvements in scalability.
Index Terms
- Boa: Ultra-Large-Scale Software Repository and Source-Code Mining