Abstract
Data science is increasingly important and challenging. It requires computational tools and programming environments that handle big data and difficult computations, while supporting creative, high-quality analysis. The R language and related software play a major role in computing for data science. R is featured in most programs for training in the field. R packages provide tools for a wide range of purposes and users. The description of a new technique, particularly from research in statistics, is frequently accompanied by an R package, greatly increasing the usefulness of the description.
The history of R makes clear its connection to data science. R was consciously designed to replicate in open-source software the contents of the S software. S in turn was written by data analysis researchers at Bell Labs as part of the computing environment for research in data analysis and collaborations to apply that research, rather than as a separate project to create a programming language. The features of S and the design decisions made for it need to be understood in this broader context of supporting effective data analysis (which would now be called data science). These characteristics were all transferred to R and remain central to its effectiveness. Thus, R can be viewed as based historically on a domain-specific language for the domain of data science.
Index Terms
- S, R, and data science