Automating data science

Authors:
Tijl De Bie, Ghent University, Belgium
Luc De Raedt, KU Leuven, Belgium and Örebro University, Sweden
José Hernández-Orallo, Universitat Politècnica de València, Spain
Holger H. Hoos, Leiden University, The Netherlands and University of British Columbia in Vancouver, Canada
Padhraic Smyth, University of California, Irvine
Christopher K. I. Williams, University of Edinburgh, U.K. and Alan Turing Institute, London, U.K.

Communications of the ACM, Volume 65, Issue 3, March 2022, pp. 76–87
https://doi.org/10.1145/3495256

Published: 23 February 2022

Abstract

Given the complexity of data science projects and related demand for human expertise, automation has the potential to transform the data science process.

Published in
Communications of the ACM, Volume 65, Issue 3
March 2022
102 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3522546
Editor: Andrew A. Chien
Publisher: Association for Computing Machinery, New York, NY
Copyright © 2022 ACM