Automating data science

Published: 23 February 2022

Abstract

Given the complexity of data science projects and related demand for human expertise, automation has the potential to transform the data science process.

Published in

Communications of the ACM, Volume 65, Issue 3
March 2022
102 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3522546

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States
