DOI: 10.1145/3510003.3510057 (ICSE Conference Proceedings)

The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large

Published: 05 July 2022

ABSTRACT

An increasing number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The sequence of data science stages, from acquisition, to cleaning/curation, to modeling, and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and those found in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
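To make the abstract's notion of pipeline stages concrete, the sketch below chains acquisition, cleaning/curation, modeling, and evaluation into a single pipeline. It is a minimal, illustrative example under our own assumptions (the scikit-learn Pipeline API and a bundled toy dataset), not code drawn from the study's Kaggle or GitHub subjects.

```python
# Minimal, illustrative data science pipeline (not from the paper's artifact):
# acquisition -> preparation (cleaning/scaling) -> modeling -> evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Acquisition: load a bundled dataset (stands in for data collected elsewhere).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preparation + modeling: chain cleaning/curation steps with the estimator so
# the same transformations are applied at training and prediction time.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # normalize features
    ("model", LogisticRegression(max_iter=1000)),   # train the estimator
])
pipeline.fit(X_train, y_train)

# Evaluation: score held-out data.
print("test accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))
```

In the paper's terms, such a notebook-sized script is data science in-the-small; the in-the-large projects distribute the same conceptual stages across multiple modules and services.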

