ABSTRACT
An increasing number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition, to cleaning/curation, to modeling, and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines as represented in theory differ from those found in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study that answers these questions for the state-of-the-art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
Index Terms
- The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large