The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large

Authors:
Sumon Biswas (Iowa State University), Mohammad Wardat (Iowa State University), Hridesh Rajan (Iowa State University)

ICSE '22: Proceedings of the 44th International Conference on Software Engineering, May 2022, Pages 2091–2103. https://doi.org/10.1145/3510003.3510057

Published: 05 July 2022

ABSTRACT

An increasingly large number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition to cleaning/curation to modeling and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines in practice differ from their theoretical representations? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
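
To make the notion of a pipeline as a connected sequence of stages concrete, the following is a minimal illustrative sketch, not the representation derived in the study. It assumes a strictly linear pipeline with only the three stages named in the abstract (acquisition, cleaning/curation, modeling). The function names, the toy records, and the through-the-origin least-squares model are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; each function consumes the previous
# stage's output. This linear layout is an assumption of the sketch.
Stage = Tuple[str, Callable[[Any], Any]]


def acquire(_: Any) -> List[Tuple[float, Any]]:
    # Hypothetical acquisition stage: returns hard-coded (x, y) records,
    # one of which has a missing target value.
    return [(1.0, 2.1), (2.0, None), (3.0, 6.2), (4.0, 7.9)]


def clean(rows: List[Tuple[float, Any]]) -> List[Tuple[float, float]]:
    # Cleaning/curation stage: drop records whose target is missing.
    return [(x, y) for x, y in rows if y is not None]


def fit_slope(rows: List[Tuple[float, float]]) -> float:
    # Modeling stage: least-squares slope of a line through the origin.
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


PIPELINE: List[Stage] = [
    ("acquisition", acquire),
    ("cleaning", clean),
    ("modeling", fit_slope),
]


def run_pipeline(stages: List[Stage], data: Any = None) -> Tuple[Any, List[str]]:
    # Run the stages in order and record which ones executed.
    trace = []
    for name, stage in stages:
        data = stage(data)
        trace.append(name)
    return data, trace
```

Calling `run_pipeline(PIPELINE)` returns the fitted slope together with the ordered list of stage names. Pipelines in real projects may branch, loop back to earlier stages, or contain further stages, which a plain list like this cannot express.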

Index Terms

1. Computing methodologies
  1. Machine learning
2. Software and its engineering
  1. Software creation and management

Published in
ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
General Chair: Matthew B Dwyer (University of Virginia)
Program Chairs: Daniela Damian (University of Victoria, Canada), Andreas Zeller (CISPA, Germany)
Copyright © 2022 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Evaluated & Reusable / v1.1
- Artifacts Available / v1.1
Author Tags
data science pipelines
data science processes
descriptive
predictive
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%
