Abstract
Despite the widespread adoption of computational notebooks, little is known about best practices for their use in collaborative contexts. In this paper, we fill this gap by eliciting a catalog of best practices for collaborative data science with computational notebooks. To this end, we first identify best practices through a multivocal literature review. Then, we interview professional data scientists to assess their awareness of these best practices. Finally, we assess the adoption of the best practices by analyzing 1,380 Jupyter notebooks retrieved from the Kaggle platform. Our findings reveal that experts are mostly aware of the best practices and tend to adopt them in their daily work. Nonetheless, they do not consistently follow all the recommendations because, depending on the specific context, some are deemed infeasible or counterproductive due to the lack of proper tool support. We therefore envision notebook solutions that do not force data scientists to prioritize exploration and rapid prototyping over writing quality code.