Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions

Authors:
Carlo A. Furia, Software Institute, USI Università della Svizzera italiana, Switzerland (ORCID 0000-0003-1040-3201)
Richard Torkar, University of Gothenburg, Sweden, and Stellenbosch Institute for Advanced Study (STIAS), South Africa (ORCID 0000-0002-0118-8143)
Robert Feldt, Chalmers and University of Gothenburg, Sweden (ORCID 0000-0002-5179-4205)
ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 1, Article No. 13, pp. 1–35. https://doi.org/10.1145/3611667

Published: 24 November 2023

Abstract

There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations—instead of potentially more insightful and robust causal relations.

To support analyzing purely observational data for causal relations and to assess any differences between purely predictive and causal models of the same data, this article discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses and then use the causal information to guide the construction of a statistical model that captures genuine causal relations—such that correlation does imply causation.

We apply these ideas to public data about programmer performance in Code Jam, a large worldwide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant’s performance in the contest. While the overall effect associated with programming languages is weak compared to other variables, regardless of whether we consider correlational or causal links, we found considerable differences between a purely associational and a causal analysis of the very same data.

The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than with just purely predictive techniques—where genuine causal effects may be confounded.
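The gap between an associational and a causal analysis of the same data can be illustrated with a small simulation. The sketch below is not the article's model and does not use Code Jam data: it assumes a hypothetical causal graph in which an unobserved-by-default variable `skill` influences both the choice of `language` and the `score`, and all names, effect sizes, and the linear-regression estimator are assumptions chosen for illustration. Under that assumed graph, regressing the score on language alone mixes the language effect with the effect of skill, whereas adding the confounder to the regression (a backdoor adjustment) recovers something close to the assumed causal effect.

```python
import numpy as np

def simulate(n=20000, seed=0):
    """Draw synthetic data from an assumed DAG:
    skill -> language, skill -> score, language -> score."""
    rng = np.random.default_rng(seed)
    skill = rng.normal(size=n)                       # confounder (hypothetical)
    # Assumption: more skilled participants are more likely to pick language 1.
    p_lang = 1.0 / (1.0 + np.exp(-2.0 * skill))
    language = (rng.random(n) < p_lang).astype(float)
    true_effect = 0.1                                # assumed causal effect of language
    score = true_effect * language + 1.0 * skill + rng.normal(scale=0.5, size=n)
    return skill, language, score, true_effect

def ols_coefficient(y, *columns):
    """Least-squares coefficient of the first predictor, fitted with an intercept."""
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

skill, language, score, true_effect = simulate()

# Purely associational model: score ~ language (confounded by skill).
naive = ols_coefficient(score, language)

# Causally informed model: score ~ language + skill (adjusts for the confounder).
adjusted = ols_coefficient(score, language, skill)

print(f"true={true_effect:.2f} naive={naive:.2f} adjusted={adjusted:.2f}")
```

In this synthetic setting the naive coefficient is far larger than the assumed effect, because it also absorbs the skill difference between the two language groups. The adjustment is only valid because the simulated graph is known; with real observational data, the choice of adjustment set depends on causal hypotheses that must be stated and, where possible, checked.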

Published in
ACM Transactions on Software Engineering and Methodology Volume 33, Issue 1
January 2024
933 pages
ISSN:1049-331X
EISSN:1557-7392
DOI:10.1145/3613536
Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 November 2023
- Online AM: 19 August 2023
- Accepted: 7 July 2023
- Revised: 27 June 2023
- Received: 22 January 2023

Author Tags
Causality analysis
statistical analysis
empirical software engineering
programming contests