Abstract
There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations—instead of potentially more insightful and robust causal relations.
To support analyzing purely observational data for causal relations and to assess any differences between purely predictive and causal models of the same data, this article discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses and then use the causal information to guide the construction of a statistical model that captures genuine causal relations—such that correlation does imply causation.
We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant’s performance in the contest. While the overall effect associated with programming languages is weak compared to other variables—regardless of whether we consider correlational or causal links—we found considerable differences between a purely associational and a causal analysis of the very same data.
The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than purely predictive techniques alone, in which genuine causal effects may be confounded.
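To make the contrast between an associational and a causal analysis concrete, the following is a minimal sketch on synthetic data, not the article's model or the Code Jam dataset. It assumes a hypothetical three-variable causal graph in which an unobserved-in-practice `skill` variable influences both the choice of a language (`uses_lang`) and the contest `score`; all names, effect sizes, and the linear form are illustrative assumptions. A regression that ignores `skill` reports a large association for the language, while adjusting for the confounder recovers the small effect that was built into the simulation.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Draw synthetic data from an assumed graph: skill -> uses_lang, skill -> score, uses_lang -> score."""
    rng = np.random.default_rng(seed)
    skill = rng.normal(size=n)  # hypothetical confounder
    # More skilled participants are more likely to pick the language.
    uses_lang = (skill + rng.normal(size=n) > 0).astype(float)
    # Assumed true causal effect of the language on score: 0.1; effect of skill: 1.0.
    score = 0.1 * uses_lang + 1.0 * skill + rng.normal(size=n)
    return skill, uses_lang, score


def ols(y, *predictors):
    """Ordinary least squares; returns coefficients as [intercept, predictor_1, ...]."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


skill, uses_lang, score = simulate()

# Purely associational model: score ~ uses_lang (the confounder is left out).
naive_effect = ols(score, uses_lang)[1]

# Causally informed model: score ~ uses_lang + skill (adjusts for the confounder).
adjusted_effect = ols(score, uses_lang, skill)[1]

print(f"naive estimate:    {naive_effect:.2f}")
print(f"adjusted estimate: {adjusted_effect:.2f}")
```

In this toy setting the adjustment set is obvious because the graph was chosen by hand; with real observational data, the graph itself is a hypothesis that has to be stated and, where possible, checked against the data.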
Index Terms
- Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions