research-article

Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions

Published: 24 November 2023

Abstract

There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations—instead of potentially more insightful and robust causal relations.

To support analyzing purely observational data for causal relations and to assess any differences between purely predictive and causal models of the same data, this article discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses and then use the causal information to guide the construction of a statistical model that captures genuine causal relations—such that correlation does imply causation.

We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant’s performance in the contest. While the overall effect associated with programming languages is weak compared to other variables—regardless of whether we consider correlational or causal links—we found considerable differences between a purely associational and a causal analysis of the very same data.

The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than with just purely predictive techniques—where genuine causal effects may be confounded.
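
To illustrate how confounding can separate an associational estimate from a causal one, here is a minimal simulation sketch. It does not use the article's data, model, or causal graph: the variables `experienced`, `uses_lang`, and `score`, the effect sizes, and the function names are all hypothetical, chosen only to show the mechanism. By construction the true effect of the language on the score is 1.0, while experience (the confounder) drives both language choice and score. The adjusted estimate applies the standard backdoor adjustment by stratifying on the confounder.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Generate synthetic (experienced, uses_lang, score) rows.

    Hypothetical causal graph: experienced -> uses_lang,
    experienced -> score, uses_lang -> score (true effect 1.0).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        experienced = rng.random() < 0.5           # confounder Z
        p_lang = 0.8 if experienced else 0.2       # Z influences the treatment
        uses_lang = rng.random() < p_lang          # treatment X
        score = 10.0 * int(experienced) + 1.0 * int(uses_lang) + rng.gauss(0, 1)
        rows.append((experienced, uses_lang, score))
    return rows


def naive_effect(rows):
    """Associational estimate: difference in mean score between groups."""
    treated = [y for _, x, y in rows if x]
    control = [y for _, x, y in rows if not x]
    return mean(treated) - mean(control)


def adjusted_effect(rows):
    """Backdoor adjustment: average the within-stratum differences,
    weighted by the share of rows in each stratum of the confounder."""
    total = 0.0
    for z_val in (False, True):
        stratum = [(x, y) for z, x, y in rows if z == z_val]
        treated = [y for x, y in stratum if x]
        control = [y for x, y in stratum if not x]
        weight = len(stratum) / len(rows)
        total += weight * (mean(treated) - mean(control))
    return total


if __name__ == "__main__":
    rows = simulate()
    print("naive estimate:   ", round(naive_effect(rows), 2))
    print("adjusted estimate:", round(adjusted_effect(rows), 2))
```

In this toy setup the naive difference in means is far larger than the true effect because experienced participants both prefer the language and score higher, whereas the stratified estimate stays close to 1.0. Which variables to adjust for depends on the assumed causal graph. Stratifying on the wrong variable, such as a collider, can introduce bias rather than remove it.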

Published in

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 1
January 2024
933 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3613536
Editor: Mauro Pezzè

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 24 November 2023
• Online AM: 19 August 2023
• Accepted: 7 July 2023
• Revised: 27 June 2023
• Received: 22 January 2023
