
Mastering Variation in Human Studies: The Role of Aggregation

Published: 31 December 2020 in ACM Transactions on Software Engineering and Methodology, Volume 30, Issue 1

Abstract

The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of available analysis methods: they rarely combine the analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This can hide interesting insights into tasks and participants and can lead to false conclusions by overrating or underrating the performance of single tasks or single participants. We show that studying multiple levels of aggregation of individual tasks and participants gives researchers both insights from individual variation and generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when human participants are involved, we reanalyze 12 published empirical studies, thereby changing their conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that, as more data are aggregated, the results become more accurate. The proposed technique can help researchers find a sweet spot in the tradeoff between the cost of a study and the reliability of its conclusions.
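To make the different aggregation levels concrete, the following is a minimal sketch, not code from the article: it simulates a long-format table of response times and computes the levels of analysis the abstract refers to, from individual observations up to a single grand mean. The column names (participant, task, rt) and the simulated data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative assumption: 20 participants each solve 8 tasks,
# yielding one response time (rt) per participant-task pair.
rng = np.random.default_rng(seed=1)
n_participants, n_tasks = 20, 8

df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_tasks),
    "task": np.tile(np.arange(n_tasks), n_participants),
})
df["rt"] = rng.lognormal(mean=3.0, sigma=0.4, size=len(df))

# Level 0 -- no aggregation: every observation, all variation visible.
individual = df

# Level 1 -- aggregate over participants: one mean per task
# (exposes task differences, hides participant variation).
per_task = df.groupby("task")["rt"].mean()

# Level 2 -- aggregate over tasks: one mean per participant
# (exposes participant differences, hides task variation).
per_participant = df.groupby("participant")["rt"].mean()

# Level 3 -- full aggregation: a single grand mean, no variation left.
grand_mean = df["rt"].mean()

print(per_task, per_participant, grand_mean, sep="\n\n")
```

Comparing the conclusions each level supports (for example, whether a task that looks slow in per_task is slow for all participants or only for a few) is the kind of multi-level analysis the article advocates.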

