Abstract
The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of available analysis methods: they rarely combine the analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This may hide interesting insights into individual tasks and participants and may lead to false conclusions that overrate or underrate the performance of single tasks or participants. We show that studying multiple levels of aggregation over individual tasks and participants gives researchers both insights from individual variation and generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when including human participants, we reanalyze 12 published empirical studies, thereby changing their conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that the more data are aggregated, the more accurate the results become. The proposed technique can help researchers find a sweet spot in the tradeoff between the cost of a study and the reliability of its conclusions.
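To make the idea of aggregation levels concrete, the sketch below (a minimal illustration, not the authors' analysis code; the data frame and its column names are hypothetical) computes the same correctness measure at four aggregation levels: per observation, per participant, per task, and fully aggregated.

```python
# A minimal sketch of multi-level aggregation in a human study.
# The data and column names (participant, task, correct) are hypothetical.
import pandas as pd

# One row per participant/task pair, with a binary correctness score.
trials = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "task":        ["t1", "t2", "t1", "t2", "t1", "t2"],
    "correct":     [1, 0, 1, 1, 0, 1],
})

# Level 0: no aggregation -- inspect individual participant/task observations.
print(trials)

# Level 1a: aggregate over tasks -- one score per participant.
print(trials.groupby("participant")["correct"].mean())

# Level 1b: aggregate over participants -- one score per task.
print(trials.groupby("task")["correct"].mean())

# Level 2: full aggregation -- a single grand mean over all observations.
print(trials["correct"].mean())
```

Comparing the per-participant and per-task scores against the grand mean shows whether a single strong or weak participant, or an unusually easy or hard task, is driving the aggregated result.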