Abstract
The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of available analysis methods: they rarely combine the analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This may hide interesting insights into individual tasks and participants and may lead to false conclusions that overrate or underrate the performance of single tasks or participants. We show that studying multiple levels of aggregation over individual tasks and participants gives researchers both insights from individual variation and generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when including human participants, we reanalyze 12 published empirical studies, thereby changing their conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that the more data are aggregated, the more accurate the results become. The proposed technique can help researchers find a sweet spot in the tradeoff between the cost of a study and the reliability of its conclusions.
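To make the idea of aggregation levels concrete, the sketch below (a minimal illustration, not the authors' analysis code; the data frame and its column names are hypothetical) computes the same correctness measure at four aggregation levels: per observation, per participant, per task, and fully aggregated.

```python
# A minimal sketch of multi-level aggregation in a human study.
# The data and column names (participant, task, correct) are hypothetical.
import pandas as pd

# One row per participant/task pair, with a binary correctness score.
trials = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "task":        ["t1", "t2", "t1", "t2", "t1", "t2"],
    "correct":     [1, 0, 1, 1, 0, 1],
})

# Level 0: no aggregation -- inspect individual participant/task observations.
print(trials)

# Level 1a: aggregate over tasks -- one score per participant.
print(trials.groupby("participant")["correct"].mean())

# Level 1b: aggregate over participants -- one score per task.
print(trials.groupby("task")["correct"].mean())

# Level 2: full aggregation -- a single grand mean over all observations.
print(trials["correct"].mean())
```

Comparing the per-participant and per-task scores against the grand mean shows whether a single strong or weak participant, or an unusually easy or hard task, is driving the aggregated result.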