Editorial Notes
A corrigendum was issued for this paper on December 7, 2020. You can download the corrigendum from the supplemental material section of this citation page.
ABSTRACT
Understanding how data workers interact with data and various pieces of information (e.g., code snippet examples) is key to designing systems that can better support them in exploring a given dataset. To date, however, there is a paucity of research studying the information-seeking patterns and strategies that data workers adopt as they carry out data curation activities. In this work, we aim to understand how data workers behave when discovering data quality issues, and how these behaviors relate to their performance. Specifically, we investigate how data workers use information resources and tools to support their task completion. To this end, we collect a multi-modal dataset through a data-driven experiment that uses eye-tracking technology with a purpose-designed platform built on top of iPython Notebook. The collected data reveals that: (i) searching in external resources is a prevalent action that can be leveraged to achieve better performance; (ii) 'copy-paste-modify' is a typical strategy for writing code to complete tasks; (iii) providing sample code within the system could help data workers get started with their task; and (iv) surfacing underlying data is an effective way to support exploration. By investigating the behaviors prior to each search action, we also find that the most common triggers for external search actions are the need for assistance in writing or debugging code and the need to find relevant code to reuse. Our findings provide insights into patterns of interaction with various system components and information resources during data curation tasks. This has implications for the design of domain-specific IR systems for data workers, such as code-base search.
Supplemental Material
Corrigendum to "On Understanding Data Worker Interaction Behaviors" by Han et al., Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20).