research-article

On Understanding Data Worker Interaction Behaviors

Authors:
Lei Han (The University of Queensland, Brisbane, Australia)
Tianwa Chen (The University of Queensland, Brisbane, Australia)
Gianluca Demartini (The University of Queensland, Brisbane, Australia)
Marta Indulska (The University of Queensland, Brisbane, Australia)
Shazia Sadiq (The University of Queensland, Brisbane, Australia)

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2020, Pages 269–278. https://doi.org/10.1145/3397271.3401059

Published: 25 July 2020

Editorial Notes

A corrigendum was issued for this paper on December 7, 2020. You can download the corrigendum from the supplemental material section of this citation page.

ABSTRACT

Understanding how data workers interact with data and various pieces of information (e.g., code snippet examples) is key to designing systems that can better support them in exploring a given dataset. To date, however, there is a paucity of research studying the information-seeking patterns and strategies adopted by data workers as they carry out data curation activities. In this work, we aim to understand the behaviors of data workers in discovering data quality issues, and how these behavioral observations relate to their performance. Specifically, we investigate how data workers use information resources and tools to support their task completion. To this end, we collect a multi-modal dataset through a data-driven experiment that relies on the use of eye-tracking technology with a purpose-designed platform built on top of iPython Notebook. The collected data reveals that: (i) searching in external resources is a prevalent action that can be leveraged to achieve better performance; (ii) 'copy-paste-modify' is a typical strategy for writing code to complete tasks; (iii) providing sample code within the system could help data workers get started with their task; and (iv) surfacing underlying data is an effective way to support exploration. By investigating the behaviors prior to each search action, we also find that the most common reasons that trigger external search actions are the need to seek assistance in writing or debugging code and to search for relevant code to reuse. Our findings provide insights into patterns of interaction with various system components and information resources when performing data curation tasks. This has implications for the design of domain-specific IR systems for data workers, such as code-base search.

Supplemental Material

3397271.3401059.mp4 (mp4, 24.3 MB)

Available for Download

3401059-corrigendum.pdf (pdf, 463.7 KB)

Corrigendum to "On Understanding Data Worker Interaction Behaviors" by Han et al., Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20).

Index Terms

On Understanding Data Worker Interaction Behaviors
1. Human-centered computing
  1. Human computer interaction (HCI)
  2. Interaction design
    1. Empirical studies in interaction design
    2. Interaction design process and methods
      1. User centered design
      2. User interface design
2. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
  2. World Wide Web
    1. Web mining
      1. Web log analysis

Published in
SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
General Chairs: Jimmy Huang (York University, Canada), Yi Chang (Jilin University, China), Xueqi Cheng (Chinese Academy of Sciences, China)
Program Chairs: Jaap Kamps (University of Amsterdam, Netherlands), Vanessa Murdock (Amazon, U.S.A.), Ji-Rong Wen (Renmin University of China, China), Yiqun Liu (Tsinghua University, China)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data curation
interaction behavior
search pattern
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%