research-article

Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency

Authors:
Shaochen Yu, The University of Queensland, Australia (ORCID: 0000-0002-4526-1525)
Lei Han, The University of Queensland, Australia (ORCID: 0000-0002-7777-3592)
Marta Indulska, The University of Queensland, Australia (ORCID: 0000-0002-2156-4097)
Shazia Sadiq, The University of Queensland, Australia (ORCID: 0000-0001-6739-4145)
Gianluca Demartini, The University of Queensland, Australia (ORCID: 0000-0002-7311-3693)

WWW '23: Proceedings of the ACM Web Conference 2023, April 2023, Pages 3859–3867, https://doi.org/10.1145/3543507.3583515

Published: 30 April 2023

ABSTRACT

Format inconsistency is one of the most frequent data quality issues encountered during data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches that rely on human input typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, namely “Data-Scanner-4C”, which leverages crowdsourcing to address syntactic format inconsistencies in a single column effectively. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. Then, we propose and use a novel rule-based learning algorithm to infer the regular expressions that propagate formats from the created examples to the entire column. Our system integrates crowdsourcing and algorithmic format extraction techniques in a single workflow. Human experts are no longer required to write regular expressions, which reduces both the time required and the opportunity for error. We conducted experiments on both synthetic and real-world datasets, and our results show that the proposed approach is applicable and effective across data types and formats.
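
To make the idea of propagating formats from a few examples to a whole column more concrete, the following Python sketch generalises each example into a regular expression built from runs of character classes and then splits a column into matching and non-matching values. It is only an illustration under simple assumptions and is not the rule-based learning algorithm of Data-Scanner-4C, whose details the abstract does not give. The function names (char_class, example_to_pattern, infer_regex, split_column) and the sample values are hypothetical.

```python
import re
from itertools import groupby
from typing import Iterable, List, Tuple


def char_class(ch: str) -> str:
    """Map one character to a coarse regex class (an illustrative choice)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    if ch.isspace():
        return r"\s"
    # Anything else (separators, punctuation) is kept as an escaped literal.
    return re.escape(ch)


def example_to_pattern(example: str) -> str:
    """Generalise one example into runs of classes, e.g. two digits -> \\d{2}."""
    parts = []
    for cls, run in groupby(char_class(ch) for ch in example):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


def infer_regex(examples: Iterable[str]) -> "re.Pattern[str]":
    """Combine the patterns of all examples into one alternation."""
    patterns = sorted({example_to_pattern(e) for e in examples})
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def split_column(
    column: Iterable[str], regex: "re.Pattern[str]"
) -> Tuple[List[str], List[str]]:
    """Return (values that fully match the regex, values that do not)."""
    matched, unmatched = [], []
    for value in column:
        (matched if regex.fullmatch(value) else unmatched).append(value)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical examples, standing in for what crowd workers might select.
    regex = infer_regex(["12/03/2021", "2021-03-12"])
    column = ["01/01/2020", "1999-12-31", "March 3, 2020", "3/4/21"]
    matched, unmatched = split_column(column, regex)
    print("matched:", matched)
    print("unmatched:", unmatched)
```

In this sketch the two values that share the shape of an example are matched, while "March 3, 2020" and "3/4/21" are left unmatched and could be sent back to a human for another round of example creation. Fixed run lengths make the patterns strict; a real system would need a principled way to relax them.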

Supplemental Material

www 3min.mp4: Three-minute presentation (mp4, 7.3 MB)

Available for Download

www 23_latest_ppt_3mins.pptx: Presentation slides (pptx, 1.7 MB)

Index Terms

Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency
1. Information systems
  1. Data management systems

Published in
WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN: 9781450394161
DOI: 10.1145/3543507
Editors: Ying Ding, Jie Tang, Juan Sequeda, Lora Aroyo, Carlos Castillo, Geert-Jan Houben
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available / v1.1
Author Tags
Crowdsourcing
Format Inconsistency
Human-in-the-loop
Regular Expression
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%