ABSTRACT
Format inconsistency is one of the most frequent data quality issues encountered during data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches that rely on human input typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, namely “Data-Scanner-4C”, which leverages crowdsourcing to address syntactic format inconsistencies in a single column effectively. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. We then propose and use a novel rule-based learning algorithm to infer regular expressions that propagate formats from the created examples to the entire column. Our system integrates crowdsourcing and algorithmic format extraction techniques in a single workflow. Human experts are no longer required to write regular expressions, which reduces both the time needed and the opportunity for error. We conducted experiments on both synthetic and real-world datasets, and our results show that the proposed approach is applicable and effective across data types and formats.
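To make the idea of propagating a format from examples to a whole column concrete, the following is a minimal sketch and not the paper's rule-based learning algorithm, which the abstract does not specify. It assumes a deliberately simple generalisation rule (ASCII digits and letters become character classes, every other character is kept literal, and run lengths are preserved), and all function names (`char_class`, `example_to_pattern`, `infer_format`, `split_column`) are hypothetical.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a regex token for its coarse class (assumed rule)."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # separators such as '-' or '/' stay literal


def example_to_pattern(example: str) -> str:
    """Generalise one example into a regex that keeps class order and run lengths."""
    parts = []
    for token, run in groupby(char_class(c) for c in example):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def infer_format(examples):
    """Combine the patterns of all examples into one alternation."""
    patterns = sorted({example_to_pattern(e) for e in examples})
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def split_column(column, examples):
    """Partition a column into values that match the inferred format and values that do not."""
    fmt = infer_format(examples)
    conforming = [v for v in column if fmt.fullmatch(v)]
    inconsistent = [v for v in column if not fmt.fullmatch(v)]
    return conforming, inconsistent


# Hypothetical column and a single worker-provided example of the desired format.
column = ["2021-03-07", "2020-11-30", "07/03/2021", "March 7, 2021", "1999-01-01"]
examples = ["2021-03-07"]
conforming, inconsistent = split_column(column, examples)
```

In this sketch the examples stand in for what crowd workers would produce in the “data selection” and “result validation” tasks; values that fail to match the inferred expression are the candidates for format inconsistency.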
Index Terms
- Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency