DOI: 10.1145/3543507.3583515
research-article
Artifacts Available / v1.1

Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency

Published: 30 April 2023

ABSTRACT

Format inconsistency is one of the most frequently encountered data quality issues in data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches with human input typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, namely “Data-Scanner-4C”, which leverages crowdsourcing to address syntactic format inconsistencies in a single column effectively. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. Then, we propose and use a novel rule-based learning algorithm to infer the regular expressions that propagate formats from the created examples to the entire column. Our system integrates crowdsourcing and algorithmic format extraction techniques in a single workflow. Human experts are no longer required to write regular expressions, which reduces both the time needed and the opportunity for error. We conducted experiments on both synthetic and real-world datasets, and our results show that the proposed approach is applicable and effective across data types and formats.
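
To make the idea of propagating a format from a few examples to a whole column concrete, the following is a minimal Python sketch. It is not the paper's rule-based learning algorithm, which the abstract does not specify; it swaps in a simple character-class generalisation as an assumption. The function names (`char_class`, `signature`, `infer_regex`, `split_column`) and the sample values are hypothetical and chosen only for illustration.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a regex token (assumed generalisation rule)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    # Punctuation, spaces and non-ASCII characters are kept as literals.
    return re.escape(ch)


def signature(value: str) -> list[tuple[str, int]]:
    """Collapse a string into (token, run_length) pairs."""
    return [(tok, len(list(run))) for tok, run in groupby(value, key=char_class)]


def infer_regex(examples: list[str]) -> str:
    """Infer one regex from examples that share the same token sequence."""
    sigs = [signature(e) for e in examples]
    tokens = [tok for tok, _ in sigs[0]]
    if any([tok for tok, _ in s] != tokens for s in sigs):
        raise ValueError("examples do not share a single format")
    parts = []
    for i, tok in enumerate(tokens):
        lengths = [s[i][1] for s in sigs]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            quantifier = ""
        elif lo == hi:
            quantifier = f"{{{lo}}}"
        else:
            quantifier = f"{{{lo},{hi}}}"
        parts.append(tok + quantifier)
    return "".join(parts)


def split_column(column: list[str], pattern: str) -> tuple[list[str], list[str]]:
    """Separate values that fully match the pattern from those that do not."""
    rx = re.compile(pattern)
    matching = [v for v in column if rx.fullmatch(v)]
    inconsistent = [v for v in column if not rx.fullmatch(v)]
    return matching, inconsistent


# Two worker-provided examples of the same date format (hypothetical data).
pattern = infer_regex(["2023-04-30", "1999-12-01"])
matching, inconsistent = split_column(
    ["2023-04-30", "30/04/2023", "2001-1-5"], pattern
)
```

In this sketch the examples play the role of the crowd-created inputs, and `split_column` shows how an inferred expression could flag values in the column that deviate from the example format. A real system would also need to handle several coexisting formats and noisy examples, which this sketch does not attempt.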

Supplemental Material

www 3min.mp4 (mp4, 7.3 MB): Three-minute presentation

    • Published in

      WWW '23: Proceedings of the ACM Web Conference 2023
      April 2023
      4293 pages
      ISBN:9781450394161
      DOI:10.1145/3543507

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
    • Article Metrics

      • Downloads (Last 12 months): 72
      • Downloads (Last 6 weeks): 3
