
Adding Missing Words to Regular Expressions

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10938)

Abstract

Regular expressions (regexes) are patterns that are used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it also matches a set of missing (previously unmatched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts appropriately. We show that this method can not only improve the precision and recall of the regex, but also generate much shorter regexes than baselines and competitors on various datasets.
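To make the idea of "adding disjunctions for the unmatched parts" concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes the regex is already given as a list of self-contained fragments (so that concatenating them is safe), handles a single missing word, and uses a brute-force search instead of a proper approximate matching procedure. The function name `add_missing_word` and the phone-number example are hypothetical and chosen only for illustration.

```python
import re


def add_missing_word(parts, word):
    """Generalize the regex "".join(parts) so that it also matches `word`.

    Illustrative sketch only (an assumption, not the paper's method):
    keep as many leading and trailing fragments as still match a prefix
    and a suffix of `word`, then put the fragments in between into a
    disjunction with the unmatched middle of the word.
    """
    n = len(parts)
    regex = "".join(parts)
    if re.fullmatch(regex, word):
        return regex  # the word is already matched, nothing to repair

    best = None  # (key, repaired_regex); smaller key is better
    for i in range(n + 1):            # parts[:i] are kept as prefix
        for j in range(n - i + 1):    # parts[n - j:] are kept as suffix
            prefix = "".join(parts[:i])
            middle = "".join(parts[i:n - j])
            suffix = "".join(parts[n - j:])
            for a in range(len(word) + 1):
                if not re.fullmatch(prefix, word[:a]):
                    continue
                for b in range(a, len(word) + 1):
                    if not re.fullmatch(suffix, word[b:]):
                        continue
                    gap = word[a:b]  # part of the word left unmatched
                    if middle and gap:
                        alt = f"(?:{middle}|{re.escape(gap)})"
                    elif middle:
                        alt = f"(?:{middle})?"   # word skips these fragments
                    elif gap:
                        alt = f"(?:{re.escape(gap)})?"  # word has extra text
                    else:
                        continue
                    # Prefer keeping more fragments, then a shorter gap.
                    key = (-(i + j), len(gap))
                    if best is None or key < best[0]:
                        best = (key, prefix + alt + suffix)
    return best[1]


# Hypothetical example: a phone-number regex that misses "555.1234".
parts = [r"\d{3}", "-", r"\d{4}"]
repaired = add_missing_word(parts, "555.1234")
print(repaired)
```

In this example the digit groups are kept and only the separator is generalized, so the repaired regex still accepts `555-1234`, now accepts `555.1234`, and continues to reject `555/1234`. A naive baseline would instead append the whole word as a top-level alternative, which grows the regex with every missing word.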


Notes

  1. http://dbgroup.eecs.umich.edu/regexLearning/

  2. http://www.cs.cmu.edu/~einat/datasets.html


Acknowledgments

This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).

Author information

Corresponding author

Correspondence to Thomas Rebele.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Rebele, T., Tzompanaki, K., Suchanek, F.M. (2018). Adding Missing Words to Regular Expressions. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_6


  • DOI: https://doi.org/10.1007/978-3-319-93037-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93036-7

  • Online ISBN: 978-3-319-93037-4

  • eBook Packages: Computer Science; Computer Science (R0)
