Adding Missing Words to Regular Expressions

Rebele, Thomas; Tzompanaki, Katerina; Suchanek, Fabian M.

doi:10.1007/978-3-319-93037-4_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10938)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Regular expressions (regexes) are patterns that are used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it also matches a set of missing (previously unmatched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts where appropriate. We show that this method can not only improve the precision and recall of the regex, but also generate much shorter regexes than baselines and competitors on various datasets.
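To make the idea of "adding disjunctions for the unmatched parts" concrete, the following is a minimal Python sketch. It is not the algorithm of the paper. It assumes, purely for illustration, that the regex is given as a list of concatenated parts, and it uses a simple exhaustive search over prefix/suffix splits in place of a real approximate matching procedure. The function name `add_missing_word` and the group name `gap` are hypothetical.

```python
import re


def add_missing_word(parts, word):
    """Generalize a concatenation of regex parts so that it also matches `word`.

    Toy illustration only: keep as many leading and trailing parts as
    possible, and put a disjunction around the parts in between.
    Assumes no part defines a group named `gap`.
    """
    n = len(parts)
    wrapped = [f"(?:{p})" for p in parts]
    if re.fullmatch("".join(wrapped), word):
        return "".join(wrapped)  # nothing is missing
    # Prefer splits that keep as many original parts as possible.
    for kept in range(n, -1, -1):
        for lead in range(kept, -1, -1):
            trail = kept - lead
            head = "".join(wrapped[:lead])
            tail = "".join(wrapped[n - trail:])
            m = re.fullmatch(f"{head}(?P<gap>.*?){tail}", word, re.DOTALL)
            if m:
                # Parts not covered by head/tail become one alternative,
                # the unmatched piece of the word becomes the other.
                old = "".join(wrapped[lead:n - trail])
                return f"{head}(?:{old}|{re.escape(m.group('gap'))}){tail}"


# Hypothetical example: a number range written with "-" should also accept "/".
generalized = add_missing_word([r"\d+", "-", r"\d+"], "12/34")
print(generalized)
print(bool(re.fullmatch(generalized, "12/34")))
```

In the worst case, when no part can be kept, the sketch falls back to a disjunction of the whole original regex and the whole word, which corresponds to the naive baseline of appending each missing word as a top-level alternative.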

Acknowledgments

This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).

Author information

Authors and Affiliations

Télécom ParisTech, Paris, France
Thomas Rebele & Fabian M. Suchanek
ETIS lab/ENSEA/Cergy-Pontoise University/CNRS, Cergy-Pontoise, France
Katerina Tzompanaki

Corresponding author

Correspondence to Thomas Rebele.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

About this paper

Cite this paper

Rebele, T., Tzompanaki, K., Suchanek, F.M. (2018). Adding Missing Words to Regular Expressions. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_6

DOI: https://doi.org/10.1007/978-3-319-93037-4_6
Published: 20 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4
eBook Packages: Computer Science, Computer Science (R0)
