Concept Recognition with Convolutional Neural Networks to Optimize Keyphrase Extraction

Waldis, Andreas; Mazzola, Luca; Kaufmann, Michael

doi:10.1007/978-3-030-26636-3_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 862)

Included in the following conference series:

International Conference on Data Management Technologies and Applications


Abstract

For knowledge management purposes, it would be useful to automatically classify and tag documents based on their content. Keyphrase extraction is one way of achieving this automatically, using statistical or semantic methods. Whereas corpus-index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. Document-based heuristics can solve this issue, but often result in keyphrases that are not concepts. To increase concept precision, that is, the percentage of extracted keyphrases that represent actual concepts, we contribute a method to filter keyphrases based on a pre-trained convolutional neural network (CNN). We tested CNNs containing vertical and horizontal filters, trained on a set of labeled examples, to decide whether an n-gram (i.e., a consecutive sequence of n words) is a concept or not. The classification training signal is derived from the Wikipedia corpus, assuming that an n-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that an n-gram represents a concept. Multiple configurations of vertical and horizontal filters are analyzed and optimised through a hyperparameter search. The results demonstrated a concept precision for extracted keyphrases of between 60 and 80% on average. Consequently, by applying a CNN-based concept recognition filter, the concept precision of keyphrase extraction was significantly improved. For an optimal parameter configuration with an average of five extracted keyphrases per document, the concept precision could be increased from 0.65 to 0.8, meaning that on average at least four out of five keyphrases extracted by our algorithm were actual concepts verified by Wikipedia titles.
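The concept precision measure and the filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lower-casing normalisation, the 0.5 threshold, the toy title set, and the stand-in scoring function `toy_probability` (used here in place of the trained CNN) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def concept_precision(keyphrases: Iterable[str], wiki_titles: Set[str]) -> float:
    """Fraction of keyphrases that match a known title (assumed lower-cased)."""
    phrases = [p.strip().lower() for p in keyphrases]
    if not phrases:
        return 0.0
    hits = sum(1 for p in phrases if p in wiki_titles)
    return hits / len(phrases)


def filter_keyphrases(
    keyphrases: Iterable[str],
    concept_probability: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Keep only keyphrases whose estimated concept probability passes the threshold."""
    return [p for p in keyphrases if concept_probability(p) >= threshold]


# Toy data: a hypothetical set of lower-cased Wikipedia page titles.
titles = {"neural network", "word embedding", "keyphrase extraction", "wikipedia"}

# Five extracted keyphrases, one of which is not a concept.
extracted = [
    "neural network",
    "word embedding",
    "keyphrase extraction",
    "wikipedia",
    "based on their",
]


def toy_probability(phrase: str) -> float:
    """Hypothetical stand-in for the CNN output: 1.0 for known titles, else 0.0."""
    return 1.0 if phrase.strip().lower() in titles else 0.0


precision_before = concept_precision(extracted, titles)
filtered = filter_keyphrases(extracted, toy_probability)
precision_after = concept_precision(filtered, titles)

print(precision_before, filtered, precision_after)
```

With four of the five keyphrases matching a title, the unfiltered list corresponds to the "four out of five" case mentioned in the abstract. In a real setting the scoring function would be a trained classifier whose output is a probability, so the filter would not remove every non-concept as the idealised stand-in does here.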


Notes

1. https://dumps.wikimedia.org/enwiki/20181020.
2. The permanent link for the selected news item is https://perma.cc/PF53-SY2L.


Acknowledgements

This research has been funded in part by the Swiss Commission for Technology and Innovation (CTI) as part of the research project Feasibility Study X-MAS: Cross-Platform Mediation, Association and Search Engine, CTI-No. 26335.1 PFES-ES. We thank Benjamin Haymond for proof-reading and copy-editing of our work.

Author information

Authors and Affiliations

School of Information Technology, Lucerne University of Applied Sciences, 6343, Rotkreuz, Switzerland
Andreas Waldis, Luca Mazzola & Michael Kaufmann


Corresponding author

Correspondence to Luca Mazzola.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Nordrhein-Westfalen, Germany
Christoph Quix
University of Coimbra, Coimbra, Portugal
Jorge Bernardino


About this paper

Cite this paper

Waldis, A., Mazzola, L., Kaufmann, M. (2019). Concept Recognition with Convolutional Neural Networks to Optimize Keyphrase Extraction. In: Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2018. Communications in Computer and Information Science, vol 862. Springer, Cham. https://doi.org/10.1007/978-3-030-26636-3_8


DOI: https://doi.org/10.1007/978-3-030-26636-3_8
Published: 20 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26635-6
Online ISBN: 978-3-030-26636-3
eBook Packages: Computer Science, Computer Science (R0)
