Concept Recognition with Convolutional Neural Networks to Optimize Keyphrase Extraction

  • Conference paper
Data Management Technologies and Applications (DATA 2018)

Abstract

For knowledge management purposes, it would be useful to automatically classify and tag documents based on their content. Keyphrase extraction is one way of achieving this automatically by using statistical or semantic methods. Whereas corpus-index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. Document-based heuristics can solve this issue, but often result in keyphrases that are not concepts. To increase concept precision, that is, the percentage of extracted keyphrases that represent actual concepts, we contribute a method to filter keyphrases based on a pre-trained convolutional neural network (CNN). We tested CNNs containing vertical and horizontal filters that learn, from a training set of labeled examples, to decide whether an n-gram (i.e., a consecutive sequence of n words) is a concept or not. The classification training signal is derived from the Wikipedia corpus, assuming that an n-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that the n-gram represents a concept. Multiple configurations for vertical and horizontal filters are analyzed and optimized through a hyperparameter tuning process. The results demonstrated an average concept precision for extracted keywords of between 60% and 80%. Consequently, by applying a CNN-based concept recognition filter, the concept precision of keyphrase extraction was significantly improved. For an optimal parameter configuration with an average of five extracted keyphrases per document, the concept precision could be increased from 0.65 to 0.8, meaning that, on average, four out of five keyphrases extracted by our algorithm were actual concepts verified by Wikipedia titles.
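The concept precision metric and the filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy title set, the 0.5 threshold, and the stub scoring function that stands in for the trained CNN are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so phrases compare consistently."""
    return " ".join(phrase.lower().split())


def concept_precision(keyphrases: Iterable[str], wiki_titles: Set[str]) -> float:
    """Share of extracted keyphrases that match a (normalized) Wikipedia title."""
    phrases = [normalize(p) for p in keyphrases]
    if not phrases:
        return 0.0
    hits = sum(1 for p in phrases if p in wiki_titles)
    return hits / len(phrases)


def filter_concepts(
    keyphrases: Iterable[str],
    concept_probability: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep only keyphrases the classifier scores at or above the threshold."""
    return [p for p in keyphrases if concept_probability(p) >= threshold]


# Toy stand-in for the set of Wikipedia page titles (hypothetical data).
wiki_titles = {
    normalize(t)
    for t in [
        "Neural network",
        "Keyphrase extraction",
        "Word embedding",
        "Knowledge management",
    ]
}

# Five extracted keyphrases, one of which is not a concept.
extracted = [
    "neural network",
    "keyphrase extraction",
    "word embedding",
    "knowledge management",
    "based on their",
]


def stub_probability(phrase: str) -> float:
    """Stub in place of the trained CNN: scores the non-concept low."""
    return 0.1 if phrase == "based on their" else 0.9


precision_before = concept_precision(extracted, wiki_titles)  # 4 of 5 match
filtered = filter_concepts(extracted, stub_probability)
precision_after = concept_precision(filtered, wiki_titles)
```

With five extracted keyphrases, a concept precision of 0.8 corresponds to four of them matching a Wikipedia title, which is the reading given at the end of the abstract. In this toy run the stub removes the single non-concept, so the precision of the filtered list rises above that of the unfiltered list.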

Notes

  1. https://dumps.wikimedia.org/enwiki/20181020.

  2. The permanent link for the selected news item is https://perma.cc/PF53-SY2L.

Acknowledgements

This research has been funded in part by the Swiss Commission for Technology and Innovation (CTI) as part of the research project Feasibility Study X-MAS: Cross-Platform Mediation, Association and Search Engine, CTI-No. 26335.1 PFES-ES. We thank Benjamin Haymond for proof-reading and copy-editing of our work.

Author information

Corresponding author

Correspondence to Luca Mazzola.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Waldis, A., Mazzola, L., Kaufmann, M. (2019). Concept Recognition with Convolutional Neural Networks to Optimize Keyphrase Extraction. In: Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2018. Communications in Computer and Information Science, vol 862. Springer, Cham. https://doi.org/10.1007/978-3-030-26636-3_8

  • DOI: https://doi.org/10.1007/978-3-030-26636-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26635-6

  • Online ISBN: 978-3-030-26636-3

  • eBook Packages: Computer Science, Computer Science (R0)
