Automatic keyphrase extraction from scientific articles

Kim, Su Nam; Medelyan, Olena; Kan, Min-Yen; Baldwin, Timothy

doi:10.1007/s10579-012-9210-3

Original Paper
Published: 18 December 2012

Volume 47, pages 723–742 (2013)
Language Resources and Evaluation

Su Nam Kim¹,
Olena Medelyan²,
Min-Yen Kan³ &
Timothy Baldwin¹

Abstract

This paper describes the organization and results of the automatic keyphrase extraction task held at the Workshop on Semantic Evaluation 2010 (SemEval-2010). The task was specifically geared towards scientific articles. Systems were evaluated automatically by matching their extracted keyphrases against those assigned to the same documents by the authors as well as by readers. We outline the task, present the overall ranking of the submitted systems, and discuss the improvements to the state of the art in keyphrase extraction.

Notes

We use “keyphrase” and “keyword” interchangeably to refer to both single words and multiword expressions.
http://bit.ly/maui-datasets.
http://github.com/snkim/AutomaticKeyphraseExtraction.
These values were computed using the test documents only.
Using the Perl implementation available at http://tartarus.org/~martin/PorterStemmer/; we informed participants that this was the stemmer we would be using for the task, to avoid possible stemming variations between implementations.
An alternative approach could have been to use a more fine-grained evaluation measure which takes into account the relative ranking of different keyphrases at a given cutoff, such as nDCG (Jarvelin and Kekalainen 2002).
We also experimented with a naive Bayes learner, but found the results to be identical to the ME learner due to the simplicity of the feature set.
http://opennlp.sourceforge.net/projects.html.
The remaining 19 % of keyphrases do not actually appear in the documents and thus cannot be extracted.
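
The notes above mention evaluation at a cutoff and nDCG as a rank-sensitive alternative. The following is a minimal sketch of the difference, not the task's official scorer. It assumes keyphrases have already been normalized (for example, stemmed), uses binary relevance, and uses the common log2(rank + 1) discount, which differs slightly from the original formulation of Jarvelin and Kekalainen (2002). The function names `prf_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def prf_at_k(ranked, gold, k):
    """Set-based precision, recall and F-score over the top-k candidates.

    `ranked` is a list of candidate keyphrases (assumed unique and already
    normalized); `gold` is a set of reference keyphrases.
    """
    top = ranked[:k]
    hits = sum(1 for phrase in top if phrase in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance nDCG over the top-k candidates."""
    gains = [1.0 if phrase in gold else 0.0 for phrase in ranked[:k]]
    # The ideal ranking places all gold keyphrases first.
    ideal = [1.0] * min(k, len(gold))
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy data, not taken from the task: two rankings with the same top-3 set.
gold = {"a", "c"}
early = ["a", "b", "c"]   # first correct keyphrase at rank 1
late = ["b", "a", "c"]    # first correct keyphrase at rank 2

print(prf_at_k(early, gold, 3), prf_at_k(late, gold, 3))
print(ndcg_at_k(early, gold, 3), ndcg_at_k(late, gold, 3))
```

On this toy input the set-based scores are identical for both rankings, while nDCG is higher for the ranking that places a correct keyphrase first, which is the kind of distinction the note refers to.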

References

Barker, K., & Cornacchia, N. (2000). Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th biennial conference of the Canadian society on computational studies of intelligence: Advances in artificial intelligence (pp. 40–52). Montreal, Canada.
Barzilay, R., & Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 workshop on intelligent scalable text summarization (pp. 10–17). Madrid, Spain.
Berend, G., & Farkas, R. (2010). SZTERGAK: Feature engineering for keyphrase extraction. In Proceedings of the 5th international workshop on semantic evaluation (pp. 186–189). Uppsala, Sweden.
Bordea, G., & Buitelaar, P. (2010). DERIUNLP: A context based approach to automatic keyphrase extraction. In Proceedings of the 5th international workshop on semantic evaluation (pp. 146–149). Uppsala, Sweden.
D’Avanzo, E., & Magnini, B. (2005). A keyphrase-based approach to summarization: The LAKE system. In Proceedings of the 2005 document understanding workshop (DUC 2005) (pp. 6–8). Vancouver, Canada.
Eichler, K., & Neumann, G. (2010). DFKI KeyWE: Ranking keyphrases extracted from scientific articles. In Proceedings of the 5th international workshop on semantic evaluation (pp. 150–153). Uppsala, Sweden.
El-Beltagy, S. R., & Rafea, A. (2010). KP-Miner: Participation in SemEval-2. In Proceedings of the 5th international workshop on semantic evaluation (pp. 190–193). Uppsala, Sweden.
Ercan, G. (2006). Automated text summarization and keyphrase extraction. Master’s thesis, Bilkent University.
Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., & Nevill-Manning, C. G. (1999). Domain specific keyphrase extraction. In Proceedings of the 16th international joint conference on artificial intelligence (IJCAI-99) (pp. 668–673). Stockholm, Sweden.
Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms. International Journal of Digital Libraries, 3(2), 117–132.
Gong, Z., & Liu, Q. (2008). Improving keyword based web image search with visual feature distribution and term expansion. Knowledge and Information Systems, 21(1), 113–132.
Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., & Frank, E. (1999). Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27, 81–104.
Hammouda, K. M., Matute, D. N., & Kamel, M. S. (2005). CorePhrase: Keyphrase extraction for document clustering. In Proceedings of the 4th international conference on machine learning and data mining (MLDM 2005) (pp. 265–274). Leipzig, Germany.
Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on empirical methods in natural language processing (pp. 216–223). Sapporo, Japan.
Hulth, A. (2004). Combining machine learning and natural language processing for automatic keyword extraction. Ph.D. thesis, Stockholm University.
Hulth, A., & Megyesi, B. B. (2006). A study on automatically extracted keywords in text categorization. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp. 537–544). Sydney, Australia.
Jarmasz, M., & Barriere, C. (2004). Keyphrase Extraction: Enhancing Lists. In Proceedings of the 2nd conference on computational linguistics in the North-East. Montreal, Canada. http://arxiv.org/abs/1204.0255.
Jarvelin, K., & Kekalainen, J. (2002). Cumulated Gain-based Evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).
Kim, S. N., Baldwin, T., & Kan, M.-Y. (2009). The use of topic representative words in text categorization. In Proceedings of the fourteenth Australasian document computing symposium (ADCS 2009) (pp. 75–81). Sydney, Australia.
Kim, S. N., Baldwin, T., & Kan, M.-Y. (2010). Evaluating N-gram based evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd international conference on computational linguistics (COLING) (pp. 572–580). Beijing, China.
Kim, S. N., & Kan, M.-Y. (2009). Re-examining automatic keyphrase extraction approach in scientific articles. In Proceedings of the ACL/IJCNLP 2009 workshop on multiword expressions (pp. 7–16). Singapore.
Krapivin, M., Autayeu, A., & Marchese, M. (2009). Large dataset for keyphrases extraction. Technical Report DISI-09-055, DISI, University of Trento, Italy.
Krapivin, M., Autayeu, A., Marchese, M., Blanzieri, E., & Segata, N. (2010). Improving machine learning approaches for keyphrases extraction from scientific documents with natural language knowledge. In Proceedings of the joint JCDL/ICADL international digital libraries conference (pp. 102–111). Gold Coast, Australia.
Lawrie, D., Croft, W. B., & Rosenberg, A. (2001). Finding topic words for hierarchical summarization. In Proceedings of SIGIR 2001 (pp. 349–357). New Orleans, USA.
Litvak, M., & Last, M. (2008). Graph-based keyword extraction for single-document summarization. In Proceedings of the 2nd workshop on multi-source multilingual information extraction and summarization (pp. 17–24). Manchester, UK.
Liu, F., Pennell, D., Liu, F., & Liu, Y. (2009a). Unsupervised approaches for automatic keyword extraction using meeting transcripts. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics (pp. 620–628). Boulder, USA.
Liu, Z., Li, P., Zheng, Y., & Sun, M. (2009b). Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 257–266). Singapore.
Lopez, P., & Romary, L. (2010). HUMB: Automatic key term extraction from scientific articles in GROBID. In Proceedings of the 5th international workshop on semantic evaluation (pp. 248–251). Uppsala, Sweden.
Matsuo, Y., & Ishizuka, M. (2004). Keyword extraction from a single document using word Co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(1), 157–169.
Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 1318–1327). Singapore.
Medelyan, O., & Witten, I. (2006). Thesaurus based automatic keyphrase indexing. In Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries (pp. 296–297).
Mihalcea, R., & Faruque, E. (2004). SenseLearner: Minimally supervised word sense disambiguation for all words in open text. In Proceedings of the ACL/SIGLEX Senseval-3 Workshop (pp. 155–158). Barcelona, Spain.
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. In Proceedings of the 2004 conference on empirical methods in natural language processing. Barcelona, Spain.
Nguyen, T. D., & Kan, M.-Y. (2007). Key phrase extraction in scientific publications. In Proceedings of the international conference on Asian digital libraries (pp. 317–326). Hanoi, Vietnam.
Nguyen, T. D., & Luong, M.-T. (2010). WINGNUS: Keyphrase extraction utilizing document logical structure. In Proceedings of the 5th international workshop on semantic evaluation (pp. 166–169). Uppsala, Sweden.
Ortiz, R., Pinto, D., Tovar, M., & Jiménez-Salazar, H. (2010). BUAP: An unsupervised approach to automatic keyphrase extraction from scientific articles. In Proceedings of the 5th international workshop on semantic evaluation (pp. 174–177). Uppsala, Sweden.
Ouyang, Y., Li, W., & Zhang, R. (2010). 273. Task 5. keyphrase extraction based on core word identification and word expansion. In Proceedings of the 5th international workshop on semantic evaluation (pp. 142–145). Uppsala, Sweden.
Park, J., Lee, J. G., & Daille, B. (2010). UNPMC: Naive approach to extract keyphrases from scientific articles. In Proceedings of the 5th international workshop on semantic evaluation (pp. 178–181). Uppsala, Sweden.
Pasquier, C. (2010). Single document keyphrase extraction using sentence clustering and Latent Dirichlet allocation. In Proceedings of the 5th international workshop on semantic evaluation (pp. 154–157). Uppsala, Sweden.
Paukkeri, M.-S., & Honkela, T. (2010). Likey: unsupervised language-independent keyphrase extraction. In Proceedings of the 5th international workshop on semantic evaluation (pp. 162–165). Uppsala, Sweden.
Paukkeri, M.-S., Nieminen, I. T., Polla, M., & Honkela, T. (2008). A language-independent approach to keyphrase extraction and evaluation. In Proceedings of the 22nd international conference on computational linguistics (pp. 83–86). Manchester, UK.
Pianta, E., & Tonelli, S. (2010). KX: A flexible system for keyphrase extraction. In Proceedings of the 5th international workshop on semantic evaluation (pp. 170–173). Uppsala, Sweden.
Schutz, A. T. (2008). Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. Master’s thesis, National University of Ireland.
Schwartz, A. S., & Hearst, M. A. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. In Proceedings of the Pacific symposium on biocomputing (Vol. 8, pp. 451–462).
Tomokiyo, T., & Hurst, M. (2003). A language model approach to keyphrase extraction. In Proceedings of ACL workshop on multiword expressions (pp. 33–40). Sapporo, Japan.
Treeratpituk, P., Teregowda, P., Huang, J., & Giles, C. L. (2010). SEERLAB: A system for extracting keyphrases from scholarly documents. In Proceedings of the 5th international workshop on semantic evaluation (pp. 182–185). Uppsala, Sweden.
Turney, P. (1999). Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057. (NRC #41622).
Turney, P. (2003). Coherent keyphrase extraction via Web mining. In Proceedings of the eighteenth international joint conference on artificial intelligence (pp. 434–439). Acapulco, Mexico.
Wan, X., & Xiao, J. (2008). CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of 22nd international conference on computational linguistics (pp. 969–976). Manchester, UK.
Wang, C., Zhang, M., Ru, L., & Ma, S. (2008). An automatic online news topic keyphrase extraction system. In Proceedings of 2008 IEEE/WIC/ACM international conference on web intelligence (pp. 214–219). Sydney, Australia.
Wang, L., & Li, F. (2010). SJTULTLAB: Chunk based method for keyphrase extraction. In Proceedings of the 5th international workshop on semantic evaluation (pp. 158–161). Uppsala, Sweden.
Witten, I., Paynter, G., Frank, E., Gutwin, C., & Nevill-Manning, C. G. (1999). KEA: Practical automatic key phrase extraction. In Proceedings of the fourth ACM conference on digital libraries (pp. 254–255). Berkeley, USA.
Zervanou, K. (2010). UvT: The UvT Term extraction system in the keyphrase extraction task. In Proceedings of the 5th international workshop on semantic evaluation (pp. 194–197). Uppsala, Sweden.
Zesch, T., & Gurevych, I. (2009). Approximate matching for evaluating keyphrase extraction. In Proceedings of RANLP 2009 (Recent Advances in Natural Language Processing) (pp. 484–489). Borovets, Bulgaria.
Zhang, Y., Zincir-Heywood, N., & Milios, E. (2004). Term based clustering and summarization of Web Page collections. In Proceedings of the 17th conference of the Canadian society for computational studies of intelligence (pp. 60–74). London, Canada.

Acknowledgements

This work was supported by National Research Foundation grant “Interactive Media Search” (grant # R-252-000-325-279) for Min-Yen Kan, and ARC Discovery grant no. DP110101934 for Timothy Baldwin.

Author information

Authors and Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Su Nam Kim & Timothy Baldwin
Pingar, Auckland, New Zealand
Olena Medelyan
School of Computing, National University of Singapore, Singapore, Singapore
Min-Yen Kan

Corresponding author

Correspondence to Timothy Baldwin.

About this article

Cite this article

Kim, S.N., Medelyan, O., Kan, MY. et al. Automatic keyphrase extraction from scientific articles. Lang Resources & Evaluation 47, 723–742 (2013). https://doi.org/10.1007/s10579-012-9210-3

Published: 18 December 2012
Issue Date: September 2013
DOI: https://doi.org/10.1007/s10579-012-9210-3
