DOI: 10.1145/3543507.3583220
research-article

Descartes: Generating Short Descriptions of Wikipedia Articles

Published: 30 April 2023

ABSTRACT

Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia’s guidelines state that all articles should be annotated with a so-called short description indicating the article’s topic (e.g., the short description of beer is “Alcoholic drink made from fermented cereal grains”). Nonetheless, a large fraction of articles (ranging from 10.2% in Dutch to 99.7% in Kazakh) have no short description yet, with detrimental effects for millions of Wikipedia users. Motivated by this problem, we introduce the novel task of automatically generating short descriptions for Wikipedia articles and propose Descartes, a multilingual model for tackling it. Descartes integrates three sources of information to generate an article description in a target language: the text of the article in all its language versions, the already-existing descriptions (if any) of the article in other languages, and semantic type information obtained from a knowledge graph. We evaluate a Descartes model trained for handling 25 languages simultaneously, showing that it beats baselines (including a strong translation-based baseline) and performs on par with monolingual models tailored for specific languages. A human evaluation on three languages further shows that the quality of Descartes’s descriptions is largely indistinguishable from that of human-written descriptions; e.g., 91.3% of our English descriptions (vs. 92.1% of human-written descriptions) pass the bar for inclusion in Wikipedia, suggesting that Descartes is ready for production, with the potential to support human editors in filling a major gap in today’s Wikipedia across languages.

Published in

WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN: 9781450394161
DOI: 10.1145/3543507

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
