Descartes: Generating Short Descriptions of Wikipedia Articles

Authors:
Marija Sakota (EPFL, Switzerland), ORCID 0000-0002-5192-5842
Maxime Peyrard (EPFL, Switzerland), ORCID 0000-0003-4782-6603
Robert West (EPFL, Switzerland), ORCID 0000-0002-3984-1232

WWW '23: Proceedings of the ACM Web Conference 2023, April 2023, Pages 1446–1456
https://doi.org/10.1145/3543507.3583220

Published: 30 April 2023

ABSTRACT

Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia’s guidelines state that all articles should be annotated with a so-called short description indicating the article’s topic (e.g., the short description of beer is “Alcoholic drink made from fermented cereal grains”). Nonetheless, a large fraction of articles (ranging from 10.2% in Dutch to 99.7% in Kazakh) have no short description yet, with detrimental effects for millions of Wikipedia users. Motivated by this problem, we introduce the novel task of automatically generating short descriptions for Wikipedia articles and propose Descartes, a multilingual model for tackling it. Descartes integrates three sources of information to generate an article description in a target language: the text of the article in all its language versions, the already-existing descriptions (if any) of the article in other languages, and semantic type information obtained from a knowledge graph. We evaluate a Descartes model trained for handling 25 languages simultaneously, showing that it beats baselines (including a strong translation-based baseline) and performs on par with monolingual models tailored for specific languages. A human evaluation on three languages further shows that the quality of Descartes’s descriptions is largely indistinguishable from that of human-written descriptions; e.g., 91.3% of our English descriptions (vs. 92.1% of human-written descriptions) pass the bar for inclusion in Wikipedia, suggesting that Descartes is ready for production, with the potential to support human editors in filling a major gap in today’s Wikipedia across languages.

Published in
WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN: 9781450394161
DOI: 10.1145/3543507
Editors: Ying Ding, Jie Tang, Juan Sequeda, Lora Aroyo, Carlos Castillo, Geert-Jan Houben
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States