DOI: 10.1145/3487553.3524927
research-article

Anchor Prediction: A Topic Modeling Approach

Published: 16 August 2022

ABSTRACT

Networks of documents connected by hyperlinks, such as Wikipedia, are ubiquitous. Authors insert hyperlinks to enrich the text and to facilitate navigation through the network. However, they tend to insert only a fraction of the relevant hyperlinks, mainly because doing so is time-consuming. In this paper, we address an annotation task that we refer to as anchor prediction. Although it is conceptually close to link prediction and entity linking, it is a distinct task that requires a dedicated method. Given a source document and a target document, the task consists of automatically identifying anchors in the source document, i.e., words or terms that should carry a hyperlink pointing to the target document. We propose a contextualized relational topic model, CRTM, that models directed links between documents as a function of the local context of the anchor in the source document and the whole content of the target document. The model can be used to predict anchors in a source document, given the target document, without relying on a dictionary of previously seen mentions or titles, or on any external knowledge graph. Authors can benefit from CRTM by letting it automatically suggest hyperlinks, given a new document and the set of target documents to connect to. Readers can benefit as well, since CRTM can dynamically insert hyperlinks between the documents they are reading. Experiments conducted on several Wikipedia corpora (in English, Italian and German) highlight the practical usefulness of anchor prediction and demonstrate the relevance of our approach.
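To make the input and output of the task concrete, the sketch below ranks token positions in a source document by how well their local context matches the whole content of a target document. It is not CRTM and is not taken from the paper: it is a deliberately simple bag-of-words baseline, and the function names (`tokenize`, `cosine`, `rank_anchor_candidates`), the window size, and the extra weight on the centre token are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_anchor_candidates(source, target, window=2, top_k=3):
    """Return the top_k (score, position, token) triples of `source`.

    Each position is scored by comparing its local context (the tokens
    within `window` positions, with the centre token counted twice) to
    the bag of words of the whole `target` document.
    """
    src_tokens = tokenize(source)
    tgt_vector = Counter(tokenize(target))
    scored = []
    for i, token in enumerate(src_tokens):
        context = Counter(src_tokens[max(0, i - window): i + window + 1])
        context[token] += 1  # hypothetical choice: emphasise the candidate itself
        scored.append((cosine(context, tgt_vector), i, token))
    # Highest score first; ties are broken by position in the source.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


if __name__ == "__main__":
    source = "The model builds on latent Dirichlet allocation to represent documents"
    target = "Latent Dirichlet allocation is a generative topic model"
    for score, position, token in rank_anchor_candidates(source, target):
        print(position, token, round(score, 3))
```

On this toy pair, the three top-ranked positions are the tokens "latent", "dirichlet" and "allocation", which is the span a human editor would likely link. A learned model such as the one described in the abstract would replace the raw word-overlap score with a score derived from topic representations of the context and of the target.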

Published in

WWW '22: Companion Proceedings of the Web Conference 2022
April 2022
1338 pages
ISBN: 9781450391306
DOI: 10.1145/3487553

Copyright © 2022 ACM

Publisher

Association for Computing Machinery
New York, NY, United States