ABSTRACT
Linking entities from different sources is a fundamental task in building open knowledge graphs. Despite much research in related fields, the challenges of linking large-scale heterogeneous entity graphs are far from resolved. Using two billion-scale academic entity graphs (Microsoft Academic Graph and AMiner) as the sources for our study, we propose a unified framework, LinKG, to address the problem of building a large-scale linked entity graph. LinKG consists of three linking modules, each of which addresses one category of entities. To link word-sequence-based entities (e.g., venues), we present a long short-term memory network-based method for capturing the dependencies. To link large-scale entities (e.g., papers), we leverage locality-sensitive hashing and convolutional neural networks for scalable and precise linking. To link entities with ambiguity (e.g., authors), we propose heterogeneous graph attention networks to model different types of entities. Our extensive experiments and systematic analysis demonstrate that LinKG can achieve linking accuracy with an F1-score of 0.9510, significantly outperforming the state-of-the-art. LinKG has been deployed to Microsoft Academic Search and AMiner to integrate the two large graphs. We have published the linked results as the Open Academic Graph (OAG, https://www.openacademic.ai/oag/), making it the largest publicly available heterogeneous academic graph to date.
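The abstract names locality-sensitive hashing as the scalable first step for linking papers. As a rough illustration of that idea only, the sketch below blocks paper titles from two sources into candidate pairs. It is not LinKG's implementation: the abstract does not say which hash family is used, so MinHash over character trigrams is assumed here, and the function names, the 32 hash functions, and the 16 bands are all hypothetical choices.

```python
import hashlib
from collections import defaultdict


def shingles(title, n=3):
    """Character n-grams of a title after lowercasing and dropping punctuation."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    text = " ".join(cleaned.split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, keep the smallest hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def lsh_candidates(titles_a, titles_b, num_hashes=32, bands=16):
    """Return index pairs (i, j) whose signatures agree on at least one band."""
    rows = num_hashes // bands
    # Each bucket holds the indices from source A and source B that landed in it.
    buckets = defaultdict(lambda: (set(), set()))
    for side, titles in enumerate((titles_a, titles_b)):
        for idx, title in enumerate(titles):
            sig = minhash_signature(shingles(title), num_hashes)
            for b in range(bands):
                key = (b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets[key][side].add(idx)
    pairs = set()
    for left, right in buckets.values():
        pairs.update((i, j) for i in left for j in right)
    return pairs


if __name__ == "__main__":
    source_a = ["Graph Attention Networks", "Long short-term memory"]
    source_b = ["graph attention networks.", "Duplicate record detection: A survey"]
    # Titles that normalize to the same text always share every bucket.
    print(sorted(lsh_candidates(source_a, source_b)))
```

Only the pairs returned by this step would be passed to a more expensive matcher (in the abstract, a convolutional neural network), which is what keeps the comparison count far below the full cross product of the two sources.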
Index Terms
- OAG: Toward Linking Large-scale Heterogeneous Entity Graphs