Elsevier

Information Fusion

Volume 46, March 2019, Pages 171-183
Full Length Article
An incremental graph-partitioning algorithm for entity resolution

https://doi.org/10.1016/j.inffus.2018.06.001

Highlights

  • A novel incremental data association algorithm is proposed for entity resolution.

  • Order of magnitude faster than batch algorithms, with little/no loss in accuracy.

  • Shows 30–40% better F-Score on John Smith dataset, compared to leading heuristics.

  • Proposed algorithm leverages the Clique Partition Problem.

  • Quickly updates solution for new references as well as changes in similarity scores.

Abstract

Entity resolution is an important data association task when fusing information from multiple sources. The information often arrives continuously, and the entity resolution algorithm needs to update its solution efficiently upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm handles both incrementally arriving entity references and incrementally arriving information that changes the pairwise similarity scores between references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions; the heuristic is also compatible with a blocking technique that limits the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real), and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement in accuracy (approximately 30–40% better F-Score on a natural language dataset) compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch approach that re-solves the entire problem.

Introduction

In today’s era of data deluge, information fusion is finding applications in a growing number of domains, such as disaster relief [28], [38] and consumer marketing [39]. In most of these applications, data are gathered by heterogeneous sources such as human observers and human-operated or automated physical sensors like cameras, LIDAR, and acoustic sensors. These data are gathered over time and may contain duplicated references to the same set of objects, which may cause inconsistencies and inefficiencies in data analysis. For this reason, data association becomes an essential step in multi-source information fusion: it identifies when two sensors are observing the same object and, if so, fuses those observations together, creating “cumulative” evidence. This cumulative evidence can be used as a surrogate for the real world when building and testing various hypotheses and models.

In settings where the observed data is in the form of named entities such as people, places, organizations, etc., this problem is often referred to as entity resolution. In these settings, the data is often in the form of structured or unstructured natural language, like databases, field reports, news articles, websites, citation lists, etc. Observations of entities are called entity references, and the goal is to identify when multiple entity references refer to the same entity, in which case they are said to be co-referent. In this paper, the term entity resolution is used, rather than data association, since the test datasets used later are primarily drawn from that community.

In many applications, the data on which entity resolution is performed changes constantly over time. Newly arriving information takes two forms: (1) new entity references added to the current dataset; and (2) new information that may revise beliefs about the current entity resolution solution. The challenge of incremental entity resolution is to update the current solution to reflect these changes without re-solving the entire problem. If done correctly, incremental entity resolution can provide large gains in computation time, with little to no loss in solution accuracy.
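The two kinds of update can be illustrated with a minimal Python sketch. It is not the algorithm developed in this paper: it is a simple greedy local move, and the names `add_reference`, `revise_score` and `place`, as well as the convention that positive scores favor co-reference and negative scores oppose it, are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Optional, Set

Weights = Dict[FrozenSet[str], float]  # hypothetical store of pairwise scores


def gain(w: Weights, ref: str, cluster: Set[str]) -> float:
    """Total score between ref and the members of cluster (missing pairs count as 0)."""
    return sum(w.get(frozenset((ref, other)), 0.0) for other in cluster)


def place(w: Weights, clusters: List[Set[str]], ref: str) -> None:
    """Greedy move: join the cluster with the largest positive gain, else start a singleton."""
    best: Optional[Set[str]] = None
    best_gain = 0.0
    for cluster in clusters:
        g = gain(w, ref, cluster)
        if g > best_gain:
            best, best_gain = cluster, g
    if best is None:
        clusters.append({ref})
    else:
        best.add(ref)


def add_reference(w: Weights, clusters: List[Set[str]],
                  ref: str, scores: Dict[str, float]) -> None:
    """Update type (1): a new reference arrives with scores against existing references."""
    for other, score in scores.items():
        w[frozenset((ref, other))] = score
    place(w, clusters, ref)


def revise_score(w: Weights, clusters: List[Set[str]],
                 a: str, b: str, new_score: float) -> None:
    """Update type (2): a score changes, so both endpoints are removed and re-placed."""
    w[frozenset((a, b))] = new_score
    for ref in (a, b):
        for cluster in clusters:
            if ref in cluster:
                cluster.discard(ref)
                break
        clusters[:] = [c for c in clusters if c]  # drop clusters left empty
        place(w, clusters, ref)
```

Only the references touched by the new information are re-examined, which is where the time savings of an incremental approach come from. This sketch moves one reference at a time and never merges or splits whole clusters, so it can miss better partitions that a fuller heuristic would find.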

The main contribution of this paper is an incremental heuristic algorithm for the Clique Partition Problem (CPP), a well-studied graph partitioning problem [13], [21], [37], [45]. The algorithm updates a current partitioning solution when a new entity reference arrives or when new information about an existing reference arrives. It is tested on a variety of datasets to establish its performance under a range of conditions.
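For orientation, the CPP is commonly written in the literature as the integer program below. The symbols are assumptions rather than this paper's notation: $w_{ij}$ is the pairwise score between references $i$ and $j$ (positive when they appear co-referent, negative otherwise), and $x_{ij} = x_{ji} = 1$ when the two references are placed in the same cluster.

```latex
\begin{align}
\max_{x} \quad & \sum_{1 \le i < j \le n} w_{ij}\, x_{ij} \\
\text{s.t.} \quad & x_{ij} + x_{jk} - x_{ik} \le 1
    && \text{for all distinct } i, j, k \\
& x_{ij} \in \{0, 1\} && \text{for all } i < j
\end{align}
```

The triangle constraints enforce transitivity: if $i$ and $j$ are co-referent and $j$ and $k$ are co-referent, then $i$ and $k$ must be as well, so every feasible $x$ corresponds to a partition of the references into cliques.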

The remainder of this paper is organized as follows. Section 2 reviews prior work in incremental entity resolution with an emphasis on incremental clustering. The new incremental algorithm is presented in Section 3. The testing methodology is discussed in Section 4, along with the experimental results for both random and real datasets. Finally, we conclude with a summary and some future research directions.

Related work

This section contains a review of prior work on partitioning and clustering problems applied to entity resolution. Clustering is reviewed since it is an extremely well studied topic for solving the same basic problem in entity resolution. A survey of clustering algorithms is given by Xu and Wunsch [51].

A mathematical programming framework will be adopted to help demonstrate some important properties of the incremental entity resolution algorithm. There is an extensive history of applying

Incremental clique partitioning for entity resolution

In this section it is shown how entity resolution can be modeled as a Clique Partition Problem (CPP) and solved to optimality. It is then shown how the CPP can be used to update the optimal entity resolution solution subject to incrementally arriving information without necessarily re-solving the entire problem. Finally, the implications of solving the CPP heuristically and of the inclusion of blocking strategies to improve execution time are discussed. When possible we choose notation and

Experimental results

Two different graph partitioning solvers are used with the incremental method of Section 3. The first solves the CPP(w) formulation optimally using CPLEX 12.5, and the second employs the agglomerative clustering algorithm explained in Section 3.5. CPLEX is run with default settings except for parallel mode turned off (multi-threading and concurrency were not explored). The agglomerative clustering solution is found to perform very close to CPLEX on the majority of datasets tested. To prevent

Conclusions and future research

To conclude, we studied the data association problem of resolving named entities (also known as entity resolution) in a dynamic setting where data arrives incrementally into the system. The main contribution of this work is an incremental algorithm that maintains a high-quality solution to an entity resolution model as new references and scoring information arrive. We formulated this problem as a clique partitioning problem, which lends itself to algorithm development.
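To make the clique partitioning view concrete, here is a minimal Python sketch of a generic greedy agglomerative heuristic for that objective. It is an illustration under stated assumptions, not the procedure from Section 3.5: the function name `agglomerate`, the score dictionary `w`, and the rule of merging the pair of clusters with the largest positive total score are all choices made here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def agglomerate(refs: Iterable[str], w: Dict[FrozenSet[str], float]) -> List[Set[str]]:
    """Greedy merge heuristic for a clique partitioning objective.

    Starts from singletons and repeatedly merges the two clusters whose
    total pairwise score is largest, stopping when no merge has positive score.
    Pairs missing from w are treated as having score 0 (an assumption).
    """
    clusters: List[Set[str]] = [{r} for r in refs]

    def link(c1: Set[str], c2: Set[str]) -> float:
        # Change in the objective if c1 and c2 were merged.
        return sum(w.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= 0:
            break  # no remaining merge improves the objective
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters
```

Each merge adds exactly the summed score between the two clusters to the objective, so the objective never decreases. The result is a local optimum with respect to merges only, and it can differ from the exact optimum a solver such as CPLEX would return.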

Disclaimer

The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government.

Acknowledgments

This research activity is supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for Unified Research on Network-based Hard/Soft Information Fusion, issued by the US Army Research Office (ARO).

This material is also based upon work supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory contract number FA8650-10-C-7062. The U.S. Government is authorized to reproduce and distribute reprints for

References (51)

  • Z. Chen et al.

    Graph-based clustering for computational linguistics: a survey

    Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing

    (2010)
  • P. Christen et al.

    Towards scalable real-time entity resolution using a similarity-aware inverted index approach

    Proceedings of the Seventh Australasian Data Mining Conference (AusDM 2008)

    (2008)
  • W.W. Cohen et al.

    A comparison of string distance metrics for name-matching tasks

    Proceedings of the 2003 Workshop on Information Integration on the Web, IIWEB

    (2003)
  • G. Costa et al.

    An incremental clustering scheme for data de-duplication

    Data Min. Knowl. Discov.

    (2010)
  • K. Date et al.

    Test and evaluation of data association algorithms in hard+soft data fusion

    Proceedings of the Seventeenth International Conference on Information Fusion (FUSION)

    (2014)
  • U. Dorndorf et al.

    Fast clustering algorithms

    ORSA J. Comput.

    (1994)
  • J. Finkel et al.

    Enforcing transitivity in coreference resolution

    Proceedings of the Forty-Sixth Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers

    (2008)
  • J.R. Finkel et al.

    Incorporating non-local information into information extraction systems by Gibbs sampling

    Proceedings of the Forty-Third Annual Meeting of the Association for Computational Linguistics

    (2005)
  • D. Firmani et al.

    Online entity resolution using an oracle

    Proc. VLDB Endow.

    (2016)
  • J. Gehrke et al.

    Overview of the 2003 KDD cup

    SIGKDD Explor.

    (2003)
  • C.H. Gooi et al.

    Cross-document coreference on a large scale corpus

    Proceedings of the 2004 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, HLT-NAACL

    (2004)
  • J.L. Graham et al.

    A synthetic dataset for evaluating soft and hard fusion algorithms

    Proceedings of the 2011 SPIE

    (2011)
  • G.A. Gross et al.

    Systemic test and evaluation of a hard+soft information fusion framework: challenges and current approaches

    Proceedings of the Seventeenth International Conference on Information Fusion (FUSION)

    (2014)
  • M. Grötschel et al.

    A cutting plane algorithm for a clustering problem

    Math. Program.

    (1989)
  • A. Gruenheid et al.

    Incremental record linkage

    Proc. VLDB Endow.

    (2014)