Probabilistic latent variable models for unsupervised many-to-many object matching

https://doi.org/10.1016/j.ipm.2015.12.013

Highlights

  • We propose a probabilistic model for matching clusters in different domains without correspondence information.

  • The proposed method can handle data with more than two domains, and the number of objects in each domain can be different.

  • We extend the proposed method to a semi-supervised setting.

  • We demonstrate that the proposed method achieves better matching performance than existing methods on synthetic and real-world data sets.

Abstract

Object matching is an important task for finding the correspondence between objects in different domains, such as documents in different languages and users in different databases. In this paper, we propose probabilistic latent variable models that offer many-to-many matching without correspondence information or similarity measures between different domains. The proposed model assumes that there is an infinite number of latent vectors that are shared by all domains, and that each object is generated from one of the latent vectors and a domain-specific projection. By inferring the latent vector used for generating each object, objects in different domains are clustered according to the vectors that they share. Thus, we can realize matching between groups of objects in different domains in an unsupervised manner. We give learning procedures for the proposed model based on a stochastic EM algorithm. We also derive learning procedures for a semi-supervised setting, where correspondence information for some objects is given. The effectiveness of the proposed models is demonstrated by experiments on synthetic and real data sets.

Introduction

Object matching is an important task for finding the correspondence between objects in different domains. Examples of object matching include matching an image with an annotation (Socher & Fei-Fei, 2010), an English word with a French word (Tripathi, Klami, & Virpioja, 2010), and user identification in different databases for recommendation (Li, Yang, & Xue, 2009). Most object matching methods require similarity measures between objects in the different domains, or paired data that contain correspondence information. When a similarity measure is given, we can match objects by finding pairs of objects that maximize the sum of the similarities. When correspondence information is given, we can obtain a mapping function from one domain to another by using supervised learning methods, and then we can calculate the similarities between objects in different domains.

However, similarity measures and correspondences might not be available. Defining similarities and generating correspondences incur considerable cost and time, and they are sometimes unobtainable because of the need to preserve privacy. For example, dictionaries between some languages might not exist, and different online stores cannot share user identification. For these situations, unsupervised object matching methods have been proposed; they include kernelized sorting (Quadrianto, Smola, Song, & Tuytelaars, 2010), least squares object matching (Yamada & Sugiyama, 2011), matching canonical correlation analysis (Haghighi, Liang, Berg-Kirkpatrick, & Klein, 2008), and its Bayesian extension (Klami, 2012; Klami, 2013). These methods find one-to-one matches. However, matching is not necessarily one-to-one in some applications. For example, when matching English and German documents, multiple English documents on a similar topic could correspond to multiple German documents. In image annotation, the related annotations ‘tree’, ‘wood’ and ‘forest’ can be attached to multiple images that look similar to each other. Other limitations of these methods are that the number of domains is limited to two, and that the numbers of objects in the different domains must be the same. Some applications involve more than two domains, for example matching multilingual documents in English, French and German, and the number of documents for each language can be different.

In this paper, we propose a probabilistic latent variable model for finding correspondence between object clusters in multiple domains without correspondence information. We assume that objects in different domains share a hidden structure, which is represented by an infinite number of latent vectors that are shared by all domains. Each object is generated from one of the latent vectors and a domain-specific linear projection. The latent vectors used for generating objects are unknown. By assigning a latent vector to each object, we can allocate objects in different domains to common clusters, and thus find many-to-many matches. The number of clusters is automatically inferred from the given data by using a Dirichlet process prior. The proposed model can handle more than two domains with different numbers of objects. We infer the proposed model using a stochastic EM algorithm. By inferring the domain-specific linear projections, the proposed model becomes invariant to arbitrary linear transformations of each domain, and can therefore find cluster matchings between domains whose similarities cannot be calculated directly.
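To make the generative assumptions concrete, the following is a minimal sketch of this kind of generative process, assuming a Chinese restaurant process (CRP) representation of the Dirichlet process, an illustrative latent dimensionality K, and isotropic Gaussian noise; all variable names are our own shorthand, not the paper's notation:

```python
import numpy as np

def generate(D, N, M, K=2, alpha=1.0, noise=0.1, seed=0):
    """Sample multi-domain data from a CRP-based latent variable model.

    D: number of domains; N[d]: objects in domain d; M[d]: dimensionality.
    Each object picks a shared latent vector via a Chinese restaurant
    process, then is projected by a domain-specific matrix W[d] plus noise.
    """
    rng = np.random.default_rng(seed)
    W = [rng.normal(size=(M[d], K)) for d in range(D)]  # domain projections
    Z, counts = [], []          # shared latent vectors and their usage counts
    X, S = [], []               # observations and cluster assignments
    for d in range(D):
        Xd, Sd = np.empty((N[d], M[d])), np.empty(N[d], dtype=int)
        for n in range(N[d]):
            total = sum(counts)
            # CRP: join an existing cluster w.p. proportional to its size,
            # or open a new one w.p. proportional to alpha
            probs = np.array(counts + [alpha]) / (total + alpha)
            s = rng.choice(len(probs), p=probs)
            if s == len(Z):                      # new cluster
                Z.append(rng.normal(size=K))
                counts.append(0)
            counts[s] += 1
            Sd[n] = s
            Xd[n] = W[d] @ Z[s] + noise * rng.normal(size=M[d])
        X.append(Xd); S.append(Sd)
    return X, S, W, Z
```

Inference reverses this process: the cluster assignments and the projection matrices are unknown, and the paper estimates them with a stochastic EM algorithm.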

The proposed model assumes a Gaussian distribution for each observed variable, whose mean is determined by a latent vector and a linear projection matrix. It is an extension of probabilistic principal component analysis (PCA) (Tipping & Bishop, 1999b) and factor analysis (FA) (Everitt, 1984), which are representative probabilistic latent variable models. With probabilistic PCA and FA, each object is associated with its own latent vector. With the proposed model, on the other hand, the latent vector assigned to each object is hidden. When the number of domains is one, and every object is assigned to a cluster that is different from those of the other objects, the proposed model reduces to probabilistic principal component analysis.
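In this PPCA-like view, the observation likelihood can be written as follows; this is a sketch in our own shorthand, where $\mathbf{z}_k$ is the $k$th shared latent vector, $s_{dn}$ the cluster assignment of object $n$ in domain $d$, $\mathbf{W}_d$ the domain-specific projection, and $\alpha$ an assumed isotropic noise precision:

$$p(\mathbf{x}_{dn} \mid s_{dn} = k, \mathbf{Z}, \mathbf{W}_d) = \mathcal{N}\big(\mathbf{x}_{dn} \mid \mathbf{W}_d \mathbf{z}_k,\ \alpha^{-1}\mathbf{I}\big).$$

Probabilistic PCA is then the special case with $D = 1$ and $s_{1n} = n$ for every object, i.e., each object has its own latent vector.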

The proposed model can also be used in a semi-supervised setting, where correspondence information for some objects is given (Jagarlamudi, Juarez, & Daumé III, 2010; Quadrianto, Smola, Song, & Tuytelaars, 2010). This information assists matching by incorporating the condition that the cluster assignments of corresponding objects must be the same. We derive learning procedures for the semi-supervised setting by modifying those for the unsupervised setting.
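One simple way to realize this condition during inference, sketched below under the assumption that correspondences arrive as pairs of (domain, index) tuples, is to merge linked objects into blocks with union-find and sample a single cluster assignment per block; the function name and data layout are illustrative, not the paper's:

```python
def correspondence_blocks(objects, pairs):
    """Group objects that must share a cluster assignment.

    objects: list of (domain, index) tuples for all objects.
    pairs: list of ((d, n), (d2, n2)) known correspondences.
    Returns a list of blocks; each block is assigned jointly, so all
    of its members always carry the same cluster assignment.
    """
    parent = {o: o for o in objects}

    def find(o):                     # union-find with path compression
        while parent[o] != o:
            parent[o] = parent[parent[o]]
            o = parent[o]
        return o

    for a, b in pairs:
        parent[find(a)] = find(b)    # union the two objects' blocks

    blocks = {}
    for o in objects:
        blocks.setdefault(find(o), []).append(o)
    return list(blocks.values())
```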

This paper is an extended version of Iwata, Hirao, and Ueda (2013). We newly present an inference procedure for the semi-supervised setting, and add derivations and experiments. The remainder of this paper is organized as follows. In Section 2, we review related work. In Section 3, we formulate the proposed model and describe its efficient learning procedures. We also present learning procedures for the semi-supervised setting and for missing data. In Section 4, we demonstrate the effectiveness of the proposed models with experiments on synthetic and real data sets. Finally, we present concluding remarks and a discussion of future work in Section 5.

Section snippets

Unsupervised object matching

Unsupervised object matching is the task of finding the correspondence between objects in different domains without correspondence information. For example, kernelized sorting (Quadrianto et al., 2010) finds the correspondence by permuting one set so as to maximize the dependence between the two domains, where the Hilbert–Schmidt Independence Criterion (HSIC) is used as the dependence measure. Kernelized sorting requires that the two domains have the same number of objects. Convex kernelized sorting (
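For reference, the empirical HSIC used by kernelized sorting can be computed from centered kernel matrices as $\mathrm{tr}(\mathbf{K}\mathbf{H}\mathbf{L}\mathbf{H})/(n-1)^2$; a minimal sketch with Gaussian kernels (the kernel choice and bandwidth here are illustrative assumptions):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC between paired samples X (n, p) and Y (n, q)."""
    n = X.shape[0]
    def gram(A):                       # Gaussian (RBF) kernel matrix
        sq = np.sum(A**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
        return np.exp(-d2 / (2 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    K, L = gram(X), gram(Y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Kernelized sorting then searches over permutations of one domain's objects so as to maximize this quantity.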

Model

Suppose that we are given objects in $D$ domains $\mathbf{X} = \{\mathbf{X}_d\}_{d=1}^{D}$, where $\mathbf{X}_d = \{\mathbf{x}_{dn}\}_{n=1}^{N_d}$ is a set of objects in the $d$th domain, and $\mathbf{x}_{dn} \in \mathbb{R}^{M_d}$ is the feature vector of the $n$th object in the $d$th domain. Our notation is summarized in Table 1. Note that we are unaware of any correspondence between objects in different domains. The number of objects $N_d$ and the dimensionality $M_d$ of each domain can be different from those of other domains. The task is to match groups of objects across multiple domains in an
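In code, such multi-domain data is simply a collection of real-valued matrices that need not agree in the number of rows or columns; a minimal illustration (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# D = 3 domains with different numbers of objects N_d and dimensionalities M_d
N, M = [100, 80, 120], [16, 32, 8]
X = [rng.normal(size=(N[d], M[d])) for d in range(3)]  # X[d][n] is x_dn
```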

Matching rotated handwritten digits

First, we demonstrate the proposed model described in Section 3 using a toy data set with three domains, which is created using handwritten digits from the MNIST database (LeCun, Bottou, Bengio, & Haffner, 1998). The first domain contains the original handwritten digits, where each image is downsampled to 16 × 16 pixels. We synthesize objects for the second and third domains by rotating the handwritten digits by 90 and 180 degrees clockwise, respectively. Thus, we obtain three-domain objects that
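A minimal sketch of how such a three-domain toy set can be constructed, assuming the digit images are already available as an array of 16 × 16 images (loading and downsampling MNIST is omitted):

```python
import numpy as np

def make_three_domains(images):
    """Build three domains from 16x16 digit images.

    images: array of shape (N, 16, 16).
    Domain 1: the original images; domains 2 and 3: the same images
    rotated by 90 and 180 degrees clockwise, respectively.
    """
    flat = lambda A: A.reshape(len(A), -1)           # (N, 256) feature vectors
    d1 = flat(images)
    d2 = flat(np.rot90(images, k=-1, axes=(1, 2)))   # 90 degrees clockwise
    d3 = flat(np.rot90(images, k=-2, axes=(1, 2)))   # 180 degrees
    return [d1, d2, d3]
```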

Conclusion

We proposed a generative model approach for finding many-to-many matching based on probabilistic latent variable models. In experiments, we confirmed that the proposed model can perform much better than conventional methods based on object matching, clustering and their combinations. Advantages of the proposed model over the existing methods are that it can find many-to-many matching, and can handle multiple domains with different numbers of objects with no prior knowledge. Because the proposed

References (40)

  • Andrew, G., et al. Deep canonical correlation analysis. Proceedings of the 30th International Conference on Machine Learning (2013).
  • Bach, F. R., et al. A probabilistic interpretation of canonical correlation analysis. Tech. Rep. 688 (2005).
  • Boyd-Graber, J., et al. Multilingual topic models for unaligned text. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2009).
  • Chang, C., et al. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) (2011).
  • Djuric, N., et al. Convex kernelized sorting. Twenty-Sixth AAAI Conference on Artificial Intelligence (2012).
  • Everitt, B. S. An introduction to latent variable models (1984).
  • Ghahramani, Z., et al. The EM algorithm for mixtures of factor analyzers. Tech. Rep. (1996).
  • Haghighi, A., et al. Learning bilingual lexicons from monolingual corpora. Proceedings of ACL-08: HLT (2008).
  • Hardoon, D. R., et al. Canonical correlation analysis: an overview with application to learning methods. Neural Computation (2004).
  • Harel, M., et al. Learning from multiple outlooks.
  • Hubert, L., et al. Comparing partitions. Journal of Classification (1985).
  • Iwata, T., et al. Unsupervised cluster matching via probabilistic latent variable models. Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013).
  • Iwata, T., et al. Learning common grammar from multilingual corpus. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010).
  • Jagarlamudi, J., et al. Kernelized sorting for natural language processing. AAAI ’10: Proceedings of the 24th AAAI Conference on Artificial Intelligence (2010).
  • Kimura, A., et al. SemiCCA: efficient semi-supervised learning of canonical correlations. Proceedings of the IAPR International Conference on Pattern Recognition, ICPR ’10 (2010).
  • Klami, A. Variational Bayesian matching. Proceedings of the Asian Conference on Machine Learning (2012).
  • Klami, A. Bayesian object matching. Machine Learning (2013).
  • Lawrence, N. D., et al. Non-linear matrix factorization with Gaussian processes. Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09 (2009).
  • LeCun, Y., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998).
  • Li, B., et al. Transfer learning for collaborative filtering via a rating-matrix generative model. Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09 (2009).