Information divergence estimation based on data-dependent partitions

https://doi.org/10.1016/j.jspi.2010.04.011

Abstract

This work studies the problem of information divergence estimation based on data-dependent partitions. A histogram-based data-dependent estimate is proposed, adopting a version of the Barron-type histogram-based estimate. The main result is the stipulation of sufficient conditions on the partition scheme to make the estimate strongly consistent. Furthermore, when the distributions are equipped with density functions in $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, we obtain sufficient conditions that guarantee a density-free strongly consistent information divergence estimate. In this context, the result is presented for two emblematic partition schemes: the statistically equivalent blocks (Gessaman's data-driven partition) and data-dependent tree-structured vector quantization (TSVQ).

Introduction

Let $P$ and $Q$ be probability measures defined on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, the finite-dimensional Euclidean space equipped with the Borel sigma field. The information divergence of $P$ with respect to $Q$ is then expressed by (see, e.g., Kullback, 1958, Gray, 1990)
$$D(P\|Q)=\sup_{\pi\in\mathcal{Q}}\sum_{A\in\pi}P(A)\cdot\log\frac{P(A)}{Q(A)},$$
where $\mathcal{Q}$ denotes the collection of finite measurable partitions of $\mathbb{R}^d$. For this quantity to be finite, it is necessary that $P\ll Q$ (Kullback, 1958), which makes $dP/dQ(x)$, the Radon–Nikodym (RN) derivative of $P$ with respect to $Q$, well defined. Considering the important case when $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure $\lambda$, i.e., $P\ll\lambda$ and $Q\ll\lambda$, it is sometimes convenient to use the following expression (see Gray, 1990):
$$D(P\|Q)=\int_{\mathbb{R}^d}p(x)\cdot\log\frac{p(x)}{q(x)}\,d\lambda(x),$$
where $p(x)=dP/d\lambda(x)$ and $q(x)=dQ/d\lambda(x)$ are the density functions of $P$ and $Q$, respectively. The information divergence, also known as the Kullback–Leibler (KL) divergence or relative entropy, is a well-known fundamental quantity in statistics and information theory (Kullback, 1958, Cover and Thomas, 1991, Gray, 1990). In statistics, the KL divergence expresses the average information per observation to discriminate between two probabilistic models (Kullback, 1958). In large deviations, it characterizes the rate function, which captures the exponential rate at which empirical measures converge to their underlying probabilities, Sanov's theorem (see, e.g., den Hollander, 2000), and the rate of decay of the probability of error in a binary hypothesis testing problem, Stein's lemma (see Cover and Thomas, 1991).
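As a purely illustrative numerical check of the two equivalent expressions above (not part of the original paper; the Gaussian choices of $P$ and $Q$, the grid, and the partition are assumptions made only for this example), the following Python sketch approximates $D(P\|Q)$ both from the densities and from a finite partition, and compares the results with the closed-form Gaussian divergence.

```python
# Illustrative sketch (not from the paper): approximating D(P||Q) for two 1-D
# Gaussians, (i) from the densities p, q and (ii) from a finite partition.
import numpy as np
from scipy.stats import norm

p, q = norm(0.0, 1.0), norm(0.5, 1.5)           # hypothetical choices of P and Q

# (i) Density form: integral of p(x) log(p(x)/q(x)) dlambda(x), on a fine grid.
x = np.linspace(-10, 10, 200001)
px, qx = p.pdf(x), q.pdf(x)
kl_density = np.sum(px * np.log(px / qx)) * (x[1] - x[0])

# (ii) Partition form: sum over cells A of P(A) log(P(A)/Q(A)).
edges = np.linspace(-10, 10, 201)                # a finite measurable partition of an interval
PA = np.diff(p.cdf(edges))
QA = np.diff(q.cdf(edges))
mask = PA > 0                                    # cells with P(A) = 0 contribute 0
kl_partition = np.sum(PA[mask] * np.log(PA[mask] / QA[mask]))

# Closed form for Gaussians, for reference.
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_density, kl_partition, kl_exact)
```

Since the divergence is a supremum over finite partitions, the partition-based sum can only under-estimate $D(P\|Q)$, approaching it as the cells shrink.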

On the application side, mainly because of its role as a discriminative measure (Kullback, 1958), the information divergence has found wide use in statistical learning-decision problems. It has been adopted as an optimality criterion for parameter re-estimation (Singer and Warmuth, 1996, Juang and Rabiner, 1985), as a similarity measure for modeling, clustering and indexing (Vasconcelos, 2004b, Vasconcelos, 2000, Do and Vetterli, 2002), as an indicator to quantify the effect of estimation error in a Bayes decision approach (Vasconcelos, 2004a, Silva and Narayanan, 2009), to quantify the approximation error of vector quantization in statistical hypothesis testing (Jain et al., 2002, Poor and Thomas, 1977), and as a fidelity indicator for feature selection and feature extraction (Saito and Coifman, 1994, Novovicova et al., 1996). These learning scenarios do not have access to the distributions and consequently rely on empirical data to estimate this quantity. A standard setting considers $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ to be independent and identically distributed (i.i.d.) realizations of $P$ and $Q$, respectively. The problem then becomes finding a distribution-free function, or estimator, $\hat{D}(\cdot)$ from $\mathbb{R}^{d\cdot n}\times\mathbb{R}^{d\cdot n}$ to $\mathbb{R}$ such that $\hat{D}(X_1,\dots,X_n,Y_1,\dots,Y_n)$ converges to $D(P\|Q)$ almost surely as $n$ tends to infinity (strong consistency).

In this regard, the closely related problem of differential entropy estimation has been systematically studied for distributions equipped with densities, adopting for instance non-parametric histogram-based, kernel-based and nearest-neighbor techniques. In these settings, the conditions for density-free strong consistency are well understood. An excellent review can be found in Beirlant et al. (1997) and some recent contributions in Darbellay and Vajda (1999) and Paninski, 2003, Paninski, 2008. Another closely related problem is non-parametric density estimation, as the KL divergence is a functional of two probability measures. In this context the classical problem of strong consistency in the L1 sense is well understood (Lugosi and Nobel, 1996, Devroye and Györfi, 1985). More recent work on non-parametric distribution estimation considers consistency under stronger notions (Györfi and van der Meulen, 1994, Barron et al., 1992, Györfi et al., 1998, Berlinet et al., 1998). In particular, the seminal work of Barron et al. (1992) proposed variations of classical histogram-based density estimates to achieve consistency in two types of information divergences, motivated by the learning problem of universal lossy compression. This approach has been extended by Györfi et al. (1998) and Berlinet et al. (1998) to the problems of consistency in χ²-divergence and in Csiszár's ϕ-divergence (of which the information divergence is a particular case), respectively. Although the two aforementioned research lines have been systematically explored, to the best of our knowledge, their estimates and results do not extend directly to the consistent estimation of the information divergence. The main reason is that the learning setting here is different. On the one hand, we need to consider finite samples from the two distributions, P and Q, while on the other, we need to infer the distributions from the data in a way that is appropriate to the particular nature of the information divergence functional. However, because of their inherent connections, the extension of techniques and results from distribution and differential entropy estimation to KL divergence estimation is an important direction to explore.

In that spirit, there have been some recent contributions, in particular for $P$ and $Q$ defined on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ and both absolutely continuous with respect to the Lebesgue measure $\lambda$. The first important reference in this regard is Wang et al. (2005), who proposed a histogram-based divergence estimate based on partitioning the space into statistically equivalent intervals. Sufficient conditions on the proposed data-driven partition were stipulated to guarantee strong consistency. Silva and Narayanan (2007) took this direction a step further, finding consistency conditions for a general family of data-driven partition schemes. The main limitation of these two works is that they are only valid when the sample sizes of P and Q are taken to infinity in a specific order, one after the other, which limits their applicability. Alternatively, Nguyen et al. (2007) proposed a variational approach to estimate the divergence (see Gray, 1990, den Hollander, 2000). Under certain approximation assumptions and smoothness conditions on the likelihood ratio, strong consistency and asymptotic rates of convergence for the proposed estimate were obtained. More recently, Wang et al. (2009) proposed nearest-neighbor techniques, where mean-square consistency was the main focus of analysis.

In this work we present contributions in the area of histogram-based information divergence estimation, in particular studying data-driven partition schemes (Lugosi and Nobel, 1996, Nobel, 1996, Devroye et al., 1996, Darbellay and Vajda, 1999). We significantly improve the initial findings in Wang et al. (2005) and Silva and Narayanan (2007): we reformulate the problem and propose new estimates and results that properly address the case where the sample sizes of P and Q jointly tend to infinity, and, furthermore, we report new practical implications by obtaining concrete density-free KL divergence estimates from previously unexplored multivariate data-driven partition schemes.

Specifically, in Section 3 we present the general histogram-based estimation scheme. This scheme quantizes the space as a function of the data and constructs a version of the Barron-type histogram-based density estimate (Barron et al., 1992) as a way to approximate $dP/dQ(x)$, which can be considered the sufficient statistic for the problem. Then, assuming that $D(P\|Q)<\infty$, Theorem 4 in Section 5 characterizes sufficient conditions on the partition scheme to make the estimate strongly consistent. This result does not require P and Q to be absolutely continuous with respect to $\lambda$, and furthermore, it is valid for distributions defined on a general measurable space $(\mathcal{X},\mathcal{S})$. Concerning the approximation error presented in Section 4, we adopt Csiszár's notion of asymptotically sufficient partitions (Csiszár, 1967, Csiszár, 1973), and when $P\ll\lambda$ and $Q\ll\lambda$, Theorem 2 presents a condition for this error to vanish based on a shrinking cell property for data-driven partitions (Lugosi and Nobel, 1996, Breiman et al., 1984, Devroye et al., 1996). For the estimation error, in Section 5, we use the Vapnik–Chervonenkis (VC) inequality (Vapnik and Chervonenkis, 1971; Vapnik, 1998; see also Devroye et al., 1996, Lugosi and Nobel, 1996) and characterize a concentration result for the empirical distributions (Lemma 3) that makes this error tend to zero with probability one as $n$ tends to infinity.
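For clarity, the two error terms analyzed in Sections 4 and 5 can be read off from the following standard decomposition (stated here in our notation purely as an illustration; $\hat{D}_n$ stands for the estimate in (8) and $D_{\pi_n}(P\|Q)$ for the divergence restricted to the cells of $\pi_n$):
$$\hat{D}_n-D(P\|Q)=\underbrace{\bigl[\hat{D}_n-D_{\pi_n}(P\|Q)\bigr]}_{\text{estimation error}}+\underbrace{\bigl[D_{\pi_n}(P\|Q)-D(P\|Q)\bigr]}_{\text{approximation error}},\qquad D_{\pi_n}(P\|Q)\equiv\sum_{A\in\pi_n}P(A)\log\frac{P(A)}{Q(A)}.$$
The approximation term vanishes when the data-driven partitions are asymptotically sufficient, while the estimation term is controlled by the concentration of the empirical measures on the cells of $\pi_n$.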

In the second part of this work, we explore applications of our main result. In Section 7 consistency is demonstrated for multivariate statistically equivalent blocks—Gessaman's (1970) data-dependent partition—while Section 8 shows equivalent results for tree-structured vector quantization (TSVQ) (Devroye et al., 1996, Breiman et al., 1984, Nobel, 2002). Importantly, in both settings a range of parameter values is characterized to obtain a family of density-free consistent estimates. The main challenge in deriving these results is to prove the adopted shrinking cell condition, which is achieved by exploiting the adaptive nature of the data-driven partition schemes. Finally, some of the proofs and derivations are organized in the appendix.

Section snippets

Preliminaries

This section provides notation and key results used throughout the rest of the paper.

The data-driven estimator

Let $P$ and $Q$ be probability measures in $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ such that $D(P\|Q)<\infty$. For the learning problem let us consider $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ i.i.d. realizations of random variables in $\mathbb{R}^d$ driven by $P$ and $Q$, respectively, and let $\Pi=\{\pi_1,\pi_2,\dots\}$ be a data-driven partition scheme for $\mathbb{R}^d$. We propose a plug-in histogram-based estimate for the information divergence of the form
$$D_{\pi_n(Y_1,\dots,Y_n)}(P_n^*\|Q_n)\equiv\sum_{A\in\pi_n(Y_1,\dots,Y_n)}P_n^*(A)\cdot\log\frac{P_n^*(A)}{Q_n(A)},$$
where $P_n^*$ is a Barron et al. (1992) type of empirical measure given by $P_n^*(A)$ (…
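A minimal computational sketch of this plug-in construction follows. It is illustrative only: the partition, the samples, and in particular the Barron-type smoothing $P_n^*(A)=(1-a_n)P_n(A)+a_n Q_n(A)$ used below are assumptions in the spirit of Barron et al. (1992), since the exact definition of $P_n^*$ is truncated in this snippet.

```python
# Illustrative sketch (not the paper's exact construction) of the plug-in
# histogram divergence estimate on a given partition of the real line.
import numpy as np

def plug_in_divergence(x, y, edges, a_n):
    """x ~ P, y ~ Q (1-D i.i.d. samples); `edges` are the cell boundaries of pi_n."""
    Pn, _ = np.histogram(x, bins=edges)
    Qn, _ = np.histogram(y, bins=edges)
    Pn = Pn / len(x)                        # empirical measure of P on each cell
    Qn = Qn / len(y)                        # empirical measure of Q on each cell
    Pstar = (1.0 - a_n) * Pn + a_n * Qn     # assumed Barron-type smoothing toward Q_n
    mask = (Pstar > 0) & (Qn > 0)           # safeguard for empty cells in this toy example
    return np.sum(Pstar[mask] * np.log(Pstar[mask] / Qn[mask]))

# Hypothetical usage: Gaussian samples and a uniform partition of [-10, 10].
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
y = rng.normal(0.5, 1.5, size=5000)
print(plug_in_divergence(x, y, np.linspace(-10, 10, 41), a_n=5000 ** -0.25))
```

The guard on cells with $Q_n(A)=0$ is only a safeguard for this toy example; in the actual scheme, condition (d) of Theorem 4 keeps the empirical $Q$-mass of every cell bounded away from zero.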

Approximation error analysis

In this section we study the approximation quality of data-dependent partition schemes $\Pi$ for the information divergence estimation (8). This notion is strongly related to the concept of asymptotically sufficient partitions developed by Csiszár, 1967, Csiszár, 1973 and to recent extensions presented by Vajda (2002), Liese et al. (2006) and Berlinet and Vajda (2005) (see also Liese and Vajda, 1987). The main difference here is that we are dealing with data-dependent partitions driven by an

The main result

Theorem 4

Let $P$ and $Q$ be probability measures in $(\mathcal{X},\mathcal{S})$ such that $D(P\|Q)<\infty$. Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ be i.i.d. realizations of $P$ and $Q$, respectively, and $\Pi=\{\pi_1,\pi_2,\dots\}$ a partition scheme with associated sequence of measurable partitions $\mathcal{A}_1,\mathcal{A}_2,\dots$. If for some $l\in(0,1)$, there exist $p\in(0,l/2)$, $\tau\in(0,l-2p]$ and a non-negative sequence $(k_n)_{n\in\mathbb{N}}$ such that

  • (a)

    $(k_n)\approx(n^{0.5+l/2})$, $(a_n)\approx(n^{-p})$ and $(a_n)=o(1)$, and on $\Pi$ we impose that:

  • (b)

    $\lim_{n\to\infty}n^{-\tau}\,\mathcal{M}(\mathcal{A}_n)=0$,

  • (c)

    $\lim_{n\to\infty}n^{-\tau}\log\Delta_n^*(\mathcal{A}_n)=0$,

  • (d)

    $\forall n\in\mathbb{N}$, $\forall(y_1,\dots,y_n)\in\mathcal{X}^n$, $\inf_{A\in\pi_n(y_1^n)}Q_n(A)\geq k_n/n$,

  • (e)

    and Π is

Applications

In this section we address two practical questions. First, is there a partition scheme that, using the histogram-based estimate in (8), provides a strongly consistent KL divergence estimator, distribution-free for a family of probability measures? Second, assuming a positive answer to the previous question, what ranges of design values in these constructions guarantee this result?

To address these questions, we study how the set of sufficient conditions presented in Theorem 2,

$l_n$-spacing partition rule for $\mathbb{R}$

Let us first start with a simple scenario. Consider the real line $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ as the measurable space and a partition scheme that partitions the space into statistically equivalent intervals. This was the setting explored by Wang et al. (2005). More precisely, let $Y_1,\dots,Y_n$ be i.i.d. realizations drawn from $Q\in\mathcal{P}_\lambda(\mathbb{R})$, the class of probability measures absolutely continuous with respect to $\lambda$. The order statistics $Y_{(1)},\dots,Y_{(n)}$ are defined as the permutation of $Y_1,\dots,Y_n$ such that $Y_{(1)}<Y_{(2)}<\dots<Y_{(n)}$; this permutation exists with probability one since $Q\ll\lambda$. Based on this sequence,
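A minimal sketch of this one-dimensional construction is given below (the number of cells and the sampling distribution are illustrative assumptions, not the paper's design rule): the axis is split into statistically equivalent intervals whose endpoints are order statistics of the $Y$-sample, so every cell receives roughly the same empirical $Q$-mass.

```python
# Sketch: statistically equivalent intervals on R built from the order statistics of Y.
# The cell count below is an illustrative choice, not the paper's design rule.
import numpy as np

def equivalent_interval_edges(y, n_cells):
    """Return the n_cells - 1 interior boundaries (order statistics of y) of a
    partition of R into n_cells intervals with roughly equal empirical Q-mass."""
    y_sorted = np.sort(y)                                  # Y_(1) <= ... <= Y_(n)
    idx = (np.arange(1, n_cells) * len(y)) // n_cells - 1  # splitting order statistics
    return y_sorted[idx]

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.5, size=5000)                        # i.i.d. sample from a hypothetical Q
n_cells = int(len(y) ** 0.5)                               # hypothetical growth rule for the cell count
edges = equivalent_interval_edges(y, n_cells)
cells = np.searchsorted(edges, y)                          # cell index of each sample, 0 .. n_cells-1
counts = np.bincount(cells, minlength=n_cells)
print(counts.min(), counts.max())                          # each cell holds ~ len(y)/n_cells points
```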

Tree-structured partition schemes

We start with some definitions and preliminaries to facilitate the exposition of the main result in Section 8.3.

Final remarks

The main result in Theorem 4 and its applications (Theorem 6, Theorem 8) suggest that the information divergence estimation problem imposes more restrictions, in terms of data-driven design conditions, than the problem of density estimation consistent in the L1 sense (Lugosi and Nobel, 1996), in particular for the reference measure Q. This conjecture agrees with findings on density-free estimation of information-theoretic quantities (Györfi and van der Meulen, 1987) and the convergence

Acknowledgments

The work of J. Silva was supported by funding from FONDECYT Grant 1090138, CONICYT-Chile. The work of S.S. Narayanan was supported by funding from the National Science Foundation (NSF).

References (53)

  • A. Barron et al.

    Distribution estimation consistent in total variation and in two types of information divergence

    IEEE Transactions on Information Theory

    (1992)
  • J. Beirlant et al.

    Nonparametric entropy estimation: an overview

    International Journal of Mathematical and Statistical Sciences

    (1997)
  • A. Berlinet et al.

    On asymptotic sufficiency and optimality of quantizations

    Journal of Statistical Planning and Inference

    (2005)
  • A. Berlinet et al.

    About the asymptotic accuracy of Barron density estimate

    IEEE Transactions on Information Theory

    (1998)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • T.M. Cover

    Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition

    IEEE Transactions on Electronic Computers

    (1965)
  • T.M. Cover et al.

    Elements of Information Theory

    (1991)
  • I. Csiszár

    Information-type measures of difference of probability distributions and indirect observations

    Studia Scientiarum Mathematicarum Hungarica

    (1967)
  • Csiszár, I., 1973. Generalized entropy and quantization problems. In: Academia (Ed.), Transactions on Sixth Prague...
  • I. Csiszár et al.

    Information Theory and Statistics: A Tutorial

    (2004)
  • G.A. Darbellay et al.

    Estimation of the information by an adaptive partition of the observation space

    IEEE Transactions on Information Theory

    (1999)
  • F. den Hollander

    Large Deviations

    (2000)
  • L. Devroye et al.

    Nonparametric Density Estimation: The L1 View

    (1985)
  • L. Devroye et al.

    A Probabilistic Theory of Pattern Recognition

    (1996)
  • L. Devroye et al.

    Combinatorial Methods in Density Estimation

    (2001)
  • M.N. Do et al.

    Wavelet-based texture retrieval using generalized Gaussian densities and Kullback–Leibler distance

    IEEE Transactions on Image Processing

    (2002)
  • M.P. Gessaman

    A consistent nonparametric multivariate density estimator based on statistically equivalent blocks

    The Annals of Mathematical Statistics

    (1970)
  • R.M. Gray

    Entropy and Information Theory

    (1990)
  • L. Györfi et al.

    Distribution estimates consistent in χ²-divergence

    Statistics

    (1998)
  • L. Györfi et al.

    Density-free convergence properties of various estimators of entropy

    Computational Statistics and Data Analysis

    (1987)
  • Györfi, L., van der Meulen, E.C., 1994. Density estimation consistent in information divergence. In: IEEE International...
  • P.R. Halmos

    Measure Theory

    (1950)
  • A. Jain et al.

    Information-theoretic bounds on target recognition performance based on degraded image data

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • B.H. Juang et al.

    A probabilistic distance measure for hidden Markov models

    AT&T Technical Journal

    (1985)
  • S. Kullback

    Information Theory and Statistics

    (1958)
  • F. Liese et al.

    Asymptotically sufficient partition and quantization

    IEEE Transactions on Information Theory

    (2006)