Information divergence estimation based on data-dependent partitions

https://doi.org/10.1016/j.jspi.2010.04.011

Abstract

This work studies the problem of information divergence estimation based on data-dependent partitions. A histogram-based data-dependent estimate is proposed, adopting a version of the Barron-type histogram-based estimate. The main result is the stipulation of sufficient conditions on the partition scheme to make the estimate strongly consistent. Furthermore, when the distributions are equipped with density functions in $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, we obtain sufficient conditions that guarantee a density-free strongly consistent information divergence estimate. In this context, the result is presented for two emblematic partition schemes: the statistically equivalent blocks (Gessaman's data-driven partition) and data-dependent tree-structured vector quantization (TSVQ).

Introduction

Let $P$ and $Q$ be probability measures defined on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, the finite-dimensional Euclidean space equipped with the Borel sigma field. The information divergence of $P$ with respect to $Q$ is then expressed by (see, e.g., Kullback, 1958, Gray, 1990)
$$D(P\|Q)=\sup_{\pi\in\mathcal{Q}}\sum_{A\in\pi}P(A)\cdot\log\frac{P(A)}{Q(A)},$$
where $\mathcal{Q}$ denotes the collection of finite measurable partitions of $\mathbb{R}^d$. For this quantity to be finite, it is necessary that $P\ll Q$ (Kullback, 1958), which makes $dP/dQ(x)$, the Radon–Nikodym (RN) derivative of $P$ with respect to $Q$, well defined. Considering the important case when $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure $\lambda$, i.e., $P\ll\lambda$ and $Q\ll\lambda$, it is sometimes convenient to use the following expression (see Gray, 1990):
$$D(P\|Q)=\int_{\mathbb{R}^d}p(x)\cdot\log\frac{p(x)}{q(x)}\,d\lambda(x),$$
where $p(x)=dP/d\lambda(x)$ and $q(x)=dQ/d\lambda(x)$ are the density functions of $P$ and $Q$, respectively. The information divergence, also known as the Kullback–Leibler (KL) divergence or relative entropy, is a well-known fundamental quantity in statistics and information theory (Kullback, 1958, Cover and Thomas, 1991, Gray, 1990). In statistics, the KL divergence expresses the average information per observation to discriminate between two probabilistic models (Kullback, 1958). In large deviations, it characterizes the rate function, which captures the exponential rate at which empirical measures converge to their underlying probabilities, Sanov's theorem (see, e.g., den Hollander, 2000), and the rate of decay of the probability of error in a binary hypothesis testing problem, Stein's lemma (see Cover and Thomas, 1991).
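As a purely illustrative numerical check of the two equivalent expressions above (not part of the original paper; the Gaussian choices of $P$ and $Q$, the grid, and the partition are assumptions made only for this example), the following Python sketch approximates $D(P\|Q)$ both from the densities and from a finite partition, and compares the results with the closed-form Gaussian divergence.

```python
# Illustrative sketch (not from the paper): approximating D(P||Q) for two 1-D
# Gaussians, (i) from the densities p, q and (ii) from a finite partition.
import numpy as np
from scipy.stats import norm

p, q = norm(0.0, 1.0), norm(0.5, 1.5)           # hypothetical choices of P and Q

# (i) Density form: integral of p(x) log(p(x)/q(x)) dlambda(x), on a fine grid.
x = np.linspace(-10, 10, 200001)
px, qx = p.pdf(x), q.pdf(x)
kl_density = np.sum(px * np.log(px / qx)) * (x[1] - x[0])

# (ii) Partition form: sum over cells A of P(A) log(P(A)/Q(A)).
edges = np.linspace(-10, 10, 201)                # a finite measurable partition of an interval
PA = np.diff(p.cdf(edges))
QA = np.diff(q.cdf(edges))
mask = PA > 0                                    # cells with P(A) = 0 contribute 0
kl_partition = np.sum(PA[mask] * np.log(PA[mask] / QA[mask]))

# Closed form for Gaussians, for reference.
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_density, kl_partition, kl_exact)
```

Since the divergence is a supremum over finite partitions, the partition-based sum can only under-estimate $D(P\|Q)$, approaching it as the cells shrink.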

On the application side, mainly because of its role as a discriminative measure (Kullback, 1958), the information divergence has found wide use in statistical learning-decision problems. It has been adopted as an optimality criterion for parameter re-estimation (Singer and Warmuth, 1996, Juang and Rabiner, 1985), as a similarity measure for modeling, clustering and indexing (Vasconcelos, 2004b, Vasconcelos, 2000, Do and Vetterli, 2002), as an indicator to quantify the effect of estimation error in a Bayes decision approach (Vasconcelos, 2004a, Silva and Narayanan, 2009), to quantify the approximation error of vector quantization in statistical hypothesis testing (Jain et al., 2002, Poor and Thomas, 1977), and as a fidelity indicator for feature selection and feature extraction (Saito and Coifman, 1994, Novovicova et al., 1996). These learning scenarios do not have access to the distributions and consequently rely on empirical data to estimate this quantity. A standard setting considers $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ to be independent and identically distributed (i.i.d.) realizations of $P$ and $Q$, respectively. The problem then becomes finding a distribution-free function, or estimator, $\hat{D}(\cdot)$ from $\mathbb{R}^{d\cdot n}\times\mathbb{R}^{d\cdot n}$ to $\mathbb{R}$ such that $\hat{D}(X_1,\dots,X_n,Y_1,\dots,Y_n)$ converges to $D(P\|Q)$ almost surely as $n$ tends to infinity (strong consistency).

In this regard, the closely related problem of differential entropy estimation has been systematically studied for distributions equipped with densities, adopting for instance non-parametric histogram-based, kernel-based and nearest-neighbor techniques. In these settings, the conditions for density-free strong consistency are well understood. An excellent review can be found in Beirlant et al. (1997) and some recent contributions in Darbellay and Vajda (1999) and Paninski, 2003, Paninski, 2008. Another closely related problem is non-parametric density estimation, as the KL divergence is a functional of two probability measures. In this context the classical problem of strong consistency in the L1 sense is well understood (Lugosi and Nobel, 1996, Devroye and Györfi, 1985). More recent work on non-parametric distribution estimation considers consistency under stronger notions (Györfi and van der Meulen, 1994, Barron et al., 1992, Györfi et al., 1998, Berlinet et al., 1998). In particular, the seminal work of Barron et al. (1992) proposed variations of classical histogram-based density estimates to achieve consistency in two types of information divergences, motivated by the learning problem of universal lossy compression. This approach has been extended by Györfi et al. (1998) and Berlinet et al. (1998) to the problems of consistency in χ²-divergence and in Csiszár's ϕ-divergence (of which the information divergence is a particular case), respectively. Although the two aforementioned research lines have been systematically explored, to the best of our knowledge, their estimates and results do not extend directly to the consistent estimation of the information divergence. The main reason is that the learning setting here is different. On the one hand, we need to consider finite samples from the two distributions, P and Q, while on the other, we need to infer the distributions from the data in a way that is appropriate to the particular nature of the information divergence functional. However, because of their inherent connections, the extension of techniques and results from distribution and differential entropy estimation to KL divergence estimation is an important direction to explore.

In that spirit, there have been some recent contributions, in particular for $P$ and $Q$ defined on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ and both absolutely continuous with respect to the Lebesgue measure $\lambda$. The first important reference in this regard is Wang et al. (2005), who proposed a histogram-based divergence estimate based on partitioning the space into statistically equivalent intervals. Sufficient conditions on the proposed data-driven partition were stipulated to guarantee strong consistency. Silva and Narayanan (2007) took this direction a step further, finding consistency conditions for a general family of data-driven partition schemes. The main limitation of these two works is that they are only valid when the sample sizes of P and Q are taken to infinity in a specific order, one after the other, which limits their applicability. Alternatively, Nguyen et al. (2007) proposed a variational approach to estimate the divergence (see Gray, 1990, den Hollander, 2000). Under certain approximation assumptions and smoothness conditions on the likelihood ratio, strong consistency and asymptotic rates of convergence for the proposed estimate were obtained. More recently, Wang et al. (2009) proposed nearest-neighbor techniques, where mean-square consistency was the main focus of analysis.

In this work we present contributions in the area of histogram-based information divergence estimation, in particular studying data-driven partition schemes (Lugosi and Nobel, 1996, Nobel, 1996, Devroye et al., 1996, Darbellay and Vajda, 1999). We significantly improve the initial findings in Wang et al. (2005) and Silva and Narayanan (2007): we reformulate the problem and propose new estimates and results that properly address the case where the sample sizes of P and Q jointly tend to infinity, and, furthermore, we report new practical implications by obtaining concrete density-free KL divergence estimates from previously unexplored multivariate data-driven partition schemes.

Specifically, in Section 3 we present the general histogram-based estimation scheme. This scheme quantizes the space as a function of the data and constructs a version of the Barron-type histogram-based density estimate (Barron et al., 1992) as a way to approximate $dP/dQ(x)$, which can be considered the sufficient statistic for the problem. Then, assuming that $D(P\|Q)<\infty$, Theorem 4 in Section 5 characterizes sufficient conditions on the partition scheme to make the estimate strongly consistent. This result does not require P and Q to be absolutely continuous with respect to $\lambda$, and furthermore, it is valid for distributions defined on a general measurable space $(\mathcal{X},\mathcal{S})$. Concerning the approximation error presented in Section 4, we adopt Csiszár's notion of asymptotically sufficient partitions (Csiszár, 1967, Csiszár, 1973), and when $P\ll\lambda$ and $Q\ll\lambda$, Theorem 2 presents a condition for this error to vanish based on a shrinking cell property for data-driven partitions (Lugosi and Nobel, 1996, Breiman et al., 1984, Devroye et al., 1996). For the estimation error, in Section 5, we use the Vapnik–Chervonenkis (VC) inequality (Vapnik and Chervonenkis, 1971; Vapnik, 1998; see also Devroye et al., 1996, Lugosi and Nobel, 1996) and characterize a concentration result for the empirical distributions (Lemma 3) that makes this error tend to zero with probability one as $n$ tends to infinity.
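For clarity, the two error terms analyzed in Sections 4 and 5 can be read off from the following standard decomposition (stated here in our notation purely as an illustration; $\hat{D}_n$ stands for the estimate in (8) and $D_{\pi_n}(P\|Q)$ for the divergence restricted to the cells of $\pi_n$):
$$\hat{D}_n-D(P\|Q)=\underbrace{\bigl[\hat{D}_n-D_{\pi_n}(P\|Q)\bigr]}_{\text{estimation error}}+\underbrace{\bigl[D_{\pi_n}(P\|Q)-D(P\|Q)\bigr]}_{\text{approximation error}},\qquad D_{\pi_n}(P\|Q)\equiv\sum_{A\in\pi_n}P(A)\log\frac{P(A)}{Q(A)}.$$
The approximation term vanishes when the data-driven partitions are asymptotically sufficient, while the estimation term is controlled by the concentration of the empirical measures on the cells of $\pi_n$.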

In the second part of this work, we explore applications of our main result. In Section 7 consistency is demonstrated for multivariate statistically equivalent blocks—Gessaman's (1970) data-dependent partition—while Section 8 shows equivalent results for tree-structured vector quantization (TSVQ) (Devroye et al., 1996, Breiman et al., 1984, Nobel, 2002). Importantly, in both settings a range of parameter values is characterized to obtain a family of density-free consistent estimates. The main challenge in deriving these results is to prove the adopted shrinking cell condition, which is achieved by exploiting the adaptive nature of the data-driven partition schemes. Finally, some of the proofs and derivations are organized in the appendix.

Section snippets

Preliminaries

This section provides notation and key results used throughout the rest of the paper.

The data-driven estimator

Let $P$ and $Q$ be probability measures in $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ such that $D(P\|Q)<\infty$. For the learning problem let us consider $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ i.i.d. realizations of random variables in $\mathbb{R}^d$ driven by $P$ and $Q$, respectively, and let $\Pi=\{\pi_1,\pi_2,\dots\}$ be a data-driven partition scheme for $\mathbb{R}^d$. We propose a plug-in histogram-based estimate for the information divergence of the form
$$D_{\pi_n(Y_1,\dots,Y_n)}(P_n^*\|Q_n)\equiv\sum_{A\in\pi_n(Y_1,\dots,Y_n)}P_n^*(A)\cdot\log\frac{P_n^*(A)}{Q_n(A)},$$
where $P_n^*$ is a Barron et al. (1992) type of empirical measure given by $P_n^*(A)$ (…
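A minimal computational sketch of this plug-in construction follows. It is illustrative only: the partition, the samples, and in particular the Barron-type smoothing $P_n^*(A)=(1-a_n)P_n(A)+a_n Q_n(A)$ used below are assumptions in the spirit of Barron et al. (1992), since the exact definition of $P_n^*$ is truncated in this snippet.

```python
# Illustrative sketch (not the paper's exact construction) of the plug-in
# histogram divergence estimate on a given partition of the real line.
import numpy as np

def plug_in_divergence(x, y, edges, a_n):
    """x ~ P, y ~ Q (1-D i.i.d. samples); `edges` are the cell boundaries of pi_n."""
    Pn, _ = np.histogram(x, bins=edges)
    Qn, _ = np.histogram(y, bins=edges)
    Pn = Pn / len(x)                        # empirical measure of P on each cell
    Qn = Qn / len(y)                        # empirical measure of Q on each cell
    Pstar = (1.0 - a_n) * Pn + a_n * Qn     # assumed Barron-type smoothing toward Q_n
    mask = (Pstar > 0) & (Qn > 0)           # safeguard for empty cells in this toy example
    return np.sum(Pstar[mask] * np.log(Pstar[mask] / Qn[mask]))

# Hypothetical usage: Gaussian samples and a uniform partition of [-10, 10].
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
y = rng.normal(0.5, 1.5, size=5000)
print(plug_in_divergence(x, y, np.linspace(-10, 10, 41), a_n=5000 ** -0.25))
```

The guard on cells with $Q_n(A)=0$ is only a safeguard for this toy example; in the actual scheme, condition (d) of Theorem 4 keeps the empirical $Q$-mass of every cell bounded away from zero.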

Approximation error analysis

In this section we study the approximation quality of data-dependent partition schemes $\Pi$ for the information divergence estimation (8). This notion is strongly related to the concept of asymptotically sufficient partitions developed by Csiszár, 1967, Csiszár, 1973 and to recent extensions presented by Vajda (2002), Liese et al. (2006) and Berlinet and Vajda (2005) (see also Liese and Vajda, 1987). The main difference here is that we are dealing with data-dependent partitions driven by an

The main result

Theorem 4

Let $P$ and $Q$ be probability measures in $(\mathcal{X},\mathcal{S})$ such that $D(P\|Q)<\infty$. Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ be i.i.d. realizations of $P$ and $Q$, respectively, and $\Pi=\{\pi_1,\pi_2,\dots\}$ a partition scheme with associated sequence of measurable partitions $\mathcal{A}_1,\mathcal{A}_2,\dots$. If for some $l\in(0,1)$, there exist $p\in(0,l/2)$, $\tau\in(0,l-2p]$ and a non-negative sequence $(k_n)_{n\in\mathbb{N}}$ such that

  • (a)

    $(k_n)\approx(n^{0.5+l/2})$, $(a_n)\approx(n^{-p})$ and $(a_n)=o(1)$, and on $\Pi$ we impose that:

  • (b)

    $\lim_{n\to\infty}n^{-\tau}\,\mathcal{M}(\mathcal{A}_n)=0$,

  • (c)

    $\lim_{n\to\infty}n^{-\tau}\log\Delta_n^*(\mathcal{A}_n)=0$,

  • (d)

    $\forall n\in\mathbb{N}$, $\forall(y_1,\dots,y_n)\in\mathcal{X}^n$, $\inf_{A\in\pi_n(y_1^n)}Q_n(A)\geq k_n/n$,

  • (e)

    and Π is

Applications

In this section we address two practical questions. First, is there a partition scheme that, using the histogram-based estimate in (8), provides a strongly consistent KL divergence estimator, distribution-free for a family of probability measures? Second, assuming a positive answer to the previous question, what ranges of design values in these constructions guarantee this result?

To address these questions, we study how the set of sufficient conditions presented in Theorem 2,

$l_n$-spacing partition rule for $\mathbb{R}$

Let us first start with a simple scenario. Consider the real line $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ as the measurable space and a partition scheme that partitions the space into statistically equivalent intervals. This was the setting explored by Wang et al. (2005). More precisely, let $Y_1,\dots,Y_n$ be i.i.d. realizations drawn from $Q\in\mathcal{P}_\lambda(\mathbb{R})$, the class of probability measures absolutely continuous with respect to $\lambda$. The order statistics $Y_{(1)},\dots,Y_{(n)}$ are defined as the permutation of $Y_1,\dots,Y_n$ such that $Y_{(1)}<Y_{(2)}<\dots<Y_{(n)}$; this permutation exists with probability one since $Q\ll\lambda$. Based on this sequence,
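A minimal sketch of this one-dimensional construction is given below (the number of cells and the sampling distribution are illustrative assumptions, not the paper's design rule): the axis is split into statistically equivalent intervals whose endpoints are order statistics of the $Y$-sample, so every cell receives roughly the same empirical $Q$-mass.

```python
# Sketch: statistically equivalent intervals on R built from the order statistics of Y.
# The cell count below is an illustrative choice, not the paper's design rule.
import numpy as np

def equivalent_interval_edges(y, n_cells):
    """Return the n_cells - 1 interior boundaries (order statistics of y) of a
    partition of R into n_cells intervals with roughly equal empirical Q-mass."""
    y_sorted = np.sort(y)                                  # Y_(1) <= ... <= Y_(n)
    idx = (np.arange(1, n_cells) * len(y)) // n_cells - 1  # splitting order statistics
    return y_sorted[idx]

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.5, size=5000)                        # i.i.d. sample from a hypothetical Q
n_cells = int(len(y) ** 0.5)                               # hypothetical growth rule for the cell count
edges = equivalent_interval_edges(y, n_cells)
cells = np.searchsorted(edges, y)                          # cell index of each sample, 0 .. n_cells-1
counts = np.bincount(cells, minlength=n_cells)
print(counts.min(), counts.max())                          # each cell holds ~ len(y)/n_cells points
```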

Tree-structured partition schemes

We start with some definitions and preliminaries to facilitate the exposition of the main result in Section 8.3.

Final remarks

The main result in Theorem 4 and its applications (Theorem 6, Theorem 8) suggest that the information divergence estimation problem imposes more restrictions, in terms of data-driven design conditions, than the problem of density estimation consistent in the L1 sense (Lugosi and Nobel, 1996), in particular for the reference measure Q. This conjecture agrees with findings on density-free estimation of information-theoretic quantities (Györfi and van der Meulen, 1987) and the convergence

Acknowledgments

The work of J. Silva was supported by funding from FONDECYT Grant 1090138, CONICYT-Chile. The work of S.S. Narayanan was supported by funding from the National Science Foundation (NSF).

References (53)

  • A. Barron et al.

    Distribution estimation consistent in total variation and in two types of information divergence

    IEEE Transactions on Information Theory

    (1992)
  • J. Beirlant et al.

    Nonparametric entropy estimation: an overview

    International Journal of Mathematical and Statistical Sciences

    (1997)
  • A. Berlinet et al.

    On asymptotic sufficiency and optimality of quantizations

    Journal of Statistical Planning and Inference

    (2005)
  • A. Berlinet et al.

    About the asymptotic accuracy of Barron density estimate

    IEEE Transactions on Information Theory

    (1998)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • T.M. Cover

    Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition

    IEEE Transactions on Electronic Computers

    (1965)
  • T.M. Cover et al.

    Elements of Information Theory

    (1991)
  • I. Csiszár

    Information-type measures of difference of probability distributions and indirect observations

    Studia Scientiarum Mathematicarum Hungarica

    (1967)
  • Csiszár, I., 1973. Generalized entropy and quantization problems. In: Academia (Ed.), Transactions on Sixth Prague...
  • I. Csiszár et al.

    Information Theory and Statistics: A Tutorial

    (2004)
  • G.A. Darbellay et al.

    Estimation of the information by an adaptive partition of the observation space

    IEEE Transactions on Information Theory

    (1999)
  • F. den Hollander

    Large Deviations

    (2000)
  • L. Devroye et al.

    Nonparametric Density Estimation: The L1 View

    (1985)
  • L. Devroye et al.

    A Probabilistic Theory of Pattern Recognition

    (1996)
  • L. Devroye et al.

    Combinatorial Methods in Density Estimation

    (2001)
  • M.N. Do et al.

    Wavelet-based texture retrieval using generalized Gaussian densities and Kullback–Leibler distance

    IEEE Transactions on Image Processing

    (2002)
  • M.P. Gessaman

    A consistent nonparametric multivariate density estimator based on statistically equivalent blocks

    The Annals of Mathematical Statistics

    (1970)
  • R.M. Gray

    Entropy and Information Theory

    (1990)
  • L. Györfi et al.

    Distribution estimates consistent in χ²-divergence

    Statistics

    (1998)
  • L. Györfi et al.

    Density-free convergence properties of various estimators of entropy

    Computational Statistics and Data Analysis

    (1987)
  • Györfi, L., van der Meulen, E.C., 1994. Density estimation consistent in information divergence. In: IEEE International...
  • P.R. Halmos

    Measure Theory

    (1950)
  • A. Jain et al.

    Information-theoretic bounds on target recognition performance based on degraded image data

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • B.H. Juang et al.

    A probabilistic distance measure for hidden Markov models

    AT&T Technical Journal

    (1985)
  • S. Kullback

    Information Theory and Statistics

    (1958)
  • F. Liese et al.

    Asymptotically sufficient partition and quantization

    IEEE Transactions on Information Theory

    (2006)