Pattern Recognition

Volume 44, Issue 2, February 2011, Pages 265-277

On the distance concentration awareness of certain data reduction techniques

https://doi.org/10.1016/j.patcog.2010.08.018

Abstract

We make a first investigation into a recently raised concern about the suitability of existing data analysis techniques when faced with the counter-intuitive properties of high dimensional data spaces, such as the phenomenon of distance concentration. Under the structural assumption of a generic linear model with a latent variable and additive unstructured noise, we find that dimension reduction that explicitly guards against distance concentration recovers the well-known techniques of Fisher's linear discriminant analysis, Fisher's discriminant ratio and a variant of projection pursuit. Extrapolation to regression uncovers a close link to sure independence screening, a recently proposed technique for variable selection in ultra-high dimensional feature spaces. Hence, these techniques may be seen as distance concentration aware, even though they have not been explicitly designed to have this property. Throughout our analysis, other than the dependency structure implied by the mentioned linear model, we make no assumptions about the distributions of the variables involved.

Introduction

Creating univariate projections of high dimensional data in directions that satisfy some specified criteria is often the method of choice in high dimensional data analysis. For instance, Fisher's linear discriminant analysis (FLDA) provides the direction that optimally preserves the class structure. In the unsupervised setting, projection pursuit seeks projections that are maximally non-Gaussian [10], e.g. having maximal kurtosis.
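
As a hedged illustration of the projection pursuit idea mentioned above, the following minimal sketch (our own, not from the paper) searches over unit-norm directions for the 1D projection with maximal sample kurtosis; the crude random search and all sizes are arbitrary illustrative choices.

```python
# A minimal sketch of kurtosis-based projection pursuit: search over unit-norm
# directions for the 1D projection with maximal sample (excess) kurtosis.
# The random search is used purely for illustration; it is not the paper's method.
import numpy as np

def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return (z ** 4).mean() - 3.0          # excess kurtosis of the 1D projection

def max_kurtosis_direction(X, n_candidates=5000, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_k = None, -np.inf
    for _ in range(n_candidates):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)             # restrict to unit-norm directions
        k = excess_kurtosis(X @ w)
        if k > best_k:
            best_w, best_k = w, k
    return best_w, best_k

# example: 500 points in 50 dimensions, one heavy-tailed (non-Gaussian) feature
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50))
X[:, 0] = rng.standard_t(df=3, size=500)   # 'interesting structure' in feature 0
w, k = max_kurtosis_direction(X)
print(f"max excess kurtosis found: {k:.2f}")
```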

Owing to the increasingly high dimensionality of data sets in a number of areas, most notably in cancer research, a serious concern has been raised recently [6] that questions the suitability of existing data analysis techniques when faced with the properties of high dimensional data spaces. In particular, in this study we are concerned with a specific aspect of the dimensionality curse, known as the phenomenon of distance concentration—that is, as the data dimensionality increases, all pairwise distances between points may become too similar to each other in certain cases [3], [6], [15]. Contrary to other problems of high dimensionality, like data sparseness and computational overhead, the phenomenon of distance concentration is rather counter-intuitive, and hence its effects are much less obvious.

Pattern recognition has been the basis for a series of generic methodologies with great potential in a wide range of domains, such as face recognition, spam filtering and multimedia applications. It has been key to computer-aided diagnosis systems that support the human expert's interpretations and findings. Data are becoming increasingly high dimensional in all such application areas. Yet, the effects of distance concentration have so far not been examined in the context of pattern recognition methodologies. As pointed out in [6], the existing tools have not been designed with an awareness of phenomena that only occur in high dimensions. To what extent is their use appropriate in such high dimensional problems? Since the phenomenon of distance concentration has now been completely characterised [3], [11], [20], we may now feasibly be able to address this issue.

Here we proceed in a systematic, model-driven manner. We examine some commonly used linear data reduction techniques specifically from the point of view of their awareness of the distance concentration problem, by looking at their effect on the relative variance of the pairwise inter-point distances under the generic model considered. The sequence of relative variances (indexed by data dimensionality) was previously shown to be the key quantity for describing the concentration of pairwise distances between points drawn from an arbitrary data distribution [3], [11], [20]. In addition, it has been shown [11] that this behaviour is governed by the (lack of) correlation structure among the data features relative to their noise content. Since pattern recognition would be impossible if the data had no structure, and since noise content is an inevitable reality, we find it most useful for our analysis to consider a data model that captures both of these characteristics.
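
To illustrate the role of the correlation structure, the following minimal sketch (our own illustration, loosely in the spirit of the latent-variable model under consideration) compares the relative variance of pairwise distances for purely unstructured i.i.d. features against features generated as x = a·y + noise; the former shrinks with dimensionality while the latter stays bounded away from zero. All sizes and scales are arbitrary choices.

```python
# A minimal sketch: relative variance of pairwise distances, Var[d]/E[d]^2,
# for (i) purely i.i.d. features and (ii) features driven by a shared latent
# variable plus noise, x = y*a + eps. Sizes and scales are illustrative only.
import numpy as np
from scipy.spatial.distance import pdist

def rel_var_of_distances(X):
    d = pdist(X)                     # all pairwise Euclidean distances
    return d.var() / d.mean() ** 2

rng = np.random.default_rng(0)
N = 100
for m in [10, 100, 1000, 10000]:
    noise = rng.standard_normal((N, m))
    X_iid = noise                                    # no structure, only noise
    a = rng.standard_normal(m)                       # loading vector
    y = rng.standard_normal(N)                       # latent variable
    X_lat = np.outer(y, a) + noise                   # latent structure + noise
    print(f"m={m:6d}  iid RV={rel_var_of_distances(X_iid):.4f}"
          f"  latent RV={rel_var_of_distances(X_lat):.4f}")
```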

We ask the following questions:

  • What is the direction of maximal inter-distance relative variance?

  • Which of the data features have the maximal inter-distance relative variance?

  • What statistical property should a latent variable have, so that its high dimensional expansion has maximal relative variance?

It turns out that, in these settings, the answer to the first two questions recovers Fisher's linear discriminant analysis, Fisher's discriminant ratio, and the recent technique of sure independence screening. The answer to the last of our questions provides a new justification of the kurtosis index, frequently employed in projection pursuit for finding ‘interesting structure’ in the data.
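
For the regression connection, sure independence screening amounts to ranking features by the magnitude of their marginal correlation with the response and retaining the top-ranked ones. The sketch below is our own minimal illustration of that idea (not the authors' code); the function name, the choice of d and the toy data are hypothetical.

```python
# A minimal sketch of sure independence screening (SIS): rank features by the
# absolute marginal correlation with the response y and keep the top d of them.
import numpy as np

def sis(X, y, d):
    """Return indices of the d features most correlated (marginally) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # componentwise sample correlations between each column of X and y
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12
    )
    return np.argsort(-np.abs(corr))[:d]

# toy usage: 50 samples, 2000 features, only the first 3 features carry signal
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2000))
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + 0.1 * rng.standard_normal(50)
print(sis(X, y, d=10))
```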

Section snippets

Distance concentration

Let $F_m$, $m=1,2,\ldots$ be an infinite sequence of data distributions and $x_1^{(m)},\ldots,x_N^{(m)}$ a random sample of $N$ independent data vectors distributed as $F_m$. For each $m$, let $\|\cdot\| : \mathrm{dom}(F_m) \to \mathbb{R}^{+}$ be a function that takes a point from the domain of $F_m$ and returns a positive real value. Further, $s>0$ will denote an arbitrary positive constant, and it is assumed that $E[\|x^{(m)}\|^{s}]$ and $\mathrm{Var}[\|x^{(m)}\|^{s}]$ are finite and $E[\|x^{(m)}\|^{s}] \neq 0$.

Theorem 2.1

Beyer et al. [3]

If $\lim_{m\to\infty} \mathrm{Var}[\|x^{(m)}\|^{s}] / E[\|x^{(m)}\|^{s}]^{2} = 0$, then $\forall \varepsilon>0$, $\lim_{m\to\infty} P\left[\max_{1\le j\le N} \|x_j^{(m)}\| < (1+\varepsilon) \min_{1\le j\le N} \|x_j^{(m)}\|\right] = 1$.
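
The theorem can be checked numerically; the sketch below is our own illustration using i.i.d. Gaussian features, the Euclidean norm, and arbitrary choices of $N$, $s$, $\varepsilon$ and the dimensions $m$: the relative variance of the norm vanishes with $m$, and the probability that the maximum norm lies within a factor $(1+\varepsilon)$ of the minimum approaches 1.

```python
# A minimal sketch checking Theorem 2.1 empirically for i.i.d. Gaussian features:
# the relative variance Var[||x||^s]/E[||x||^s]^2 shrinks with m, and the max/min
# norms concentrate. N, s, eps and the dimensions m are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N, s, eps, trials = 50, 1.0, 0.5, 200

for m in [2, 10, 100, 1000, 10000]:
    hits = 0
    norms_all = []
    for _ in range(trials):
        X = rng.standard_normal((N, m))
        norms = np.linalg.norm(X, axis=1) ** s
        norms_all.append(norms)
        hits += norms.max() < (1 + eps) * norms.min()
    norms_all = np.concatenate(norms_all)
    rel_var = norms_all.var() / norms_all.mean() ** 2
    print(f"m={m:6d}  Var/E^2={rel_var:.4f}  P[max<(1+eps)min]~{hits/trials:.2f}")
```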

A re-interpretation of Fisher's linear discriminant analysis

Consider the case when $y \in \{-1, 1\}$. Then, the model (1) is a two-class mixture density with shared covariance.

We seek a linear projection of the high dimensional data down to 1D, such that the relative variance of the projected data (w.r.t. the Euclidean distance and the above mentioned data model) is maximised. In other words, we ask how the data can be down-projected so that it is least concentrated for a subsequent two-class mixture-based classification.
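
For concreteness, the familiar Fisher direction for this two-class, shared-covariance setting can be computed as below. This is a standard FLDA sketch rather than the paper's derivation, and the synthetic data is an illustrative stand-in for the model of Eq. (1).

```python
# A minimal sketch of Fisher's linear discriminant direction for a two-class
# problem with shared covariance: w ∝ S_w^{-1} (mu_+ - mu_-). The synthetic data
# below is an illustrative stand-in, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
m, n_per_class = 200, 100
mu = rng.standard_normal(m)                         # class-dependent mean shift
X0 = -mu + rng.standard_normal((n_per_class, m))    # class y = -1
X1 = +mu + rng.standard_normal((n_per_class, m))    # class y = +1

# within-class scatter (shared covariance estimate); a small ridge term keeps the
# solve stable when the dimensionality exceeds the per-class sample size
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(Sw + 1e-6 * np.eye(m), X1.mean(0) - X0.mean(0))
w /= np.linalg.norm(w)

# 1D projections of both classes onto the Fisher direction
z0, z1 = X0 @ w, X1 @ w
print(f"projected class means: {z0.mean():.2f} vs {z1.mean():.2f}")
```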

Denote by $RV_m^{\mathrm{lead}}(x)$ the leading term in (4), that is, $RV_m$…

Experiments

This section presents a set of experiments that illustrate and validate the findings of the previous section, using synthetic data generated from various instantiations of the model of Eq. (1) under study. We then also demonstrate the workings of the identified methods, and their combinations, in experiments on real benchmark data.

Conclusions

We made a first investigation into the question of whether or not certain existing data analysis techniques remain suitable when faced with the counter-intuitive phenomenon of distance concentration in high dimensional data spaces. Our analysis was made under the structural assumption of one generating latent variable swamped by unstructured noise, and examined several dimensionality reduction scenarios that maximise the relative variance of the transformed data. We found that …

Acknowledgement

This work was supported by an MRC Discipline Hopping Award (Grant G0701858).

References (30)

  • R. Clarke et al., The properties of high-dimensional data spaces: implications for exploring gene and protein expression data, Nature Reviews Cancer (2008)

  • S. Dasgupta, Learning mixtures of Gaussians

  • S. Dasgupta et al., An elementary proof of the Johnson–Lindenstrauss lemma, Random Structures and Algorithms (2002)

  • M. Dettling, BagBoosting for tumor classification with gene expression data, Bioinformatics (2004)

  • P. Diaconis et al., Asymptotics of graphical projection pursuit, Annals of Statistics (1984)

    Ata Kaban is a lecturer in the School of Computer Science of the University of Birmingham. Her current interests concern statistical machine learning, high dimensional data analysis, probabilistic modelling of data, and Bayesian inference. She received her B.Sc. degree with honours (1999) in Computer Science from the University “Babes-Bolyai” of Cluj-Napoca, Romania, and the Ph.D. degree in Computer Science (2001) from the University of Paisley, UK. She has been a visiting researcher at Helsinki University of Technology (June–December 2000 and in the summer of 2003) and at HIIT BRU, University of Helsinki (September 2005). Prior to her career in Computer Science, she received the B.A. degree in musical composition (1994) and the M.A. (1995) and the Ph.D. (1999) degrees in musicology from the Music Academy “Gh. Dima” of Cluj-Napoca, Romania.
