On the distance concentration awareness of certain data reduction techniques
Introduction
Creating univariate projections of high dimensional data in directions that satisfy some specified criteria is often the method of choice in high dimensional data analysis. For instance, Fisher's linear discriminant analysis (FLDA) provides the direction that optimally preserves the class structure. In the unsupervised setting, projection pursuit seeks projections that are maximally non-Gaussian [10], e.g. having maximal kurtosis.
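As a concrete illustration of such a projection index, the (excess) kurtosis of a one-dimensional projection can be computed as follows. This is a minimal numerical sketch in Python with numpy; the function name and data are illustrative, not taken from the cited literature:

```python
import numpy as np

def kurtosis_index(z):
    """Projection pursuit index: absolute excess kurtosis of a 1-D sample.

    Gaussian data score near 0; heavy-tailed (or strongly bimodal) data
    score higher, flagging a 'non-Gaussian', hence interesting, projection.
    """
    z = (z - z.mean()) / z.std()
    return abs((z ** 4).mean() - 3.0)

rng = np.random.default_rng(5)
n = 5000
gaussian = rng.standard_normal(n)         # uninteresting projection
non_gaussian = rng.laplace(size=n)        # heavy-tailed: excess kurtosis 3

print(kurtosis_index(gaussian), kurtosis_index(non_gaussian))
```

A projection pursuit procedure would maximise such an index over candidate directions rather than evaluate it on fixed samples, but the scoring step is the same.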
Owing to the increasingly high dimensionality of data sets in a number of areas, most notably in cancer research, a serious concern has been raised recently [6] that questions the suitability of existing data analysis techniques when faced with the properties of high dimensional data spaces. In particular, in this study we are concerned with a specific aspect of the dimensionality curse, known as the phenomenon of distance concentration—that is, as the data dimensionality increases, all pairwise distances between points may become too similar to each other in certain cases [3], [6], [15]. Contrary to other problems of high dimensionality, like data sparseness and computational overhead, the phenomenon of distance concentration is rather counter-intuitive, and hence its effects are much less obvious.
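The phenomenon is easy to reproduce numerically. The sketch below (Python with numpy; illustrative only) draws i.i.d. Gaussian samples of increasing dimensionality and reports the relative spread (max − min)/min of the pairwise Euclidean distances, which shrinks towards zero as the dimensionality grows:

```python
import numpy as np

def relative_spread(X):
    """(max - min) / min over all pairwise Euclidean distances in X."""
    G = X @ X.T                                   # Gram matrix
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G      # squared pairwise distances
    i, j = np.triu_indices(len(X), k=1)
    d = np.sqrt(np.maximum(D2[i, j], 0.0))        # clamp tiny negatives
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
n = 100  # number of points
spreads = {m: relative_spread(rng.standard_normal((n, m)))
           for m in (2, 100, 10_000)}
print(spreads)  # the spread shrinks as the dimensionality m grows
```

With unstructured (i.i.d.) features the nearest and furthest neighbours thus become nearly indistinguishable, which is exactly the concern raised for distance-based methods.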
Pattern recognition has been the basis for a series of generic methodologies with great potential in a wide range of domains, such as face recognition, spam filtering and multimedia applications. It has been key to computer-aided diagnosis systems that support the human expert's interpretations and findings. Data are becoming increasingly high dimensional in all such application areas. Yet the effects of distance concentration have, so far, not been examined in the context of pattern recognition methodologies. As pointed out in [6], existing tools were not designed with an awareness of phenomena that only occur in high dimensions. To what extent is their use appropriate in such high dimensional problems? Since the phenomenon of distance concentration has now been completely characterised [3], [11], [20], we may now be in a position to address this issue.
Here we proceed in a systematic, model-driven manner. We examine some commonly used linear data reduction techniques specifically from the point of view of their awareness of the distance concentration problem, by looking at their effect on the relative variance of the pairwise inter-point distances under the generic model considered. The sequence of relative variances (indexed by data dimensionality) was previously shown to be the key to describing the phenomenon of concentration of pairwise distances between points drawn from an arbitrary data distribution [3], [11], [20]. In addition, it has been shown [11] that this is governed by the (lack of) correlation structure among the data features relative to their noise content. Since pattern recognition would be impossible if the data had no structure, while a noise content is also an inevitable reality, we find it most useful for our analysis to consider a data model that captures both of these characteristics.
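The quantity in question can be estimated directly from a sample. The sketch below (Python with numpy) contrasts i.i.d. features, whose relative variance of pairwise distances vanishes with dimensionality, against features driven by a single shared latent variable, whose relative variance does not; the one-latent-variable generator used here is our own illustrative stand-in and not necessarily the paper's Eq. (1):

```python
import numpy as np

def relative_variance(X):
    """Sample relative variance Var(d) / E[d]^2 of pairwise Euclidean distances."""
    i, j = np.triu_indices(len(X), k=1)
    d = np.linalg.norm(X[i] - X[j], axis=1)
    return d.var() / d.mean() ** 2

rng = np.random.default_rng(1)
n, m = 100, 1000

iid = rng.standard_normal((n, m))            # no structure: distances concentrate
latent = rng.standard_normal((n, 1))         # one latent variable shared by all features
structured = latent * np.ones(m) + 0.1 * rng.standard_normal((n, m))

print(relative_variance(iid), relative_variance(structured))
```

The correlated (structured) sample retains a relative variance bounded away from zero, in line with the cited characterisation [11].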
We ask the following questions:
- What is the direction of maximal inter-distance relative variance?
- Which of the data features have the maximal inter-distance relative variance?
- What statistical property should a latent variable have, so that its high dimensional expansion has maximal relative variance?
It turns out that, in these settings, the answers to the first two questions recover Fisher's discriminant analysis, Fisher's discriminant ratio, and the recent technique of sure independence screening. The answer to the last question provides a new justification of the kurtosis index, frequently employed in projection pursuit for finding ‘interesting structure’ in the data.
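For concreteness, the two per-feature criteria just mentioned can be sketched as follows (Python with numpy; the function names and the synthetic data are illustrative, not taken from the paper):

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio: (mu1 - mu0)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(0) - X0.mean(0)) ** 2 / (X0.var(0) + X1.var(0))

def sis_scores(X, y):
    """Sure independence screening: rank features by |marginal correlation with y|."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

rng = np.random.default_rng(2)
n, m = 200, 1000
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, m))
X[:, 0] += 3.0 * y          # only feature 0 carries the class structure

# both criteria single out the informative feature
print(int(np.argmax(fisher_ratio(X, y))), int(np.argmax(sis_scores(X, y))))
```

In a screening application one would keep the top-k features under either score before fitting a classifier.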
Section snippets
Distance concentration
Let F_m, m = 1, 2, … be an infinite sequence of data distributions, and let X_m^(1), …, X_m^(n) be a random sample of n independent data vectors distributed as F_m. For each m, let d_m(·) be a function that takes a point from the domain of F_m and returns a positive real value. Further, p > 0 will denote an arbitrary positive constant, and it is assumed that E[d_m(X_m)^p] and Var[d_m(X_m)^p] are finite and E[d_m(X_m)^p] ≠ 0.

Theorem 2.1 (Beyer et al. [3])

If lim_{m→∞} Var[d_m(X_m)^p] / E[d_m(X_m)^p]^2 = 0, then for every ε > 0, lim_{m→∞} P[DMAX_m ≤ (1 + ε) DMIN_m] = 1, where DMIN_m = min{d_m(X_m^(i)) : 1 ≤ i ≤ n} and DMAX_m = max{d_m(X_m^(i)) : 1 ≤ i ≤ n}.
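Both the premise and the conclusion of the theorem can be checked numerically, here for the Euclidean norm of i.i.d. Gaussian data with p = 1 (an illustrative sketch with numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50  # sample size
stats = {}
for m in (10, 100, 10_000):
    X = rng.standard_normal((n, m))
    d = np.linalg.norm(X, axis=1)           # d_m(x): Euclidean norm, p = 1
    stats[m] = (d.var() / d.mean() ** 2,    # premise: relative variance -> 0
                d.max() / d.min())          # conclusion: DMAX/DMIN -> 1

for m, (rel_var, ratio) in stats.items():
    print(m, rel_var, ratio)
```

As the dimensionality m grows, the relative variance vanishes and, as the theorem predicts, the largest and smallest norms in the sample become indistinguishable in relative terms.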
A re-interpretation of Fisher's linear discriminant analysis
Consider the case when the latent variable takes only two values. Then, model (1) is a two-class mixture density with shared covariance.
We seek a linear projection of the high dimensional data down to 1D, such that the relative variance of the projected data—w.r.t. the Euclidean distance and the above mentioned data model—is maximised. In other words, how can we down-project the data so that it is least concentrated for a subsequent two-class mixture-based classification?
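Under a two-Gaussian shared-covariance instance of such a model (our own illustrative instantiation, assuming well-separated classes), the Fisher direction w = Σ⁻¹(μ₁ − μ₀) can be compared against a structure-free coordinate direction in terms of the relative variance of the projected pairwise distances:

```python
import numpy as np

def relative_variance_1d(z):
    """Var(d) / E[d]^2 over pairwise absolute differences of a 1-D projection."""
    i, j = np.triu_indices(len(z), k=1)
    d = np.abs(z[i] - z[j])
    return d.var() / d.mean() ** 2

rng = np.random.default_rng(4)
n, m = 400, 50
y = np.repeat([0, 1], n // 2)                 # two balanced classes
X = rng.standard_normal((n, m))               # shared (identity) covariance
X[:, 0] += 10.0 * y                           # well-separated class means

# Fisher direction: w = Sigma^{-1} (mu_1 - mu_0), Sigma pooled within-class
centred = np.vstack([X[y == 0] - X[y == 0].mean(0),
                     X[y == 1] - X[y == 1].mean(0)])
w = np.linalg.solve(np.cov(centred.T), X[y == 1].mean(0) - X[y == 0].mean(0))

noise_direction = np.eye(m)[1]                # a direction with no class structure
print(relative_variance_1d(X @ w), relative_variance_1d(X @ noise_direction))
```

The class-revealing projection keeps the pairwise distances noticeably less concentrated than the pure-noise one, which is the sense in which this down-projection is "aware" of the concentration problem.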
Denote the leading term in (4), that is,
Experiments
This section presents a set of experiments that illustrate and validate the findings of the previous section, using synthetic data generated from various instantiations of the model Eq. (1) under study. We then demonstrate the workings of the identified methods, and their combinations, in experiments on real benchmark data.
Conclusions
We made a first investigation into the question of whether or not certain existing data analysis techniques would still be suitable when faced with the counter-intuitive phenomenon of distance concentration in high dimensional data spaces. Our analysis was made under the structural assumption of one generating latent variable swamped by unstructured noise, and examined several dimensionality reduction scenarios that would maximise the relative variance of the transformed data. We found that
Acknowledgement
This work was supported by an MRC Discipline Hopping Award (Grant G0701858).
References (30)
- Problems and results in extremal combinatorics, Part I, Discrete Mathematics (2003)
- When is ‘nearest neighbour’ meaningful: a converse theorem and implications, Journal of Complexity (2009)
- New instability results for high-dimensional nearest-neighbor search, Information Processing Letters (2009)
- Independent component analysis by general non-linear Hebbian-like learning rules, Signal Processing (1998)
- On the geometry of similarity search: dimensionality curse and concentration of measure, Information Processing Letters (2000)
- Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences of the United States of America (1999)
- When is nearest neighbor meaningful?
- E. Candès, J. Romberg, ℓ1-magic: recovery of sparse signals via convex programming, ...
- Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation, Bioinformatics (2006)
- The properties of high-dimensional data spaces: implications for exploring gene and protein expression data, Nature Reviews Cancer
- Learning mixtures of Gaussians
- An elementary proof of the Johnson–Lindenstrauss lemma, Random Structures and Algorithms
- BagBoosting for tumor classification with gene expression data, Bioinformatics
- Asymptotics of graphical projection pursuit, Annals of Statistics
Cited by (12)
- Do all roads lead to Rome? Studying distance measures in the context of machine learning, 2023, Pattern Recognition
- Good and bad neighborhood approximations for outlier detection ensembles, 2017, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
- Reward Motivation Enhances Task Coding in Frontoparietal Cortex, 2016, Cerebral Cortex
- Clustering evaluation in high-dimensional data, 2016, Unsupervised Learning Algorithms
Ata Kaban is a lecturer in the School of Computer Science of the University of Birmingham. Her current interests concern statistical machine learning, high dimensional data analysis, probabilistic modelling of data, and Bayesian inference. She received her B.Sc. degree with honours (1999) in Computer Science from the University “Babes-Bolyai” of Cluj-Napoca, Romania, and the Ph.D. degree in Computer Science (2001) from the University of Paisley, UK. She has been a visiting researcher at Helsinki University of Technology (June–December 2000 and in the summer of 2003) and at HIIT BRU, University of Helsinki (September 2005). Prior to her career in Computer Science, she received the B.A. degree in musical composition (1994) and the M.A. (1995) and the Ph.D. (1999) degrees in musicology from the Music Academy “Gh. Dima” of Cluj-Napoca, Romania.