Pattern Recognition

Volume 98, February 2020, 107052

Discriminant component analysis via distance correlation maximization

https://doi.org/10.1016/j.patcog.2019.107052

Highlights

  • We propose a dimensionality reduction technique based on distance correlation.

  • Our method maximizes the dependency between data samples and target variable.

  • Kernel version of our method is also derived for non-linear problems.

  • Our approach has a simple and closed-form solution.

  • Our approach is computationally efficient.

Abstract

In this study, a novel supervised dimensionality reduction technique is proposed. The dCor-based Dimensionality Reduction (dDR) technique is based on distance correlation, a powerful correlation measure that is applicable to random variables of arbitrary dimensions. By projecting the samples to a lower dimensional space, dDR maximizes the correlation between explanatory and response variables. The proposed dDR algorithm can be easily implemented and is computationally efficient. Moreover, it has a simple closed-form solution, which makes it effective in many different applications. In order to apply the proposed technique to non-linear problems, the kernel version of dDR is also derived. Extensive analyses and empirical experiments across various visualization, classification, and regression tasks indicate that our algorithm is the method of choice, as it offers statistically superior results in comparison with other state-of-the-art approaches in the literature.

Introduction

With the rapid pace of developments in technology and science and the huge amounts of data now available, the need for more robust and efficient learning algorithms has never been greater. Among these data, high dimensional data sets are a prevalent and inevitable issue.

In high dimensional feature spaces, conventional learning algorithms do not produce satisfactory results because of the curse of dimensionality: to maintain a given sample density, the number of required training instances and the complexity of the target function grow exponentially with the data dimension. The situation becomes even worse when the dimension of the data significantly exceeds the number of data points, or when the required training data are expensive or difficult to collect. For example, DNA micro-array data consist of thousands of gene expression features while the number of examples is relatively small.

Reducing the data dimensionality has become very popular over the years, and many effective methods have been proposed in the literature [15], [20], [38], [48]. Among these techniques, linear dimensionality reduction methods, which are very common in the literature, learn a low-dimensional subspace onto which the high dimensional data are projected.

Principal Component Analysis (PCA) [23], an unsupervised linear dimensionality reduction approach, is a useful statistical technique which has been used in many applications [53]. PCA models the data X ∈ ℝ^d as approximately lying in some low dimensional subspace. By modeling this subspace, PCA preserves as much variability as possible in the original data.
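As a point of reference, the minimal sketch below (ours, not from the paper) illustrates this idea with scikit-learn's PCA on synthetic data, projecting the samples onto the two directions of maximum variance:

```python
# Minimal PCA illustration (ours, not from the paper): project synthetic
# 50-dimensional samples onto the two directions of maximum variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples, 50 features

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                  # 200 x 2 low-dimensional representation
print(Z.shape, pca.explained_variance_ratio_)
```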

On the other hand, supervised learning techniques [5], [13], [36] try to predict the l-dimensional response variable Y ∈ ℝ^l from a given explanatory random variable X in order to improve the prediction accuracy of classification or regression tasks.

Reducing the dimensionality of the input data before training can lead to significant improvements in learning algorithms; dimensionality reduction methods are effective tools for overcoming the curse of dimensionality. The performance of supervised learning tasks can be further enhanced by taking the discriminative information in the response variable into account while projecting the data into a lower dimensional space. This can be achieved by projecting the data in the direction along which the dependency between explanatory and response variables is maximized. It is also desirable in dimensionality reduction to preserve the structural information of the data and to maximally associate the embedded data with the available side information (e.g., labels). However, most of these algorithms suffer from very high costs in time and memory, or they require solving a very complicated optimization problem [14], [41], [44].

In this paper, we propose a supervised dimensionality reduction technique called dCor-based Dimensionality Reduction (dDR). This technique finds a direction onto which to project the data such that the projection is highly related to the response variable Y. In other words, our technique projects the samples into a lower dimensional space while maximizing the dependency between the explanatory random variable X and the target variable Y.

The main contributions of this paper are as follows:

  1. Our proposed technique can be solved efficiently in closed form, and it does not suffer from high computational complexity or a complicated optimization problem. dDR not only improves the learning ability of learning algorithms, but also avoids the high computational overhead of state-of-the-art approaches. Our proposed method is based on distance correlation, which is more general and more powerful than the Pearson correlation coefficient.

  2. The characteristics of dDR allow us to derive a kernel version of the algorithm (KdDR), making it applicable to non-linear learning problems as well. Both dDR and KdDR reduce to a simple optimization problem that can be solved by eigenvalue decomposition.

  3. To demonstrate the effectiveness of our proposed technique, well-known and state-of-the-art dimensionality reduction methods are implemented and compared with our algorithm. Comprehensive analyses and experiments, including time complexity analyses, are conducted on a wide variety of synthetic, UCI [11], and high dimensional biological data sets for classification and regression problems. Our results indicate the effectiveness and efficiency of our approach in processing high dimensional and complex non-linear data structures.

The remainder of the paper is structured as follows: Section 2 reviews prevalent research progress on supervised dimensionality reduction in the literature. Section 3 describes the distance correlation measure (dCor) on which our method relies. Section 4 explains our method in more detail. Experimental settings and analyses of the results are discussed in Section 5. Section 6 concludes the paper and outlines our future work.

Section snippets

Related work

Several proposed approaches in the area of supervised dimensionality reduction are presented in this section.

Fisher Discriminant Analysis (FDA) [10] is an old yet popular approach in the literature. It maximizes the between-class scatter while minimizing the within-class scatter, and projects the data to a (c − 1)-dimensional space, where c is the number of classes. However, FDA has difficulties when the classes overlap. Moreover, it fails when the means of the classes are equal.
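For concreteness, a minimal sketch (ours, not from the paper) using scikit-learn's LinearDiscriminantAnalysis shows such a projection for a 3-class data set, where at most c − 1 = 2 discriminant directions exist:

```python
# Minimal Fisher discriminant analysis illustration (ours, not from the paper):
# with c = 3 classes, the data can be projected onto at most c - 1 = 2 directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                   # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)                         # 150 x 2 discriminant projection
print(Z.shape)
```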

To tackle FDA’s

dCor: distance correlation dependency measure

For the sake of clarity, this section explains the distance correlation measure in detail and shows that our optimization problem is derived from this measure.

Distance correlation, or dCor (R) for short, proposed by Székely et al. [45], is a method for testing multivariate independence between two random variables of arbitrary dimensions. It can be defined for all distributions with finite first moments. The dCor of two normal univariate random
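Although this snippet is truncated, the sample distance correlation itself is computed from double-centered pairwise distance matrices. The sketch below (ours, not the paper's code) follows the V-statistic form of Székely et al. and also illustrates, on hypothetical data, that dCor detects a non-linear, non-monotone dependency for which the Pearson correlation is near zero.

```python
# Sample distance correlation from double-centered distance matrices
# (V-statistic form of Szekely et al.; our sketch, not the paper's code).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcor(x, y):
    """Distance correlation between samples x of shape (n,) or (n, p) and y of shape (n,) or (n, q)."""
    A = squareform(pdist(np.reshape(x, (len(x), -1))))   # pairwise distances within x
    B = squareform(pdist(np.reshape(y, (len(y), -1))))   # pairwise distances within y
    # Double-center each distance matrix: subtract row and column means, add grand mean.
    A = A - A.mean(axis=0) - A.mean(axis=1, keepdims=True) + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1, keepdims=True) + B.mean()
    dcov2 = (A * B).mean()                                # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()       # squared distance variances
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1000)
y = x ** 2                                  # non-linear, non-monotone dependence
print(np.corrcoef(x, y)[0, 1])              # Pearson correlation: close to 0
print(dcor(x, y))                           # distance correlation: clearly positive
```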

Supervised dimensionality reduction

This section provides a detailed explanation of our proposed algorithms, which are based on distance correlation. Suppose we have N p-dimensional data samples {x_i, i = 1, …, N} stored in a p × N matrix X. Also, assume that Y is the l × N matrix of response variables. We are interested in finding the subspace U^T X in such a way that the dependency between the target variable Y and the projected data U^T X is maximized. U^T X is the representation of the data in the lower dimensional space (the projected data).

We
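The preview cuts off before the derivation. As orientation only, the sketch below (ours, not the authors' dDR solution) shows the general recipe on hypothetical data: a linear projection U is chosen to maximize a dependency criterion between U^T X and Y, here a simpler HSIC-style criterion with linear kernels rather than distance correlation, which reduces to the eigenvalue decomposition of a p × p matrix.

```python
# Hedged illustration (ours): a dependency-maximizing linear projection via an
# eigenvalue problem, using an HSIC-style criterion with linear kernels as a
# stand-in for the distance-correlation objective of dDR.
import numpy as np

def dependency_projection(X, Y, k):
    """X: p x N data matrix, Y: l x N responses, k: target dimension.
    Maximizes tr(U^T Xc Yc^T Yc Xc^T U) over orthonormal U (linear-kernel HSIC)."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center the data columns
    Yc = Y - Y.mean(axis=1, keepdims=True)          # center the responses
    C = Xc @ Yc.T                                   # p x l cross-covariance-like matrix
    M = C @ C.T                                     # p x p symmetric criterion matrix
    eigval, eigvec = np.linalg.eigh(M)
    U = eigvec[:, np.argsort(eigval)[::-1][:k]]     # top-k eigenvectors
    return U, U.T @ X                               # projection matrix and projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 300))                      # p = 20 features, N = 300 samples
Y = X[:2] + 0.1 * rng.normal(size=(2, 300))         # responses driven by the first 2 features
U, Z = dependency_projection(X, Y, k=2)
print(U.shape, Z.shape)                             # (20, 2) (2, 300)
```

dDR, as described in the abstract and contributions, replaces this simpler surrogate with a distance-correlation-based criterion while retaining a closed-form, eigendecomposition-based solution.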

Experimental settings

In this section, the performances of the compared techniques are assessed on a number of visualization, classification, and regression tasks. A diverse collection of 32 synthetic, UCI, and biological data sets is considered for the visualization and classification parts. Detailed information on these data sets is summarized in Table 1. The smallest UCI data set is Fertility with 100 instances and the largest one is Abalone with 4139 samples. Among biological data sets, the highest dimensional

Conclusion and future work

This paper proposes a new supervised linear dimensionality reduction technique based on distance correlation, a powerful correlation measure with high statistical power. dDR projects the data samples in the direction along which the dependency between explanatory and target variables is maximized. dDR can be solved in closed form and is computationally very efficient. Moreover, the kernelized version of dDR (KdDR) is derived in order to extend our method to

Acknowledgments

The authors would like to express their deepest gratitude to Mr. Farhad Abdi for his constructive advice and assistance in editing this paper.


References (54)

  • M. Clark, A Comparison of Correlation Measures, Center for Social Research (2013).
  • J. Demmel et al., Fast linear algebra is stable, Numer. Math. (2007).
  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006).
  • O.J. Dunn, Multiple comparisons among means, J. Am. Stat. Assoc. (1961).
  • R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. (1936).
  • A. Frank, A. Asuncion, UCI machine learning repository (2007).
  • M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Stat. Assoc. (1937).
  • K. Fukumizu et al., Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, J. Mach. Learn. Res. (2004).
  • K. Fukumizu et al., Gradient-based kernel dimension reduction for regression, J. Am. Stat. Assoc. (2014).
  • T. Jordan et al., End-to-end training of deep probabilistic CCA for joint modeling of paired biomedical observations, Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada (2018).
  • A. Gretton et al., Measuring statistical dependence with Hilbert-Schmidt norms, International Conference on Algorithmic Learning Theory (2005).
  • N. Halko et al., Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. (2011).
  • M. Harandi et al., Dimensionality reduction on SPD manifolds: the emergence of geometry-aware methods, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
  • H. Hotelling, Relations between two sets of variates, Biometrika (1936).
  • R.L. Iman et al., Approximations of the critical region of the Friedman statistic, Commun. Stat.-Theory Methods (1980).
  • I. Jolliffe, Principal Component Analysis (1986).
  • J. Josse et al., Measuring multivariate association and beyond, Statist. Surv. (2016).

Lida Abdi received her B.Sc. degree in Computer Engineering from Shiraz Payamnoor University, Iran, in 2010, and her M.Sc. degree in Artificial Intelligence from Shiraz University, Iran, in 2013. Her research interests include machine learning, dimensionality reduction, transfer learning, and kernel learning.

Ali Ghodsi received his B.S. degree in Computer Engineering from Shiraz University, Shiraz, Iran, in 1992 and his Ph.D. degree in Computer Science from the University of Waterloo, Waterloo, Canada, in 2005. Currently, he is a professor at the University of Waterloo. His general research interests are in the areas of machine learning, dimensionality reduction, and visualization.
