Elsevier

Analytica Chimica Acta

Volume 648, Issue 1, 19 August 2009, Pages 52-59
Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data: Part 2. Variable reduction

https://doi.org/10.1016/j.aca.2009.06.035

Abstract

This paper proposes a new method for selecting the subset of variables that best reproduces the main structural features of the complete data set. The method is useful for the pre-treatment of large data sets, since it allows variables carrying redundant information to be discarded. Reducing the number of variables often makes it easier to investigate the data structure and yields more stable results from multivariate modelling methods.

The novel method is based on the recently proposed canonical measure of correlation (CMC index) between two sets of variables [R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications, Anal. Chim. Acta, submitted for publication (2009)]. Following a stepwise backward-elimination procedure, each variable in turn is compared with all the other variables and the most correlated one is definitively discarded; the result is a key subset of variables that is as orthogonal as possible. The performance was evaluated on both simulated and real data sets. The effectiveness of the novel method is discussed by comparison with the results of other well-known methods for variable reduction, such as the Jolliffe techniques, the McCabe criteria, the Krzanowski approach and its modification based on genetic algorithms, the loadings of the first principal component, Key Set Factor Analysis (KSFA), the Variance Inflation Factor (VIF), the pairwise correlation approach, and K correlation analysis (KIF). The obtained results are consistent with those of the other considered methods; moreover, the proposed CMC method has the advantage that its calculation is very quick and can easily be implemented in any software application.

Introduction

In multivariate data analysis, a key problem is correlation within the data, which strongly influences the results of several statistical and chemometric methods, such as regression analysis, the search for optimal informative subsets of objects in experimental design, and similarity/diversity analysis. In these methods it is essential to estimate how much of the data variability is related to systematic, potentially useful information rather than to random noise and chance correlation. Random noise and chance correlation are always potentially present, making models unstable and unreliable and introducing undesired bias in data exploration. In multivariate regression analysis especially, a strong correlation structure in the predictor variables leads to well-known troubles such as model instability, overfitting, and errors in the regression coefficients.

The problem of data correlation is relevant in chemoinformatics methods, which produce thousands of variables, such as the interaction energy fields obtained from CoMFA-like approaches [2], the molecular fingerprints used to screen libraries of molecules [3], and the large collection of topological indices derived from graph-theoretical matrices. In analytical chemistry as well, high-dimensional spectral data are affected by correlation and hence need specific pre-treatment before analysis. This is the case, for instance, for multivariate data generated by coupled chromatographic methods such as liquid chromatography with diode array detection (LC–DAD), gas chromatography–mass spectrometry (GC–MS), and liquid chromatography coupled with nuclear magnetic resonance (LC–NMR), which are increasingly common analytical techniques.

Variable reduction is the procedure of selecting a subset of variables that preserves as much of the original information as possible while eliminating redundancy, noise, and linearly or near-linearly dependent variables. When the variables to be used in a model are chosen on the basis of general principles rather than with a specific goal in mind (i.e. some experimental property to be modelled), the term variable reduction is more appropriate than variable selection. In variable reduction techniques, variables are chosen by comparing the variables among themselves, regardless of the specific property to be modelled. For instance, in QSAR studies, molecular descriptors can be selected on the basis of their information content: descriptors with high information content discriminate better among different molecules and are therefore expected to be more effective in modelling any molecular property.

In this paper, some methods for variable reduction are reviewed and compared with a new method: a backward elimination procedure based on the recently proposed canonical measure of correlation between two sets of variables [1]. The first part of the paper introduces some well-known methods for variable reduction and the new procedure with its theoretical foundations. The performance of the proposed method is then evaluated on both simulated and real data sets, and its effectiveness is discussed by comparison with the results of the other variable reduction methods.

Methods for variable reduction

A preliminary approach to variable reduction is the elimination of all constant variables, i.e. those taking the same value for all the objects in the data set. Near-constant variables, which assume the same value except for one or very few objects, are usually excluded as well. A good measure for detecting near-constant variables is the standardized Shannon entropy [4]: in effect, the entropy of a variable with one differing value over 10 objects is 0.141, over 20 objects it is 0.066, with two …
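The near-constant screening above can be sketched in a few lines. This is a minimal illustration, assuming the standardized Shannon entropy of [4] is the usual Shannon entropy divided by its maximum value log2(n); the entropies it prints match the 0.141 and 0.066 figures quoted in the text.

```python
import math
from collections import Counter

def standardized_entropy(values):
    """Standardized Shannon entropy H / log2(n), in [0, 1].

    0 for a constant variable; values close to 0 flag near-constant
    variables that are candidates for elimination.
    """
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    return h / math.log2(n)

# One differing value over 10 objects, then over 20 objects.
print(round(standardized_entropy([0] * 9 + [1]), 3))    # 0.141
print(round(standardized_entropy([0] * 19 + [1]), 3))   # 0.066
```

In practice one would compute this index for every column and discard those below a chosen threshold.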

Canonical measures of distance and correlation

Let A and B be two different data sets comprising the same samples but described by two different sets of variables. The simplest way to measure the distance between these two data sets disregards the actual variable values and simply counts the number of differing variables in the two sets, that is, the squared Hamming distance:

d_H^2 = b + c

where b is the number of variables present in A but not in B, and c is the number of variables present in B but not in A. The Hamming distance usually has …
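Since this distance depends only on which variables appear in each set, it reduces to the size of the symmetric difference of the two variable sets. A minimal sketch (the variable names here are illustrative):

```python
def squared_hamming(A, B):
    """Squared Hamming distance between two sets of variables:
    d_H^2 = b + c, with b = |A \\ B| and c = |B \\ A|."""
    A, B = set(A), set(B)
    return len(A - B) + len(B - A)

# b = |{x1}| = 1, c = |{x4, x5}| = 2, so d_H^2 = 3.
print(squared_hamming({"x1", "x2", "x3"}, {"x2", "x3", "x4", "x5"}))  # 3
```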

Software

Calculations were performed in MATLAB 6.5 (The MathWorks) with routines written by the authors. Multidimensional scaling was computed with the MATLAB Statistics Toolbox.

Results and discussion

Variable reduction is gaining increasing importance in data mining because in many scientific fields it is common to deal with data sets comprising a huge number of variables. The aim of variable reduction is to reproduce, by means of a subset of the original variables, as much information of the original data as possible. The CMC method proposed here is based on the Canonical Measure of Correlation (CMC index) between two sets of variables, which in this specific case reduces to …
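The backward elimination described in the abstract can be sketched as follows. This is an illustrative surrogate, not the authors' implementation: the multiple-correlation R² of each variable regressed on the remaining ones stands in for the one-variable-versus-set CMC index, whose exact form is given in Part 1 [1]. At each step the variable best explained by the others (the most redundant one) is definitively discarded, until the desired subset size is reached.

```python
import numpy as np

def backward_reduce(X, names, k):
    """Backward elimination down to k variables.

    At each step, every remaining column of X is regressed on all the
    other columns; the column with the highest R^2 (i.e. the one most
    redundant given the rest) is discarded.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > k:
        r2 = []
        for j in range(X.shape[1]):
            y = X[:, j]
            Z = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ coef
            r2.append(1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum())
        drop = int(np.argmax(r2))       # most redundant variable
        X = np.delete(X, drop, axis=1)
        del names[drop]
    return names

# Toy example: x2 is (almost) a multiple of x1, so one of the pair is dropped.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + 0.01 * rng.normal(size=50)
x3 = rng.normal(size=50)
x4 = rng.normal(size=50)
kept = backward_reduce(np.column_stack([x1, x2, x3, x4]),
                       ["x1", "x2", "x3", "x4"], k=3)
print(kept)  # three names, always including x3 and x4
```

The final subset is as orthogonal as possible in the sense that no retained variable is well predicted by the others.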

Conclusions

In several scientific fields, such as chemoinformatics and analytical chemistry, large data sets are often produced that need specific preprocessing before analysis in order to reduce noise and data correlation. Noise is always present in experimental data, and if it dominates, the results of any multivariate data analysis will be strongly affected and, consequently, difficult to interpret. Data correlation likewise greatly influences the analysis results, making …

Acknowledgements

This study was financed by PRIN 2007 funds (National Ministry of University and Research and University of Milano-Bicocca, code 2007R57KT7). The members of the International Academy of Mathematical Chemistry are warmly acknowledged for constructive discussions during the 4th IAMC meeting (2008).


Cited by (6)

  • A new hybrid filter/wrapper algorithm for feature selection in classification

    2019, Analytica Chimica Acta
    Citation excerpt:

    Some popular independent criteria are distance measure [22,23], dependency measure [24] and information measure [25]. For example, Consonni et al. utilized correlation coefficients to estimate the dependence between candidate features and categories for measuring the importance of features [24]. Dependent criteria are used in the wrapper model and require a predetermined classifier for feature selection.

  • Multicriteria selection of uncorrelated variables for modeling

    2016, Chemometrics and Intelligent Laboratory Systems
    Citation excerpt:

    The UFS method thus seeks to select a subset of variables close to orthogonality. The CMC method (Canonical Measure of Correlation) [19,20] measures correlation between sets of variables and is used to select the set that best reproduces the main characteristics of the full dataset. This method can be used in a step-by-step procedure where each variable is compared in turn with the set of variables not containing the most correlated variable.

  • A novel variable reduction method adapted from space-filling designs

    2014, Chemometrics and Intelligent Laboratory Systems
    Citation excerpt:

    In particular, KS and DBOD selected a common subset of 4 variables (5, 11, 17, 18), as well as UFS and Pairwise correlation (5, 11, 17, 19), while KIF included variables 5, 11, 18, and 19. In previous analyses of the Aphid data, four or five variables were supposed to be necessary in order to account for as much information as that of the original set of variables [2]. In particular, variables 5, 11 and 17 resulted on average not much correlated with all the other variables and thus they were retained in the reduced sets of variables by several algorithms.

  • Reproducibility, complementary measure of predictability for robustness improvement of multivariate calibration models via variable selections

    2012, Analytica Chimica Acta
    Citation excerpt:

    In the NIR transmittance spectra of tablet (Fig. 2a), there were several broad peaks located at around 10,000, 8830, 8200 and 7840 cm−1, which originated from several components in the corresponding drug tablet, such as active substance, coating material and cellulose. The direct comparison with the NIR spectra of the pure active substance indicated that the small band at 8830 cm−1 (second overtone of the aromatic CH stretch band) and below 7500 cm−1 strongly reflected the chemical composition of the active substance [33]. In contrast, other spectral bands for active substance at 8200 cm−1 were severely overlapped with broad absorption peak originated from the microcrystalline cellulose.

  • Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 3. Variable selection in classification

    2010, Analytica Chimica Acta
    Citation excerpt:

    The ijth element of the data matrix X is denoted as xij and represents the value of the jth variable for the ith sample. The CMC index was early applied for variable reduction purposes in a previous paper [10]; a backward iterative procedure was proposed, which aimed at selecting a subset of variables that reproduce as much information of the original data as possible. In the present paper, the CMC index is applied to select the optimal subset of variables in classification tasks, that is, to obtain more predictive and parsimonious classification models.

  • Multivariate Analysis of Molecular Descriptors

    2012, Statistical Modelling of Molecular Descriptors in QSAR/QSPR