Determining the number of components in principal components analysis: A comparison of statistical, crossvalidation and approximated methods

https://doi.org/10.1016/j.chemolab.2015.10.006

Highlights

  • We compare different methods for dimensionality assessment in PCA.

  • Cross-validation, approximated, and statistical (random matrix theory) methods are considered.

  • Methods are compared on simulated and real data.

  • Differential behavior among methods is observed and commented upon.

  • Guidelines for practitioners are offered.

Abstract

Principal component analysis is one of the most commonly used multivariate tools to describe and summarize data. Determining the optimal number of components in a principal component model is a fundamental problem in many fields of application. In this paper, we compare the performance of several methods developed for this task in different areas of research. We consider statistical methods based on results from random matrix theory (the Tracy–Widom and Kritchman–Nadler testing procedures), cross-validation methods (namely the well-characterized element-wise k-fold algorithm, ekf, and its corrected version cekf) and methods based on numerical approximation (SACV and GCV). The performance of these methods is assessed on both simulated and real-life data sets. In both cases, differential behavior of the considered methods is observed, for which we propose theoretical explanations.

Introduction

Multivariate statistical models are widely used in many fields of research to handle data sets with a very large number of variables and (possibly) of observations. Principal component analysis (PCA) [1], [2] is one of the most commonly used multivariate tools to describe and summarize large omics data sets by finding the subspace in the space of the original variables where the data vary most [3]. In PCA the possibly correlated original variables are converted into a set of linearly uncorrelated variables, called principal components, whose number is less than or equal to the number of original variables.

The PCA model follows the expression

X = T_K P_K^T + E_K

where X is an n × p data matrix, T_K is the n × K scores matrix containing the projection of the observations onto the K-dimensional space defined by the first K principal components, P_K is the p × K matrix of the loadings, containing the linear combinations of the original variables represented by each principal component, and E_K is the n × p matrix of the residuals.
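
As a minimal illustration (assuming Python with numpy; the matrix sizes are arbitrary toy choices), the decomposition above can be obtained from the singular value decomposition of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 10, 3
X = rng.standard_normal((n, p))        # toy n x p data matrix
Xc = X - X.mean(axis=0)                # PCA is typically applied to centered data

# the SVD of Xc yields the PCA solution: scores T_K = U_K S_K, loadings P_K = V_K
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T_K = U[:, :K] * s[:K]                 # n x K scores matrix
P_K = Vt[:K].T                         # p x K loadings matrix
E_K = Xc - T_K @ P_K.T                 # n x p residual matrix

# the model X = T_K P_K^T + E_K then holds by construction
assert np.allclose(Xc, T_K @ P_K.T + E_K)
```

Truncating the SVD at K components gives the best rank-K approximation of the centered data in the least-squares sense, which is why choosing K amounts to deciding where signal ends and residual noise begins.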

Determining the optimal number of components K that best fits the data is a fundamental task in multivariate analysis and, as noted by several authors [4], [5], it is an ill-posed problem when formulated without specifying for which purpose PCA is used. Generally speaking, one can refer to the optimal number of components implying that the model describes systematic variation in the data but not the noise [4], but this can differ depending on whether the application is, for instance, in process monitoring or data compression. Camacho and Ferrer [5] recently proposed a taxonomy for PCA applications, depending on where the interest is focused: 1) the (accurate approximation of the) observed variables, as in data compression or dimensionality reduction, 2) the understanding and interpretation of latent variables, and 3) the distribution in latent variables and residuals. In this paper we place ourselves in the situation described in 1): the interest lies in using PCA to extract information embedded in a high-dimensional space and describe it with a limited number of components, a problem typical of modern functional genomics, econometrics, signal theory and image processing.

A great deal of attention has been dedicated to this problem and a plethora of methods has been proposed, mostly by the chemometrics, psychometrics and statistics communities (for a review see for instance [6] and references therein).

Jolliffe [7] and Jackson [8] outlined a taxonomy of the criteria proposed to find the optimum number of components in PCA, distinguishing three broad categories:

  1. Ad-hoc rules, like Cattell's scree test [9], the indicator function or the embedded error [10].

  2. Statistical tests, like Bartlett's test for the first component [11], the sphericity test [12] or Malinowski's F-test [13].

  3. Computational criteria, like cross-validation (CV), and bootstrapping and permutation methods like Horn's parallel analysis [14] or the SVD-based methods proposed by Dray [15].

The array of available methods for dimensionality assessment is constantly increasing. For instance, the CHull approach [16] for model selection can be added to the first category: it detects a model with an optimal balance between a (large) model fit and a (low) number of parameters, and can be applied to indicate the number of principal components [17]. Similarly, Josse and Husson [18] proposed new methods based on the numerical approximation of the CV procedure that add to the third category.
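
To give a concrete flavor of the third category, the following is a rough Horn-style parallel analysis sketch (assuming numpy; the null-simulation count, quantile and toy data are illustrative choices, not the implementation compared in this paper):

```python
import numpy as np

def parallel_analysis(X, n_sim=200, q=0.95, seed=0):
    """Horn-style parallel analysis (illustrative sketch): retain the leading
    eigenvalues of the sample correlation matrix that exceed the q-quantile
    of eigenvalues obtained from uncorrelated Gaussian data of the same size."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    null_eig = np.empty((n_sim, p))
    for b in range(n_sim):
        Z = rng.standard_normal((n, p))
        null_eig[b] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    thresh = np.quantile(null_eig, q, axis=0)
    # count leading eigenvalues above the simulated null threshold
    K = 0
    while K < p and eig[K] > thresh[K]:
        K += 1
    return K

# toy data: two latent factors, four variables loading on each
rng = np.random.default_rng(1)
f = rng.standard_normal((300, 2)) * 2.0
X = np.repeat(f, 4, axis=1) + rng.standard_normal((300, 8))
print(parallel_analysis(X))
```

On block-structured toy data like this, the two planted factors are recovered because their correlation-matrix eigenvalues sit far above the null bulk.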

The idea of comparing methods for determining the number of principal components is certainly not new, and several studies presented comparative investigations [8], [19], [20], [21], [22], [23]. However, these comparative studies did not consider recently developed statistical tools based on results from random matrix theory (RMT). Moreover, the performance of the latter has never been compared with state-of-the-art implementations of the cross-validation procedure or numerical approximation. For this reason it seems timely to review and perform an in-depth assessment of cross-validation, approximated and statistical methods through a large comparative study.

For this task we made use of 5 simulation schemes, corresponding to more than 12,000 different simulated data sets accounting for different data structures, data distributions and homo- and heteroscedastic noise. In addition, we made use of 8 real-life chemometrics data sets (mostly NIR spectroscopy data, including some well-known benchmark data sets) to investigate the behavior of the methods on experimental data. As the problem of determining the number of components is not limited to chemometrics, we additionally considered 12 data sets stemming from disciplines where chemometrics tools are routinely applied to model and extract information, such as metabolomics (5 data sets), proteomics (1 data set) and other fields (functional genomics, computational linguistics, etc.; 4 data sets).

The paper is organized as follows. Section 2 offers a brief overview of past work related to the problem of dimensionality assessment in PCA; Section 3 is dedicated to the illustration of methods based on random matrix theory, cross-validation and approximation of the cross-validation for determining the number of components in PCA. To make the paper self-contained, we provide the theoretical background on which to base the discussion and interpretation of the results. Section 4 describes the data sets used for the comparison of the different methods and Section 5 is dedicated to the software used. Section 6 offers a discussion of the results. We end with some final considerations in Section 7, where we also suggest some guidelines for practitioners.

Section snippets

Related work

Until recently, the statistical tools to attack the problem of determining the number of components in PCA consisted mainly of methods developed in the field of psychometrics (like Bartlett's test for the first component [11], the sphericity test [12], and the Kaiser–Guttman eigenvalue-greater-than-one rule [24]) or chemometrics (like Malinowski's F-test [13] and the Faber–Kowalski test [25]). All these methods suffer from the drawback of being of limited applicability, either because restricted to the first component,

Methods for determining the number of principal components

In this Section, the methods under comparison are introduced. The methods were proposed in different research areas; as a result, their formulations of the problem of selecting the number of components, including the assumed model of noise, vary. The input of the methods reflects these differences: some methods take the data matrix as input, while others operate on the eigenvalues. Here, we propose a taxonomy based on the class of input, which in turn is determined by the model of

Data sets for comparison of the methods

To compare and investigate the performance of the three approaches for determining the optimal number of components illustrated above, we will make use of simulated data matrices, for which the dimensionality is known a priori, and of real data from different fields.
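
A minimal sketch of how such a matrix with a priori known dimensionality can be generated (assuming numpy; the sizes, spike strengths and unit noise variance are arbitrary illustrative choices, not the paper's simulation parameters):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, K = 100, 30, 3                    # toy sizes and true dimensionality

# low-rank signal plus i.i.d. unit-variance Gaussian noise
T = rng.standard_normal((n, K)) * np.array([8.0, 6.0, 4.0])   # scores
P, _ = np.linalg.qr(rng.standard_normal((p, K)))              # orthonormal loadings
X = T @ P.T + rng.standard_normal((n, p))

# with strong spikes, the K signal eigenvalues stand out above the noise bulk
eig = np.linalg.svd(X - X.mean(axis=0), compute_uv=False) ** 2 / (n - 1)
print(eig[:5].round(1))
```

Because the true rank K is fixed by construction, each method's estimate can be scored directly against it, which is exactly what makes simulated data useful for this comparison.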

Software

The Tracy–Widom test has been implemented using an in-house Matlab routine.

The Kritchman–Nadler test was performed using the Matlab routine provided by the authors, available at: www.wisdom.weizmann.ac.il/~nadler/Rank_Estimation/rank_estimation.html. All tests were performed at a nominal 0.01 significance threshold.

The Multivariate Exploratory Data Analysis Toolbox for Matlab (MEDA toolbox) was used to perform cross-validation [51]. The toolbox is available at: https://github.com/josecamachop/MEDA-Toolbox/archive/v1.0.zip

Discussion of the results of Simulation A

The results of Simulation A are summarized in Table 1. The data sets used in this simulation are generated using the RMT spiked model (Eqs. (10), (16)), under which both Johnstone's (TW) and the Kritchman–Nadler (KN) approaches have been developed. For these data we found the two RMT methods (KN and TW), the approximated GCV criterion and the cekf cross-validation to perform well, while the approximated SACV criterion and the ekf cross-validation performed poorly.
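
The sequential flavor of these RMT testing procedures can be caricatured as follows (a much-simplified sketch assuming numpy; the crude noise estimate and the hard-coded TW1 quantile are rough illustrative choices, not the authors' implementations):

```python
import numpy as np

TW1_Q99 = 2.0234   # approximate 0.99 quantile of the Tracy-Widom TW1 distribution

def tw_rank(X):
    """Much-simplified sequential largest-eigenvalue test in the spirit of the
    Johnstone / Kritchman-Nadler procedures (illustrative only): at step k the
    trailing eigenvalues give a crude noise-variance estimate, and the k-th
    eigenvalue is compared against a Tracy-Widom threshold."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    ell = np.sort(np.linalg.svd(Xc, compute_uv=False) ** 2)[::-1]  # eigenvalues of Xc.T @ Xc
    # centering and scaling constants for the largest eigenvalue of white noise
    mu = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)
    K = 0
    for k in range(len(ell) - 1):
        noise_var = ell[k:].mean() / n          # crude noise-variance estimate
        if (ell[k] / noise_var - mu) / sigma > TW1_Q99:
            K += 1
        else:
            break
    return K

# toy spiked data: two strong components in unit-variance noise
rng = np.random.default_rng(3)
scores = rng.standard_normal((200, 2)) * np.array([6.0, 4.0])
loads, _ = np.linalg.qr(rng.standard_normal((40, 2)))
X = scores @ loads.T + rng.standard_normal((200, 40))
print(tw_rank(X))
```

With spikes this far above the noise bulk, the sequential test stops right after the two planted components; the real procedures differ in how the noise variance is estimated and how the constants are adjusted at each step.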

We start by commenting the

Conclusion

In this paper, we presented a comparison between different methods, originating from different fields of research, for selecting the number of principal components in PCA. Recently introduced statistical testing procedures based on random matrix theory have been compared with the element-wise k-fold (ekf) cross-validation and its corrected version cekf. The ekf is the most widely used and theoretically studied algorithm to select the number of components in PCA. Cross-validation approximation

Conflict of interest

The authors declare no conflict of interest.

Acknowledgments

Research in this paper was partially supported by the European Commission through the FP7 project INFECT (contract number 305340), and by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grant TEC2011-22579.

References (81)

  • C.A. Tracy et al.

    Level-spacing distributions and the Airy kernel

    Phys. Lett. B

    (1993)
  • J. Baik et al.

    Eigenvalues of large sample covariance matrices of spiked population models

    J. Multivar. Anal.

    (2006)
  • J. Camacho et al.

    Multivariate Exploratory Data Analysis (MEDA) toolbox for Matlab

    Chemom. Intell. Lab. Syst.

    (2015)
  • T.K. Karakach et al.

    Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies

    Anal. Chim. Acta

    (2009)
  • L. Tenori et al.

    Metabolomic fingerprint of heart failure in humans: a nuclear magnetic resonance spectroscopy analysis

    Int. J. Cardiol.

    (2013)
  • S. Aeberhard et al.

    Comparative analysis of statistical pattern recognition methods in high dimensional settings

    Pattern Recogn.

    (1994)
  • J. Christensen et al.

    Fluorescence spectroscopy and PARAFAC in the analysis of yogurt

    Chemom. Intell. Lab. Syst.

    (2005)
  • C.M. Andersen et al.

    Quantification and handling of sampling errors in instrumental measurements: a case study

    Chemom. Intell. Lab. Syst.

    (2004)
  • K. Pearson

    On lines and planes of closest fit to systems of points in space

    Lond. Edinb. Dublin Phil. Mag. J. Sci.

    (1901)
  • H. Hotelling

    Analysis of a complex of statistical variables into principal components

    J. Educ. Psychol.

    (1933)
  • E. Saccenti et al.

    Reflections on univariate and multivariate analysis of metabolomics data

    Metabolomics

    (2014)
  • R. Bro et al.

    Cross-validation of component models: a critical look at current methods

    Anal. Bioanal. Chem.

    (2008)
  • I. Jolliffe

    Principal Component Analysis

    (2005)
  • D.A. Jackson

    Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches

    Ecology

    (1993)
  • R.B. Cattell

    The scree test for the number of factors

    Multivar. Behav. Res.

    (1966)
  • E.R. Malinowski

    Theory of error in factor analysis

    Anal. Chem.

    (1977)
  • M.S. Bartlett

    A note on the multiplying factors for various χ2 approximations

    J. R. Stat. Soc. Ser. B Methodol.

    (1954)
  • M.S. Bartlett

    Tests of significance in factor analysis

    Br. J. Stat. Psychol.

    (1950)
  • E.R. Malinowski

    Statistical F-tests for abstract factor analysis and target testing

    J. Chemom.

    (1989)
  • J.L. Horn

    A rationale and test for the number of factors in factor analysis

    Psychometrika

    (1965)
  • E. Ceulemans et al.

    Selecting among three-mode principal component models of different types and complexities: a numerical convex hull based method

    Br. J. Math. Stat. Psychol.

    (2006)
  • T.F. Wilderjans et al.

    CHull: a generic convex-hull-based model selection method

    Behav. Res. Methods

    (2013)
  • W.R. Zwick et al.

    Factors influencing four rules for determining the number of components to retain

    Multivar. Behav. Res.

    (1982)
  • W.R. Zwick et al.

    Comparison of five rules for determining the number of components to retain

    Psychol. Bull.

    (1986)
  • L. Guttman

    Some necessary conditions for common-factor analysis

    Psychometrika

    (1954)
  • K. Faber et al.

    Modification of Malinowski's F-test for abstract factor analysis applied to the Quail Roost II data sets

    J. Chemom.

    (1997)
  • I.M. Johnstone

    On the distribution of the largest eigenvalue in principal components analysis

    Ann. Stat.

    (2001)
  • F. Arteaga et al.

    Dealing with missing data in MSPC: several methods, different interpretations, some examples

    J. Chemom.

    (2002)
  • S. Wold

    Cross-validatory estimation of the number of components in factor and principal components models

    Technometrics

    (1978)
  • J. Camacho et al.

    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects

    J. Chemom.

    (2012)