Quantification and statistical significance analysis of group separation in NMR-based metabonomics studies

https://doi.org/10.1016/j.chemolab.2011.08.009

Abstract

Currently, no standard metrics are used to quantify cluster separation in PCA or PLS-DA scores plots for metabonomics studies or to determine if cluster separation is statistically significant. Lack of such measures makes it virtually impossible to compare independent or inter-laboratory studies and can lead to confusion in the metabonomics literature when authors putatively identify metabolites distinguishing classes of samples based on visual and qualitative inspection of scores plots that exhibit marginal separation. While previous papers have addressed quantification of cluster separation in PCA scores plots, none have advocated routine use of a quantitative measure of separation that is supported by a standard and rigorous assessment of whether or not the cluster separation is statistically significant. Here, quantification and statistical significance of separation of group centroids in PCA and PLS-DA scores plots are considered. The Mahalanobis distance is used to quantify the distance between group centroids; the two-sample Hotelling's T2 statistic is then computed and converted to an F-value, and an F-test is applied to determine whether the cluster separation is statistically significant. We demonstrate the value of this approach using four datasets containing various degrees of separation, ranging from groups that had no apparent visual cluster separation to groups that had no visual cluster overlap. Widespread adoption of such concrete metrics to quantify and evaluate the statistical significance of PCA and PLS-DA cluster separation would help standardize reporting of metabonomics data.

Highlights

► Use of the Mahalanobis distance to quantify cluster separation in PCA scores plots.
► Use of an F-test to assess if cluster separations are statistically significant.
► PCA and PLS-DA results compared with no scaling and Pareto scaling.
► Use of these techniques will help standardize reporting of metabonomics data.
► Redistribution of significant PCA loadings is demonstrated for Pareto scaling.

Introduction

In general, metabonomics studies [1] rely on multivariate data analysis techniques to evaluate massive amounts of data. The two most widely used techniques in the literature are principal component analysis (PCA) [2] and partial least squares-discriminant analysis (PLS-DA) [3]. PCA is an unsupervised method that assesses variance across all observations in the raw data, whereas in a supervised method like PLS-DA, a class discriminator, e.g. healthy sample versus diseased sample, is specified and used to maximize group separation according to class membership. While PLS-DA tends to improve the separation between groups compared to PCA, there is some risk that the increased apparent separation is an artifact of the PLS-DA algorithm and does not reflect variances that truly distinguish between the groups [4].
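To make the unsupervised step concrete, the following is a minimal sketch of PCA scores computed by singular value decomposition, with an optional Pareto scaling step (mean-centering followed by division by the square root of each variable's standard deviation). The function name `pca_scores` and its parameters are hypothetical and chosen for illustration; this excerpt does not show the software or preprocessing the authors actually used.

```python
import numpy as np

def pca_scores(X, scaling="none", n_components=2):
    """Return (scores, loadings, explained_fraction) for a samples-by-variables matrix.

    Hypothetical helper for illustration only; not the authors' code.
    scaling="none"   : mean-centering only
    scaling="pareto" : mean-centering, then divide by sqrt(standard deviation)
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if scaling == "pareto":
        sd = centered.std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # leave constant variables unscaled
        centered = centered / np.sqrt(sd)
    elif scaling != "none":
        raise ValueError("scaling must be 'none' or 'pareto'")
    # Rows of Vt are the principal axes; U * s gives the sample scores.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

The first two columns of `scores` are what a two-dimensional scores plot displays. Because no class labels enter the calculation, any group separation seen in these scores is not imposed by the algorithm, unlike in PLS-DA.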

While statistical validation of metabolite changes between groups identified by either PCA or PLS-DA is essential [5], examples exist in the metabonomics literature where metabolites are identified as changing between groups based on PCA or PLS-DA group separation in scores plots even though the visual separation between groups is questionable. Unfortunately, no standard metric has been introduced or widely adopted to quantify cluster separation and to assess the statistical significance of cluster separation in PCA and PLS-DA scores plots. Wide adoption of such a metric would standardize the reporting of data in the metabonomics literature and make the data easier to interpret.

Quantitative separation of group clusters in PCA and PLS-DA scores plots has been discussed infrequently in the literature; however, a few papers have attempted to address the issue. Fuzzy K-means clustering has been explored as a means of optimizing cluster separations to better classify class membership of samples based on major phenotypical differences and minor phenotype subgroups observed in two different NMR datasets [6]. Another paper developed a novel method, defined as a PCA to Tree analysis, that utilized bootstrapping techniques to improve the quantitative analysis of PCA clustering [7]. The PCA to Tree approach uses a phylogenetic algorithm to assess distance matrices resulting from various metabolic states, which are organized into a phylogenetic-like tree format, and a bootstrap algorithm is used to identify statistically relevant branch separations [7]. Anderson et al. used the J2 criterion, which closely relates to the Davies–Bouldin index, to determine the quality of the clusters [8]. Dixon et al. provided the most extensive paper written on the subject of determining the separation in PCA scores plots [9]. In this paper they used simulated data to evaluate separation indices. The four indices investigated were the Davies–Bouldin index (DBI), silhouette width, modified silhouette width index and overlap coefficient.
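As an illustration of one of the indices named above, the sketch below computes the Davies–Bouldin index in its commonly stated form: for each cluster, take the worst-case ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. Lower values indicate better-separated clusters. The function name `davies_bouldin` is hypothetical, and the exact variant evaluated in the cited papers may differ.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index for labelled points (lower means better separated).

    Hypothetical helper for illustration; uses Euclidean distances and the
    mean distance to the centroid as the within-cluster scatter.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if len(classes) < 2:
        raise ValueError("at least two clusters are required")
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    worst = []
    for i in range(len(classes)):
        # Ratio of combined scatter to centroid distance against every other cluster.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(classes)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))
```

An index of this kind describes how compact and distant the clusters are, but by itself it carries no significance level.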

While all the reports mentioned above explored valid ways to quantify group cluster separation in PCA and PLS-DA scores plots, none advocated reporting a quantitative statistic to characterize cluster separation or reporting whether or not the cluster separations were statistically significant. Here we demonstrate the value of quantifying cluster separation in PCA and PLS-DA scores plots by computing the Mahalanobis distance between the centroids of the two groups. The statistical significance of the cluster separation is then assessed by calculating the two-sample Hotelling's T2 statistic, relating this statistic to an F-value, and applying an F-test. The approach is demonstrated using four experimental data sets that range from exhibiting no visual cluster separation to having complete visual cluster separation.
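The sequence just described (Mahalanobis distance, Hotelling's T2, F-value, F-test) can be sketched as follows for two groups plotted on two components. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the textbook two-sample form with a pooled covariance matrix, where T2 = (n1 n2 / (n1 + n2)) D^2 and F = ((n1 + n2 - p - 1) / (p (n1 + n2 - 2))) T2 with (p, n1 + n2 - p - 1) degrees of freedom. The function name `centroid_separation` is hypothetical. To stay dependency-free beyond NumPy, the p-value uses the closed-form upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2.

```python
import numpy as np

def centroid_separation(a, b):
    """Mahalanobis distance, Hotelling's T2, F-value and p-value for two groups.

    a, b: (n_samples, 2) arrays of scores, one row per sample.
    Hypothetical helper; assumes a pooled covariance matrix and p = 2.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != 2 or b.shape[1] != 2:
        raise ValueError("this sketch handles two-dimensional scores only")
    n1, p = a.shape
    n2 = b.shape[0]
    df2 = n1 + n2 - p - 1
    if df2 <= 0:
        raise ValueError("too few samples for the F-test")
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled within-group covariance of the two clusters.
    pooled = ((n1 - 1) * np.cov(a, rowvar=False)
              + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    d2 = float(diff @ np.linalg.solve(pooled, diff))  # squared Mahalanobis distance
    t2 = n1 * n2 / (n1 + n2) * d2
    f_value = df2 / (p * (n1 + n2 - 2)) * t2
    # Upper tail of F(2, df2): P(F > x) = (1 + 2x/df2) ** (-df2/2); valid for p = 2 only.
    p_value = (1.0 + 2.0 * f_value / df2) ** (-df2 / 2.0)
    return {"mahalanobis": d2 ** 0.5, "t2": t2, "f_value": f_value,
            "df": (p, df2), "p_value": p_value}
```

For example, two unit squares of four points each whose centroids lie 3 units apart give a Mahalanobis distance of about 5.20, an F-value of 22.5 on (2, 5) degrees of freedom, and a p-value of about 0.0032. For scores with more than two components, the closed-form tail no longer applies and the p-value would need an F-distribution routine from a statistics library.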

Section snippets

Datasets

Four datasets from previous metabonomics investigations performed in our lab were chosen for this study. These four datasets contained different amounts of apparent separation in the PCA scores plot based on qualitative visual inspection. The four datasets were initially qualitatively classified as having total separation, partial separation, or no separation. The term total separation means that, based on visual inspection of the PCA scores plot, none of the points from one group overlapped

Results and discussion

Four different experimental data sets were investigated to demonstrate the value of using quantitative metrics to evaluate the magnitude and statistical significance of cluster separations in PCA and PLS-DA scores plots. The results were analyzed in the context of the statistical significance of the PCA loadings that drive cluster separation in the PCA scores plot according to an approach developed previously in our lab [5]. PCA and PLS-DA scores plot separations were assessed by comparing “no

Conclusions

Here, we have demonstrated the utility of applying simple metrics for quantification of cluster separations, and for assessment of the statistical significance of cluster separations, in PCA and PLS-DA scores plots. The methods invoke computation of a Mahalanobis distance to characterize the distance between cluster centroids in two-dimensional PCA and PLS-DA scores plots, and rely on calculation of a Hotelling's T2 statistic, an associated F-value, and application of an F-test to determine the

Acknowledgements

The data collection was conducted at the Ohio Biomedicine Center of Excellence in Structural Biology and Metabonomics at Miami University. The authors acknowledge Lindsey Romick-Rosendale for providing some of the raw NMR data used for the demonstration exercises. The work was funded by the National Institutes of Health, National Cancer Institute; grant number 1R15CA152985.

Cited by (69)

  • A novel regression method: Partial least distance square regression methodology

    2023, Chemometrics and Intelligent Laboratory Systems
  • Putative intestinal permeability markers do not correlate with cardiometabolic health and gut microbiota in humans, except for peptides recognized by a widely used zonulin ELISA kit

    2023, Nutrition, Metabolism and Cardiovascular Diseases
    Citation Excerpt :

    Afterward, Spearman's correlations and principal component analysis (PCA) were obtained with the R packages Corrplot (V 0.92), Hmisc (V 4.6-0), FactomineR (V 2.4), and Factoextra (V 1.0.7). Finally, in the PCA, the difference between cardiometabolically healthy and cardiometabolically abnormal individuals was assessed using the Mahalanobis distance between the group's centroids with the package HDMD (V 1.1) applying the two-sample Hotelling's T2 (V 1.0–8) test, according to Goodpaster and Kennedy [29]. Then, an F-test was used to determine whether the centroids' separation was statistically different.

  • Genome-Wide Transcriptomic Analysis of Intestinal Mucosa in Celiac Disease Patients on a Gluten-Free Diet and Postgluten Challenge

    2021, Cellular and Molecular Gastroenterology and Hepatology
    Citation Excerpt :

    As an additional testing step, PCA was performed on independent and publicly available dataset GSE134900.31 To quantify the separation between healthy control and celiac disease groups in this independent dataset, the F statistics metric was applied for 4-dimensional data.69 All statistical testing was performed using R version 3.5.3 (R Foundation for Statistical Computing, Vienna, Austria).
