Blind separation of analytes in nuclear magnetic resonance spectroscopy: Improved model for nonnegative matrix factorization

https://doi.org/10.1016/j.chemolab.2014.06.004

Highlights

  • New approximate linear mixture model of the magnitude NMR spectra is proposed.

  • It is combined with sparseness constrained nonnegative matrix factorization (sNMF).

  • It yields increased accuracy of blind estimation of analytes from mixture spectra.

  • The method is relevant for biomarker identification studies.

Abstract

We introduce an improved model for sparseness-constrained nonnegative matrix factorization (sNMF) of amplitude nuclear magnetic resonance (NMR) spectra of mixtures into a number of component spectra greater than the number of mixtures. In the proposed method, the selected sNMF algorithm is applied to the square of the amplitude NMR spectra of the mixtures instead of to the amplitude spectra themselves. Afterwards, the square roots of the separated squared component spectra and of the separated concentration matrix yield estimates of the true component amplitude spectra and of the concentration matrix. The proposed model remains linear on average as the number of overlapping components increases, whereas the model based on the amplitude spectra of the mixtures deviates increasingly from linearity. This is demonstrated through a sensitivity analysis. Thus, the proposed model improves the capability of sparse NMF algorithms to separate correlated (overlapping) component spectra from a smaller number of mixture NMR spectra. This is demonstrated in two experimental scenarios: extraction of three correlated component spectra from two 1H NMR mixture spectra and extraction of four correlated component spectra from three COSY NMR mixture spectra. The proposed method can increase efficiency in a spectral library search by reducing the occurrence of false positives and false negatives. That, in turn, can yield better accuracy in biomarker identification studies, which makes the proposed method important for natural product research and the field of metabolic studies.
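The three steps described in the abstract (square the amplitude spectra, factorize, take square roots) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sparse_nmf` is a generic multiplicative-update NMF with an L1 penalty standing in for the unspecified "selected sNMF algorithm", and the function names, the `sparsity` weight, and the synthetic complex Lorentzian lines are all hypothetical choices made for this example.

```python
import numpy as np


def sparse_nmf(X, n_components, sparsity=0.1, n_iter=500, seed=0):
    """Factor X ~= A @ S with A, S >= 0 using multiplicative updates.

    An L1 penalty of weight `sparsity` on S encourages sparse component
    spectra. Generic stand-in only; not the algorithm used in the paper.
    """
    rng = np.random.default_rng(seed)
    n_mixtures, n_freqs = X.shape
    A = rng.random((n_mixtures, n_components)) + 1e-3
    S = rng.random((n_components, n_freqs)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        # Update the mixing (concentration) matrix, then fix its scale so
        # the L1 penalty on S cannot be evaded by rescaling A.
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        A /= np.linalg.norm(A, axis=0, keepdims=True) + eps
        # Update the component spectra with the sparseness penalty.
        S *= (A.T @ X) / (A.T @ A @ S + sparsity + eps)
    return A, S


def separate_squared_magnitude(X_mag, n_components, **kwargs):
    """Apply sparse NMF to squared magnitude spectra, then take square roots."""
    A_sq, S_sq = sparse_nmf(X_mag ** 2, n_components, **kwargs)
    return np.sqrt(A_sq), np.sqrt(S_sq)


# Toy usage: two magnitude mixture spectra built from three synthetic
# complex Lorentzian lines (hypothetical data, for illustration only).
freqs = np.linspace(0.0, 1.0, 200)
centers = (0.3, 0.5, 0.7)
S_complex = np.array([1.0 / (1.0 + 1j * (freqs - c) / 0.02) for c in centers])
A_true = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.3, 0.5]])
X_mag = np.abs(A_true @ S_complex)  # magnitude of the complex mixtures

A_est, S_est = separate_squared_magnitude(X_mag, n_components=3)
print(A_est.shape, S_est.shape)
```

The estimates are only defined up to the usual permutation and scaling of components, and a plain penalized NMF like this one carries no guarantee of recovering the true spectra when there are more components than mixtures; the sketch shows the data flow, not the separation quality reported in the paper.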

Introduction

Metabolites, low-molecular-weight compounds, are functional endpoints of metabolism and reflect genetic and environmental perturbations of the system. Measurement of metabolites in biological fluids, typically urine and serum, is in effect a measurement of a living system's responses to disease, drugs or toxins. Metabolic profiling is therefore an indispensable tool in drug development [1], [2], toxicology studies [3], disease diagnosis [4], [5], and food, nutrition and environmental sciences [6], [7]. Nuclear magnetic resonance (NMR) spectroscopy is emerging as a key technique in metabolomics to identify and quantify the individual compounds of which the biological fluids are composed [8], [9], [10]. The problem is notoriously difficult because of the large number of analytes present in the studied samples. An estimated 2766 metabolites are derived from humans, and many of them are species independent [11]. Quantitative metabolomic profiling of patients with inflammatory bowel disease characterized 44 serum, 37 plasma, and 71 urine metabolites using 1H NMR spectroscopy [12]. Because many analytes are structurally similar, their NMR spectra are highly correlated, with many overlapping peaks. It is thus the complexity of the samples that limits the identification of analytes, which is seen as one of the most challenging tasks in chemical biology [13]. Compound identification is often achieved by matching experimental spectra to spectra stored in a library [14], [15], for example, the BioMagResBank metabolomic database [16] or, in the case of mass spectrometry, the NIST 11 Mass Spectral Library [17]. However, sample complexity (i.e., lack of purity) severely hampers identification of individual compounds contained in the spectra of biological samples [15], [18]. Thus, instead of individual analytes, their mixture is often compared with the reference components in the library.
Algorithmic approaches to solve this problem may be grouped into three main categories. The scoring methods assess the matches between the experimental and theoretical spectra. To this end, similarity scores are developed to reduce the false alarm rate [19], [20]. It is clear that this approach fails when the number of analytes in a mixture spectrum increases. Machine learning approaches try to learn a classifier using reference components from the library and apply it to experimental spectra [21], [22]. The accuracy of this approach depends strongly on the representativeness and size of the training dataset (library). Thus, when the diversity of datasets is high or the number of spectra from a specific group is small, the accuracy of analyte identification will deteriorate. Moreover, the accuracy will be affected further by the overlapping of analyte spectra. The third category of methods is known as source separation, or “deconvolution”, methods. The source separation methods, also known as multivariate curve resolution (MCR) methods, extract the concentration and spectra of individual components from multicomponent mixture spectra [24]. In particular, blind source separation (BSS) methods [25] refer to the class of multivariate data analysis methods capable of blind (unsupervised) extraction of analytes from mixture spectra, i.e., the concentrations of analytes are not required to be known when using the BSS algorithms. It is, however, clear that, under the stated conditions, the related inverse problem is severely ill-posed. To narrow down the infinite number of solutions to an essentially unique one, constraints have to be imposed on the analyte spectra. Typically, constraints include uncorrelatedness, statistical independence, sparseness and nonnegativity. These lead, respectively, to principal component analysis (PCA) [26], independent component analysis (ICA) [27], [28], sparse component analysis (SCA) [29], [30] and nonnegative matrix factorization (NMF) [31].
These methods have already been applied successfully for analyte extraction from spectroscopic mixtures [32], [33], [34], [35], [36], [37], [38], [39]. PCA, ICA and many NMF algorithms require that the unknown number of analytes be less than or equal to the number of mixture spectra available [32], [33], [36], [37], [38], [39]. This is also true for many “deconvolution” methods [40]. This makes them inapplicable for the analysis of multicomponent mixture spectra, such as those acquired from biological samples. Sparseness-based approaches to BSS are currently a highly active research area in signal processing. Unlike PCA and ICA methods, SCA methods enable the solution of an underdetermined BSS problem, i.e., extraction of more analytes than there are mixtures available in 1D and 2D NMR spectroscopy [34], [35]. Sparseness implies that at each frequency (in the case of NMR spectroscopy), only a small number of analytes are active. However, the majority of SCA algorithms require that each analyte be active alone in a certain spectral region [34], [35], [41], [42]. This assumption is increasingly hard to satisfy when the complexity of the mixture grows and when, for the reasons given above, the spectra of multiple analytes overlap. Intuitively, it is clear that, when there are tens or hundreds of analytes in the mixture, it will be virtually impossible to isolate spectral regions where each analyte is active alone. Very recent developments in the blind separation of positive and partially overlapped sources require that each analyte be dominant, instead of active alone, in a certain spectral region [43]. Nevertheless, for complex multicomponent spectra, the same conclusion applies as above. The NMF algorithms that use a sparseness constraint in addition to nonnegativity are capable of solving the nonnegative underdetermined BSS problem without explicitly demanding the existence of spectral regions where each analyte is active alone [44], [45], [46], [47], [48].
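The ill-posedness and the underdetermined setting discussed above can be written compactly. The following is a sketch in generic notation, assumed here for illustration: N mixture spectra, M analytes and T frequency points, with indices n, m and t. It is not a reproduction of the numbered models in the paper.

```latex
\begin{align*}
% Linear mixture model: each mixture spectrum is a weighted sum of analyte spectra
x_n(\omega_t) &= \sum_{m=1}^{M} a_{nm}\, s_m(\omega_t),
  \qquad n = 1,\dots,N,\quad t = 1,\dots,T, \\
% The same model in matrix form
\mathbf{X} &= \mathbf{A}\mathbf{S},
  \qquad \mathbf{X} \in \mathbb{R}^{N \times T},\;
         \mathbf{A} \in \mathbb{R}^{N \times M},\;
         \mathbf{S} \in \mathbb{R}^{M \times T}, \\
% Without constraints, any invertible M x M matrix Q gives another valid factorization
\mathbf{X} &= \left(\mathbf{A}\mathbf{Q}^{-1}\right)\left(\mathbf{Q}\mathbf{S}\right), \\
% Underdetermined case: more analytes than mixtures
M &> N .
\end{align*}
```

Constraints such as nonnegativity and sparseness of the rows of S restrict the admissible matrices Q, ideally down to permutations and positive scalings of the components.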
Among sparseness-constrained NMF algorithms, those that do not require a priori knowledge of the sparseness-related regularization parameter are of practical value [44]. However, in the majority of cases, the NMF algorithms have been applied to extract a number of components that is smaller than the number of available mixture NMR spectra [37], [38]. Herein, we demonstrate how sparseness-constrained NMF ought to be applied to mixture NMR spectra to improve the quality of separation of correlated NMR component spectra. It is conjectured that the proposed method will be practically relevant for the extraction and identification of analytes in biomarker-related studies. It could also increase efficiency in spectral library search procedures through reduced occurrence of false positives and negatives. The increased robustness of the linearity of the proposed model against the number of overlapping components is compared with that of the model based on amplitude mixture spectra and is demonstrated through sensitivity analysis. The proposed method is further compared with state-of-the-art SCA algorithms. To this end, three highly correlated 1H NMR component spectra were extracted from two mixtures [34], and four highly correlated COSY NMR component spectra were extracted from three mixtures [35].
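The linearity claim can be checked numerically with a small simulation. For complex contributions z_m, the identity |Σ z_m|² = Σ |z_m|² + (cross terms) holds exactly, and the cross terms average to zero when the relative phases are uniformly random; no such cancellation exists for |Σ z_m| itself. The sketch below assumes unit-magnitude components with independent random phases, which is an illustrative assumption and not the simulation setup or the sensitivity measures used in the paper; `linearity_ratios` is a hypothetical helper name.

```python
import numpy as np


def linearity_ratios(k, n_trials=20000, seed=0):
    """Return (magnitude_ratio, squared_ratio) for k overlapping components.

    Each component contributes a unit-magnitude complex value with a
    uniformly random phase (assumption made for this illustration only).
      magnitude_ratio = mean |sum z_m|   / sum |z_m|
      squared_ratio   = mean |sum z_m|^2 / sum |z_m|^2
    A ratio of 1 means the corresponding model is linear on average.
    """
    rng = np.random.default_rng(seed)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_trials, k))
    total = np.exp(1j * phases).sum(axis=1)
    # Both denominators equal k because every |z_m| is 1.
    magnitude_ratio = np.abs(total).mean() / k
    squared_ratio = (np.abs(total) ** 2).mean() / k
    return magnitude_ratio, squared_ratio


for k in (1, 2, 5, 10):
    mag, sq = linearity_ratios(k)
    print(f"k={k:2d}  magnitude ratio={mag:.3f}  squared ratio={sq:.3f}")
```

Under these assumptions the squared-magnitude ratio stays close to 1 for every k, while the magnitude ratio falls steadily as more components overlap, which is the qualitative behaviour the sensitivity analysis is meant to expose.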

Section snippets

Linear mixture model of multicomponent NMR spectra

The linear mixture model (LMM) is commonly used in chemometrics [24], [32], [33], [34], [35], [36], [37], [38], [39] in general and in NMR spectroscopy in particular [32], [34], [35], [36], [37], [38]. It is the model upon which linear instantaneous BSS methods are based [25], [28], [29], [30], [31]. Taking into account the fact that NMR signals are intrinsically time domain harmonic signals with their amplitude decaying exponentially with some time constant, [49], the linear mixture model in

Experiment and materials

The proposed model/method was validated on a computational example related to the comparative sensitivity analysis of models (3) and (5), and on two experiments: blind extraction of three analyte 1H NMR spectra from two mixtures and blind extraction of four analyte COSY NMR spectra from three mixtures.

Results and discussion

Fig. 1 shows mean values (± standard deviation) of sensitivities (8), (9) as a function of k = 1,…,10. Results are shown for two different phases of the first component in order to demonstrate that its selection does not play a role in sensitivity analysis. Under the simulation setup described in Section 3.1, it follows that the linearity condition for model (3), implied by Eq. (10), should be ∂|Xn(ωt)|/∂|Sm(ωt)| = anm = 1. Likewise, the linearity condition for model (5), implied by Eq. (11), should

Conclusions

Quantitative metabolomics has shown tremendous potential for studying the nature of biological processes. However, the development of analytical tools for the analysis of complex datasets is necessary for full development of this potential. Samples of biological origin (plasma, urine, saliva or tissues) contain a large number of compounds. Because of this complexity, most state-of-the-art MCR methods fail to provide unambiguous results in NMR spectra analysis. Nevertheless, these methods are

Conflict of interest

The authors declare no conflict of interest.

Acknowledgments

This work has been supported through grant 9.01/232 "Nonlinear component analysis with applications in chemometrics and pathology" funded by the Croatian Science Foundation.

References (54)

  • Y. Sun et al.

    Postprocessing and sparse blind source separation of positive and partially overlapped data

    Signal Proc.

    (2011)
  • N. Gillis et al.

    Using underapproximations for sparse nonnegative matrix factorization

    Pattern Recog.

    (2010)
  • R. Peharz et al.

    Sparse nonnegative matrix factorization with ℓ0-constraints

    Neurocomputing

    (2012)
  • F. Jiru

    Introduction to post-processing techniques

    Eur. J. Radiol.

    (2008)
  • J.K. Nicholson et al.

    Metabonomics: a platform for studying drug toxicity and gene function

    Nat. Rev. Drug Discov.

    (2002)
  • J. Keiser et al.

    Update on the diagnosis and treatment of food-borne trematode infections

    Curr. Opin. Infect. Dis.

    (2010)
  • D.G. Robertson

    Metabonomics in toxicology: a review

    Toxicol. Sci.

    (2005)
  • T. Hyotylainen

    Novel methods in metabolic profiling with a focus on molecular diagnostic applications

    Expert Rev. Mol. Diagn.

    (2012)
  • N.R. Patel et al.

    Biofluid metabonomics using 1H NMR spectroscopy: the road to biomarker discovery in gastroenterology and hepatology

    Expert Rev. Gastroenterol. Hepatol.

    (2012)
  • E.M. Lenz et al.

    Analytical strategies in metabonomics

    J. Proteome Res.

    (2007)
  • S.L. Robinette et al.

    NMR in metabolomics and natural products research: two sides of the same coin

    Acc. Chem. Res.

    (2012)
  • A. Smolinska et al.

    NMR and pattern recognition methods in metabolomics: from data acquisition to biomarker discovery: a review

    Anal. Chim. Acta

    (2012)
  • R. Schicho et al.

    Quantitative metabolomics profiling of serum, plasma and urine by 1H NMR spectroscopy discriminates between patients with inflammatory bowel disease and healthy individuals

    J. Proteome Res.

    (2012)
  • J.K. Nicholson et al.

    Systems biology: metabonomics

    Nature

    (2008)
  • M. Abu-Farha et al.

    Proteomics: from technology developments to biological applications

    Anal. Chem.

    (2009)
  • B.R. Seavey et al.

    A relational database for sequence-specific protein NMR data

    J. Biomol. NMR

    (1991)
  • The NIST 11 Mass Spectral Library
