Blind separation of analytes in nuclear magnetic resonance spectroscopy: Improved model for nonnegative matrix factorization
Introduction
Metabolites are low-molecular-weight compounds that form the functional endpoints of metabolism and reflect genetic and environmental perturbations of the system. Measurement of metabolites in biological fluids, typically urine and serum, is in effect a measurement of a living system's responses to disease, drugs or toxins. Metabolic profiling is therefore an indispensable tool in drug development [1], [2], toxicology studies [3], disease diagnosis [4], [5], and food, nutrition and environmental sciences [6], [7]. Nuclear magnetic resonance (NMR) spectroscopy is emerging as a key technique in metabolomics to identify and quantify the individual compounds of which the biological fluids are composed [8], [9], [10]. The problem is notoriously difficult because of the large number of analytes present in the studied samples. An estimated 2766 metabolites can be derived from humans, and many of them are species independent [11]. Quantitative metabolomic profiling of patients with inflammatory bowel disease characterized 44 serum, 37 plasma, and 71 urine metabolites using 1H NMR spectroscopy [12]. Because many analytes are structurally similar, their NMR spectra are highly correlated, with many overlapping peaks. It is thus the complexity of the samples that limits the identification of analytes, which is seen as one of the most challenging tasks in chemical biology [13]. Compound identification is often achieved by matching experimental spectra to spectra stored in a library [14], [15], for example, the BioMagResBank metabolomic database [16] or, in the case of mass spectrometry, the NIST 11 Mass Spectral Library [17]. However, sample complexity (i.e., lack of purity) severely hampers identification of individual compounds contained in the spectra of biological samples [15], [18]. Thus, it is often the mixture, rather than the individual analytes, that is compared with the reference components in the library.
Algorithmic approaches to solve this problem may be grouped into three main categories. Scoring methods assess the matches between the experimental and theoretical spectra. To this end, similarity scores are developed to reduce the false alarm rate [19], [20]. This approach fails when the number of analytes in a mixture spectrum increases. Machine learning approaches try to learn a classifier using reference components from the library and apply it to experimental spectra [21], [22]. The accuracy of this approach depends strongly on the representativeness and size of the training dataset (library). Thus, when the diversity of datasets is high or the number of spectra from a specific group is small, the accuracy of analyte identification deteriorates. The accuracy is affected further by the overlapping of analyte spectra. The third category of methods is known as source separation, or “deconvolution”, methods. The source separation methods, also known as multivariate curve resolution (MCR) methods, extract the concentrations and spectra of individual components from multicomponent mixture spectra [24]. In particular, blind source separation (BSS) methods [25] refer to the class of multivariate data analysis methods capable of blind (unsupervised) extraction of analytes from mixture spectra, i.e., the concentrations of analytes are not required to be known when using the BSS algorithms. Under these conditions, however, the related inverse problem is severely ill-posed. To narrow down the infinite number of solutions to an essentially unique one, constraints have to be imposed on the analyte spectra. Typically, constraints include uncorrelatedness, statistical independence, sparseness and nonnegativity. These constraints lead, respectively, to principal component analysis (PCA) [26], independent component analysis (ICA) [27], [28], sparse component analysis (SCA) [29], [30] and nonnegative matrix factorization (NMF) [31].
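The ill-posedness of the blind inverse problem can be illustrated numerically. The following is a minimal sketch, assuming a linear mixture X = AS with hypothetical sizes (three mixtures, three analytes, 200 frequency points) and randomly generated matrices; the names `A`, `S`, `T` are illustrative and are not taken from the paper. It shows that any invertible matrix T yields a second pair of factors that reproduces the same data, which is why constraints on the analyte spectra are needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 mixture spectra, 3 analytes, 200 frequency points.
A = rng.random((3, 3))      # unknown concentration (mixing) matrix
S = rng.random((3, 200))    # unknown analyte spectra, one per row
X = A @ S                   # observed mixture spectra

# Any invertible T gives another factorization of the same data.
# T is strictly diagonally dominant here, hence invertible.
T = rng.random((3, 3)) + 3.0 * np.eye(3)
A_alt = A @ np.linalg.inv(T)
S_alt = T @ S

same_data = np.allclose(X, A_alt @ S_alt)       # both pairs explain X
different_sources = not np.allclose(S, S_alt)   # yet the "spectra" differ
print(same_data, different_sources)
```

Without additional constraints, the data alone cannot distinguish `(A, S)` from `(A_alt, S_alt)`; nonnegativity and sparseness restrict which matrices T remain admissible.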
These methods have already been applied successfully for analyte extraction from spectroscopic mixtures [32], [33], [34], [35], [36], [37], [38], [39]. PCA, ICA and many NMF algorithms require that the unknown number of analytes be less than or equal to the number of mixture spectra available [32], [33], [36], [37], [38], [39]. This is also true for many “deconvolution” methods [40]. This makes them inapplicable to the analysis of multicomponent mixture spectra, such as those acquired from biological samples. Sparseness-based approaches to BSS are currently a highly active research area in signal processing. Unlike PCA and ICA methods, SCA methods enable the solution of an underdetermined BSS problem, i.e., extraction of more analytes than there are mixtures available, in 1D and 2D NMR spectroscopy [34], [35]. Sparseness implies that at each frequency (in the case of NMR spectroscopy), only a small number of analytes are active. However, the majority of SCA algorithms require that each analyte be active alone in a certain spectral region [34], [35], [41], [42]. This assumption is increasingly hard to satisfy as the complexity of the mixture grows and, for the reasons elaborated previously, the spectra of multiple analytes overlap. Intuitively, when there are tens or hundreds of analytes in the mixture, it will be virtually impossible to isolate spectral regions where each analyte is active alone. Very recent developments in the blind separation of positive and partially overlapped sources require that each analyte be dominant, instead of active alone, in a certain spectral region [43]. Nevertheless, for complex multicomponent spectra, the same conclusion applies as above. The NMF algorithms that use a sparseness constraint in addition to nonnegativity are capable of solving the nonnegative underdetermined BSS problem without explicitly demanding the existence of spectral regions where each analyte is active alone [44], [45], [46], [47], [48].
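To make the idea of sparseness-constrained NMF on an underdetermined problem concrete, the sketch below factorizes two synthetic mixtures into three nonnegative components. It is a generic multiplicative-update scheme with an L1 penalty on the spectra, not the algorithm of any cited reference and not the authors' method. The function name `sparse_nmf`, the penalty weight `lam`, the Lorentzian test spectra and the mixing matrix are all assumptions made for illustration, and the sketch makes no claim about how accurately the true components are recovered.

```python
import numpy as np

def sparse_nmf(X, n_components, lam=0.01, n_iter=500, seed=0):
    """Multiplicative-update NMF with an L1 penalty (weight lam) on S."""
    eps = 1e-12
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], n_components)) + eps
    S = rng.random((n_components, X.shape[1])) + eps
    x_norm = np.linalg.norm(X)
    rel_err = [np.linalg.norm(X - A @ S) / x_norm]   # error of the random start
    for _ in range(n_iter):
        # Update S; lam in the denominator shrinks small entries toward zero.
        S *= (A.T @ X) / (A.T @ A @ S + lam + eps)
        # Update A (no penalty on the concentrations).
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        # Remove the scaling ambiguity: unit-norm columns of A, rescaled rows of S.
        norms = np.linalg.norm(A, axis=0) + eps
        A /= norms
        S *= norms[:, None]
        rel_err.append(np.linalg.norm(X - A @ S) / x_norm)
    return A, S, rel_err

def lorentzian(w, center, width):
    """Nonnegative Lorentzian line shape with unit height."""
    return 1.0 / (1.0 + ((w - center) / width) ** 2)

# Hypothetical test case: 3 partially overlapping spectra, only 2 mixtures.
w = np.arange(300.0)
S_true = np.vstack([
    lorentzian(w, 60, 3) + 0.5 * lorentzian(w, 150, 3),
    lorentzian(w, 66, 3) + 0.7 * lorentzian(w, 220, 3),
    lorentzian(w, 154, 3) + 0.6 * lorentzian(w, 260, 3),
])
A_true = np.array([[0.8, 0.3, 0.5],
                   [0.2, 0.7, 0.5]])
X = A_true @ S_true

A_est, S_est, rel_err = sparse_nmf(X, n_components=3)
print(f"relative error: {rel_err[0]:.3f} -> {rel_err[-1]:.3f}")
```

Because the penalty acts only on `S`, the columns of `A` are renormalized in every iteration; otherwise the penalty could be evaded by scaling `S` down and `A` up. The value of `lam` is a tuning parameter in this sketch, which is exactly the kind of a priori choice that some of the cited NMF algorithms avoid.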
In this respect, the NMF algorithms that do not require a priori knowledge of the sparseness-related regularization parameter are of practical value [44]. However, in the majority of cases, the NMF algorithms have been applied to extract a number of components that is smaller than the number of available mixture NMR spectra [37], [38]. Herein, we demonstrate how sparseness-constrained NMF ought to be applied to mixture NMR spectra to improve the quality of separation of correlated NMR component spectra. It is conjectured that the proposed method will be practically relevant for the extraction and identification of analytes in biomarker-related studies. It could also increase efficiency in spectral library search procedures through reduced occurrence of false positives and negatives. The increased robustness of the linearity of the proposed model against the number of overlapping components, compared with the model based on amplitude mixture spectra, is demonstrated through sensitivity analysis. The proposed method is further compared with state-of-the-art SCA algorithms. To this end, three highly correlated 1H NMR component spectra were extracted from two mixtures [34], and four highly correlated COSY NMR component spectra were extracted from three mixtures [35].
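Why a model based on amplitude spectra can lose linearity when components overlap follows from the triangle inequality: the magnitude of a sum of complex spectra is at most the sum of the magnitudes, with equality only where the phases coincide. The sketch below illustrates this with two hypothetical complex Lorentzian lines. The line positions, widths, phases and weights are assumptions chosen for illustration, and the code is not the sensitivity analysis carried out in the paper.

```python
import numpy as np

def complex_lorentzian(w, center, width, phase):
    """Complex Lorentzian line with a constant phase offset (radians)."""
    return np.exp(1j * phase) / (1.0 + 1j * (w - center) / width)

w = np.linspace(0.0, 100.0, 1001)

# Two hypothetical overlapping components with different phases.
s1 = complex_lorentzian(w, 48.0, 2.0, phase=0.0)
s2 = complex_lorentzian(w, 52.0, 2.0, phase=np.pi / 3)
a1, a2 = 0.6, 0.4

x = a1 * s1 + a2 * s2                                 # linear in the complex domain
amp_of_mixture = np.abs(x)                            # amplitude spectrum of the mixture
mixture_of_amps = a1 * np.abs(s1) + a2 * np.abs(s2)   # linear amplitude model

# Nonnegative by the triangle inequality; zero only where phases agree.
gap = mixture_of_amps - amp_of_mixture
print(f"largest deviation from linearity: {gap.max():.3f}")
```

In this toy setting the deviation is largest in the region where the two lines overlap, and it vanishes for well-separated lines, which is consistent with the concern that overlap degrades the linearity of an amplitude-based mixture model.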
Section snippets
Linear mixture model of multicomponent NMR spectra
The linear mixture model (LMM) is commonly used in chemometrics in general [24], [32], [33], [34], [35], [36], [37], [38], [39] and in NMR spectroscopy in particular [32], [34], [35], [36], [37], [38]. It is the model upon which linear instantaneous BSS methods are based [25], [28], [29], [30], [31]. Taking into account the fact that NMR signals are intrinsically time domain harmonic signals with their amplitude decaying exponentially with some time constant [49], the linear mixture model in
Experiment and materials
The proposed model/method was validated on a computational example related to the comparative sensitivity analysis of models (3), (5) and two experiments: blind extraction of three analyte 1H NMR spectra from two mixtures and blind extraction of four analyte COSY NMR spectra from three mixtures.
Results and discussion
Fig. 1 shows mean values (± standard deviation) of sensitivities (8), (9) as a function of k = 1,…,10. Results are shown for two different phases of the first component in order to demonstrate that its selection does not play a role in sensitivity analysis. Under the simulation setup described in Section 3.1, it follows that the linearity condition for model (3), implied by Eq. (10), should be $\partial |X_n(\omega_t)| / \partial |S_m(\omega_t)| = a_{nm} = 1$. Likewise, the linearity condition for model (5), implied by Eq. (11), should
Conclusions
Quantitative metabolomics has shown tremendous potential for studying the nature of biological processes. However, the development of analytical tools for the analysis of complex datasets is necessary for full development of this potential. Samples of biological origin (plasma, urine, saliva or tissues) contain a large number of compounds. Because of this complexity, most state-of-the-art MCR methods fail to provide unambiguous results in NMR spectra analysis. Nevertheless, these methods are
Conflict of interest
The authors declare no conflict of interest.
Acknowledgments
This work has been supported through grant 9.01/232 "Nonlinear component analysis with applications in chemometrics and pathology" funded by the Croatian Science Foundation.
References (54)
Metabonomics: applications to food science and nutrition research. Trends Food Sci. Technol. (2008)
Biodegradation pathway of mesotrione: complementarities of NMR, LC–NMR and LC–MS for qualitative and quantitative metabolic profiling. Chemosphere (2010)
NMR spectroscopic analysis of mixtures: from structure to function. Curr. Opin. Chem. Biol. (2011)
An effective two-stage spectral library search approach based on lifting wavelet decomposition for complicated mass spectra. Chemom. Intell. Lab. Syst. (2014)
Independent component analysis, a new concept? Signal Proc. (1994)
Model-free analysis of mixtures by NMR using blind source separation. J. Magn. Reson. (1998)
An information-theoretic methodology for the resolution of pure component spectra without prior information using spectroscopic measurements. Chemom. Intell. Lab. Syst. (2004)
Extraction of multiple pure component 1H and 13C NMR spectra from two mixtures: novel solution obtained by sparse component analysis-based blind decomposition. Anal. Chim. Acta (2009)
Blind source separation of positive and partially correlated data. Signal Proc. (2005)
Blind spatial unmixing of multispectral images: new methods combining sparse component analysis, clustering and nonnegativity constraints. Pattern Recog. (2012)
Postprocessing and sparse blind source separation of positive and partially overlapped data. Signal Proc.
Using underapproximations for sparse nonnegative matrix factorization. Pattern Recog.
Sparse nonnegative matrix factorization with ℓ0-constraints. Neurocomputing
Introduction to post-processing techniques. Eur. J. Radiol.
Metabonomics: a platform for studying drug toxicity and gene function. Nat. Rev. Drug Discov.
Update on the diagnosis and treatment of food-borne trematode infections. Curr. Opin. Infect. Dis.
Metabonomics in toxicology: a review. Toxicol. Sci.
Novel methods in metabolic profiling with a focus on molecular diagnostic applications. Expert Rev. Mol. Diagn.
Biofluid metabonomics using 1H NMR spectroscopy: the road to biomarker discovery in gastroenterology and hepatology. Expert Rev. Gastroenterol. Hepatol.
Analytical strategies in metabonomics. J. Proteome Res.
NMR in metabolomics and natural products research: two sides of the same coin. Acc. Chem. Res.
NMR and pattern recognition methods in metabolomics: from data acquisition to biomarker discovery: a review. Anal. Chim. Acta
Quantitative metabolomics profiling of serum, plasma and urine by 1H NMR spectroscopy discriminates between patients with inflammatory bowel disease and healthy individuals. J. Proteome Res.
Systems biology: metabonomics. Nature
Proteomics: from technology developments to biological applications. Anal. Chem.
A relational database for sequence-specific protein NMR data. J. Biomol. NMR
The NIST 11 Mass Spectral Library