Development of a fast and accurate method of 13C NMR chemical shift prediction
Introduction
In various fields of chemistry such as the investigation of natural products or the design of new compounds, scientists often need to either determine de novo the structure of an unknown or new compound or to verify a hypothetical chemical structure. This process, known as Structure Elucidation, is based on the analysis of available spectral data. Nuclear Magnetic Resonance (NMR) spectroscopy is certainly one of the main analytical methods applied to these challenges and is a powerful technique for acquiring highly informative spectra associated with a structure.
Nuclear magnetic resonance (NMR) is a physical phenomenon based upon the quantum mechanical magnetic properties of an atom's nucleus. Magnetic nuclei, such as 1H and 13C, absorb radiofrequency energy when placed in a magnetic field, at a frequency that depends on the field strength and on the identity of the nucleus. When this absorption occurs, the nucleus is described as being in resonance. Different atoms within a molecule resonate at slightly different frequencies at a given field strength, and observing these resonance frequencies reveals structural information about the molecule. This variation in resonance frequency is known as the chemical shift and is the most important characteristic of a nucleus in terms of NMR. The shift of an individual atom depends on its atomic properties, such as the type of nucleus, its hybridization state and the overall electronic environment surrounding the nucleus (bonds, conjugation networks, etc.).
In an NMR spectrum each atom of interest is associated with a peak characterized by a chemical shift, an intensity and one or more coupling constants arising from interacting nuclei. Since nuclei are themselves small magnets, they can influence each other and change the energy of nearby nuclei as they resonate. This phenomenon is known as spin–spin coupling. The most important type of interaction is scalar coupling, which occurs between two nuclei, is mediated through chemical bonds and is typically observed up to three bonds away. Longer-range couplings can also be observed.
For a complex molecule, a spectrum can be an intricate distribution of chemical shifts and homonuclear and heteronuclear couplings, and can be regarded as a fingerprint of the structure. A spectrum commonly contains sufficient information to elucidate a chemical structure, although for a complex structure a series of different spectra and other forms of analytical data are generally required.
Computer-based approaches to the analysis of complex data sets have revolutionized chemistry, physics and biology, and applications of computers to structure elucidation problems are a natural extension. A comprehensive review of contemporary computer-based structure elucidation techniques can be found in [1]. An obvious first step when determining the structure of a compound from its NMR spectra is to search available NMR databases. There are numerous databases available, both free and commercial, allowing users to search by chemical shift information and to identify consistent chemical structures. If the structure is not found among known structures, then the procedure for computer-based structure elucidation consists of two parts: generating all possible isomers and choosing the structure that best matches the experimental data. All possible isomers of a structure can be generated from its molecular formula, but the number of structures obtained in this case is usually impractically large. A better approach is to hypothesize a set of possible isomers using information from mass spectrometry data, appropriate 2D NMR correlations, user input regarding reaction materials or prior knowledge of the system under study, etc. In both cases, the user will then interpret the available information or acquire additional supporting data to whittle down the list of isomers to a final structure consistent with all of the data.
If spectral data are incomplete, either by nature or because of the poor quality of the experimental data, as often occurs in everyday situations, more than one structure can correspond to the spectral data. The challenging problem of selecting the best structure from several candidates then arises. The candidate structures are generally quite similar to one another and match the spectral data about equally well. To resolve this challenge, comparing the calculated and experimental 13C NMR spectra of the generated structures is the most powerful method available [1]. To determine numerically how likely a structure is, the calculated and experimental chemical shifts are compared atom by atom. The candidate isomers are then ranked by the resulting differences between predicted and experimental shifts, and the structure with the smallest difference is considered to be the correct one.
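The ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mean absolute deviation is one plausible choice of "difference" (the text does not name a specific metric), the experimental shifts are assumed to be already assigned to atoms in the same order as the predictions, and all shift values and isomer names are hypothetical.

```python
from statistics import mean

def mean_abs_deviation(predicted, experimental):
    """Average |predicted - experimental| over atoms, in ppm."""
    if len(predicted) != len(experimental):
        raise ValueError("shift lists must have the same length")
    return mean(abs(p - e) for p, e in zip(predicted, experimental))

def rank_candidates(candidates, experimental):
    """Return (name, deviation) pairs sorted from best to worst match."""
    scored = [(name, mean_abs_deviation(shifts, experimental))
              for name, shifts in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical data: experimental shifts and predictions for two isomers.
experimental = [14.1, 22.8, 31.9, 62.5]
candidates = {
    "isomer_A": [13.9, 23.0, 32.4, 61.8],
    "isomer_B": [18.5, 29.7, 40.2, 70.1],
}
ranking = rank_candidates(candidates, experimental)
best_name, best_dev = ranking[0]
```

Here `isomer_A` ranks first because its predicted shifts deviate least from the experimental values. In practice the atom-to-peak assignment is itself part of the problem, so a real implementation would need an assignment step before this comparison.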
Three different classes of algorithms have been developed to predict chemical shifts. Database-based algorithms search for similar structures in a large database of assigned chemical structures. When similar structures are found, it is possible to calculate the chemical shifts of the atoms of interest. The quality of prediction strongly depends on both the database size and the diversity of structures contained in the database. The Hierarchical Organization of Spherical Environments (HOSE-code [2]) is the most popular method to predict chemical shifts. It is based on an atom-centered encoding of a structure. For each atom in a queried structure, the algorithm describes the environment of the atom in order to find atoms with a similar environment in the database. The predicted shift is an aggregate of the values recovered, commonly an average or weighted average. An obvious disadvantage of such algorithms is their poor performance for diverse structures with no representative structural fragments within the database. These algorithms also work quite slowly, taking of the order of seconds to tens of seconds for complex structures. Despite these limitations, database-based algorithms have been used as the basis of CSEARCH [3], [4], in commercial prediction software (for example ACD/Labs [5] and SpecInfo [6]) and in the open-source application NMRShiftDB [7].
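A database lookup of this kind might look like the following minimal sketch. The environment strings are invented stand-ins and do not follow the real HOSE-code grammar, the toy database is hypothetical, and the fallback from a detailed code to a shallower one is an assumption about how such predictors commonly handle missing environments, not something stated in the text.

```python
from statistics import mean

# Hypothetical toy database: environment code -> assigned shifts (ppm).
# Real HOSE codes follow a specific grammar; these strings are stand-ins.
DATABASE = {
    "C;sphere1=CHH;sphere2=CO": [31.8, 32.1, 32.4],
    "C;sphere1=CHH": [22.5, 29.9, 31.8, 32.1, 32.4, 34.0],
}

def predict_shift(codes_by_depth, database):
    """Try the most detailed environment code first, then shallower ones.

    codes_by_depth lists codes for the same atom, ordered from the
    deepest to the shallowest description.
    Returns (average_shift, code_used), or None if nothing matches.
    """
    for code in codes_by_depth:
        hits = database.get(code)
        if hits:
            return mean(hits), code
    return None

query = [
    "C;sphere1=CHH;sphere2=CO;sphere3=H",  # not present in the database
    "C;sphere1=CHH;sphere2=CO",
    "C;sphere1=CHH",
]
result = predict_shift(query, DATABASE)
```

The sketch also makes the stated weakness visible: an atom whose environment has no counterpart in the database yields no prediction at all (`None`), or only a coarse one from a shallow code.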
Rule-based or incremental algorithms also require databases of structures with associated chemical shifts in order to derive a set of empirical rules. The approach was introduced in 1964 by Grant and Paul [8] and was later extended and applied in a variety of software packages [9], [10], [11], [12], [13], [14]. A molecule can be described as a combination of either atoms or larger fragments. Fragment-based methods are generally more attractive from a chemist's point of view, but they have their own disadvantages. Specifically, the procedure for fragment identification is markedly slower than simply assigning atom types. Moreover, the description can be ambiguous in the case of overlapping fragments. As a result of increased computational power, it became possible to calculate rules not only via least-squares regression approaches [15], [16] but also using artificial neural networks [17], [18], [19], [20]. The relative advantages and disadvantages of these methods remain a subject of debate [21].
Chemical shifts can also be calculated using quantum mechanical approaches. These methods do not rely on empirical knowledge and are the only option for hypothetical or unique structures [22]. Unfortunately, the time required for the calculations, 3–24 h for one structure, is prohibitively long for routine applications.
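To illustrate what an incremental rule looks like, the sketch below applies a Grant-Paul-type additive formula to alkane carbons. The parameter values are the ones commonly quoted in textbooks for this scheme and are an assumption here, since the text does not list them; steric correction terms are omitted, and the function and dictionary names are hypothetical.

```python
# Commonly quoted Grant-Paul-type parameters for alkane carbons (ppm).
# Treat the numbers as illustrative; steric correction terms are omitted.
BASE = -2.3
INCREMENTS = {
    "alpha": 9.1,
    "beta": 9.4,
    "gamma": -2.5,
    "delta": 0.3,
    "epsilon": 0.1,
}

def additive_shift(neighbour_counts):
    """Sum the base value and one increment per neighbouring carbon."""
    return BASE + sum(INCREMENTS[pos] * n for pos, n in neighbour_counts.items())

# C1 of n-pentane: one carbon at each of the alpha to delta positions.
c1 = additive_shift({"alpha": 1, "beta": 1, "gamma": 1, "delta": 1})
# C3 of n-pentane: two alpha and two beta carbons.
c3 = additive_shift({"alpha": 2, "beta": 2})
```

With these values the sketch gives about 14.0 ppm for C1 and 34.7 ppm for C3 of n-pentane. Because a prediction is just a short sum of tabulated increments, schemes of this kind are fast, which is the property the atom-based approach relies on.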
The speed of prediction becomes a very important criterion for general application (see review [1] and refs. [23], [24]) as the number of isomers can be huge. For example, in a real-world task [25] for an unknown structure with molecular formula C29H32N2O5 and with both 1D and 2D NMR spectral data available, a complete set of possible structures (which turned out to number more than 12,000) was generated in 20 min. Predicting the 13C NMR spectra for all structures required 24 h, which is far too long, especially when compared with the generation time. An appropriate algorithm to predict NMR spectra needs to be both as precise as possible, to provide good structure ranking, and fast enough to process a large number of structures within a reasonable time.
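The throughput implied by this example can be worked out directly. The sketch below is back-of-the-envelope arithmetic only: it takes 12,000 as a lower bound on the number of structures, counts 29 carbons per structure from the molecular formula, and uses the rate of up to 10,000 shifts per second quoted in the Conclusions as a best case.

```python
structures = 12_000          # "more than 12,000" in the text, so a lower bound
carbons_per_structure = 29   # from the molecular formula C29H32N2O5
total_shifts = structures * carbons_per_structure

slow_seconds = 24 * 3600     # reported prediction time of 24 h
slow_rate = total_shifts / slow_seconds   # shifts per second

fast_rate = 10_000           # shifts per second, best case from the Conclusions
fast_seconds = total_shifts / fast_rate
```

Under these assumptions the 24 h run corresponds to roughly 4 shifts per second for 348,000 shifts, whereas a predictor running at 10,000 shifts per second would need about 35 s for the same set, well below the 20 min spent generating the structures.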
In recent years numerous papers have been published on the prediction of chemical shifts for single homogeneous classes of compounds by different methods using in-house and commercial spectroscopic databases. Among the investigated structures were polycyclic aromatic compounds [26], steroids [27], alkanes [28] and substituted benzenes [29]. An attempt has also been made to use an artificial neural network for predicting and classifying 13C chemical shifts based on functional group descriptors [30]. Such methods of predicting NMR spectra are quite limited in use because they lack generality: they are applied to homogeneous classes of structures only and cannot be used to predict the spectra of dissimilar compounds, as the results will not be precise enough. Most scientists would likely sacrifice some accuracy for generality in order to obtain satisfactory predictions of NMR spectra for all classes of chemical compounds. Most existing “general” methods to predict NMR spectra of diverse structures are based on HOSE-code [3], [4], [5], [6], [7]. A number of attempts to develop a “general” additive scheme using linear regression have been made: the first was based on atomic increments and was developed in 1977 [9]. Another scheme, based on fragment increments, was created in 1990 [10], [11], and an atomic increments scheme using a neural network was developed in 2002 [20]. With the present availability of NMR shift data in electronic form and the computational power of even the most modest desktop computer, the development of a new generalized additive scheme is quite reasonable.
It is obvious that the accuracy of chemical shift calculations depends on the extent of the diversity of the structures taken into account. We were interested in the development of a universal algorithm using a database of approximately 2 million chemical shifts of chemically different structures available to us. In this article we discuss the design of an atom-based regression scheme with prediction performance in both speed and quality sufficient to be utilized in modern structure elucidation software.
Data collection
The carbon chemical shift database used in this work is comprised of approximately 2 million 13C chemical shift values. Spectral assignments were taken from articles published in dozens of contemporary journals. Care was taken to avoid overlap between the datasets used for training and validation. The training dataset was compiled using experimental data published from the early 1990s up until 2004. The validation dataset contained data originally published in 2005 and consisted of 170,000 13C
Results and discussion
Our goal was to provide, if possible, a global optimization of the structure description parameters in order to choose the best scheme. To avoid performing unnecessary computations, some combinations of parameters that were clearly unacceptable in advance were not tested. For example, there is no reason to check a description that takes into account the substituents in the first sphere only.
One round of computations was aimed at defining:
1. The number of spheres sufficient to describe an atomic environment
2. The best
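The notion of "spheres" around a central atom can be illustrated with a breadth-first search over the molecular graph, grouping atoms by their topological distance (number of bonds) from the atom of interest. This is a minimal sketch and not the authors' encoding: the adjacency-dictionary representation, the function name and the example chain are all hypothetical.

```python
from collections import deque

def spheres(adjacency, start, max_sphere):
    """Group atoms by topological distance (bond count) from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_sphere:
            continue  # do not expand beyond the outermost sphere
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    result = {d: [] for d in range(1, max_sphere + 1)}
    for atom, d in distance.items():
        if d > 0:
            result[d].append(atom)
    return {d: sorted(atoms) for d, atoms in result.items()}

# Hypothetical heavy-atom chain C1-C2-C3-C4-O5 (hydrogens left implicit).
chain = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "O5"],
    "O5": ["C4"],
}
env = spheres(chain, "C2", 2)
```

For atom `C2` with two spheres, the first sphere contains `C1` and `C3` and the second contains `C4`; the oxygen only enters the description when a third sphere is included, which is why the number of spheres is a parameter worth optimizing.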
Conclusions
An algorithm to predict up to 10,000 13C chemical shifts per second with a standard deviation of 2.76 ppm has been developed. Such a high speed is achieved by using atomic increments instead of larger fragments, and a regression scheme instead of a database-based approach. The accuracy of the method is the result of careful selection of a structure encoding method and availability of a large training set.
The final structure encoding scheme takes into account atoms separated from the central
References (39)
- et al. Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation. Prog. Nucl. Magn. Reson. Spectrosc. (2008)
- HOSE - novel substructure code. Anal. Chim. Acta (1978)
- et al. A minicomputer program based on additivity rules for the estimation of 13C NMR chemical shifts. Anal. Chim. Acta (1977)
- et al. A computer program for the prediction of 13C NMR chemical shifts of organic compounds. Anal. Chim. Acta (1990)
- et al. New parameters for predicting 1H NMR chemical shifts of protons attached to carbon atoms. Anal. Chim. Acta (1995)
- et al. An approach to predict the C-13 NMR chemical shifts of acrylonitrile copolymers using artificial neural network. Eur. Polym. J. (2007)
- et al. Using neural networks for C-13 NMR chemical shift prediction: comparison with traditional methods. J. Magn. Reson. (2002)
- et al. Structure validation in computer-supported structure elucidation: 13C NMR shift predictions for steroids. Chemometr. Intell. Lab. Syst. (2003)
- Path models with latent variables: the NIPALS approach
- W. Robien, CSEARCH:...
- Das CSEARCH-NMR-Datenbanksystem. Nachr. Chem. Tech. Lab.
- Chemical Concepts
- NMRShiftDB: constructing a free chemical information system with open-source components. J. Chem. Inf. Comput. Sci.
- Carbon-13 magnetic resonance. II. Chemical shift data for the alkanes. J. Am. Chem. Soc.
- C13shift - a computer-program for the prediction of C-13 NMR-spectra based on an open set of additivity rules. J. Chem. Inf. Comput. Sci.
- A computer program for the automatic estimation of 1H NMR chemical shifts. Anal. Chim. Acta
- Simulation of the 13C nuclear magnetic resonance spectra of ribonucleosides using multiple linear regression analysis and neural networks. J. Chem. Inf. Comput. Sci.
Cited by (40)
Correlation between 13C NMR chemical shifts and complete sets of descriptors of natural coumarin derivatives
2019, Chemometrics and Intelligent Laboratory Systems
Citation Excerpt: The statistical quality of a four-descriptor model is satisfactory, with correlation coefficient R2 being 0.9341 and rms error being 10.0590 ppm, although the descriptors do not reflect three dimensional information, especially for chiral-molecular structures. Blinov et al. [10] described a fast and accurate prediction method for 13C NMR chemical shift with a standard deviation of 2.76 ppm. Structure descriptions used for models are based on individual atoms rather than functional groups, including element, topological distance to the central atom, hybridization, valence, number of attached hydrogen atoms, and so on.
Hybrid MS/NMR methods on the prioritization of natural products: Applications in drug discovery
2018, Journal of Pharmaceutical and Biomedical Analysis
Citation Excerpt: However, several computational tools allow predicting spectral data based on machine learning algorithms, leading to the generation of in silico databases. Concerning tandem MS libraries, this limitation can be overcome by accurate fragment prediction, enabling the access to in silico databases, containing fragmentation spectra prediction or spectral simulation functions [150,151]. Based on NORINE, DNP and MarinLit, GNPS started to include in silico tandem MS spectral prediction of nonribosomal peptides (NRPS) and ribosomally synthesized and post-translationally modified peptides (RiPP), enabling the dereplication of peptidic natural product [152].
Prediction of 13C NMR chemical shifts by artificial neural network. I. Partial charge model as atomic descriptor
2016, Chemometrics and Intelligent Laboratory Systems
Citation Excerpt: In other words, how can a unique description of each carbon atom in molecule be obtained? There are actual descriptors, which are based on geometry and physical–chemical properties: bond type, atom environment, electronegativity, Cartesian coordinates and van der Waals radius of each atom, charge densities, partial charge, etc. [16–18]. It is known that partial charges are determined by electron density, which has the major influence on chemical shifts.
Conformational analysis from statistical treatment of 13C NMR chemical shifts
2015, Chemometrics and Intelligent Laboratory Systems
Citation Excerpt: Recently, the employment of liquid crystals has also been proposed, even though they can only be utilized in water [28]. Although chemical shifts are also very sensitive to the molecular environment and they can be easily obtained from NMR spectra, they have been barely used, since empirical algorithms [29] do not take into account geometry factors when evaluating them. On the other hand, ab initio methods provide the shielding constants for each conformer but their values do not seem to be accurate enough.