Development of a fast and accurate method of 13C NMR chemical shift prediction

doi:10.1016/j.chemolab.2009.01.010

Chemometrics and Intelligent Laboratory Systems

Volume 97, Issue 1, 15 May 2009, Pages 91-97

Abstract

In this article we describe a fast and accurate method of ¹³C NMR chemical shift prediction. The high speed of the chemical shift calculation is achieved using a simple structure description scheme based on individual atoms rather than functional groups. The systematic choice of an appropriate encoding scheme and the use of partial least squares regression on a large training set have resulted in a robust and fast algorithm. The approach provides accuracy comparable with other well-known approaches while running up to a thousand times faster.

Introduction

In various fields of chemistry such as the investigation of natural products or the design of new compounds, scientists often need to either determine de novo the structure of an unknown or new compound or to verify a hypothetical chemical structure. This process, known as Structure Elucidation, is based on the analysis of available spectral data. Nuclear Magnetic Resonance (NMR) spectroscopy is certainly one of the main analytical methods applied to these challenges and is a powerful technique for acquiring highly informative spectra associated with a structure.

Nuclear magnetic resonance (NMR) is a physical phenomenon based upon the quantum mechanical magnetic properties of an atom's nucleus. Magnetic nuclei, like ¹H and ¹³C, absorb radiofrequency energy when placed in a magnetic field, at a frequency that depends on the field strength and is specific to the identity of the nucleus. When this absorption occurs, the nucleus is described as being in resonance. Different atoms within a molecule resonate at different frequencies at a given field strength, and observing the resonance frequencies of a molecule reveals structural information about it. This dependence of resonance frequency on the environment of the nucleus is known as the chemical shift and is the most important characteristic of a nucleus in terms of NMR. The shift of an individual atom depends on its atomic properties, such as the type of nucleus, its hybridization state and the overall electronic environment surrounding the nucleus (bonds, conjugation networks etc.).

In an NMR spectrum each atom of interest is associated with a peak characterized by a chemical shift, the intensity of the peak and one or more coupling constants associated with interacting nuclei. Since nuclei are themselves little magnets, they can influence each other and change the energy of nearby nuclei as they resonate. This phenomenon is known as spin–spin coupling. The most important type of interaction is scalar coupling, which occurs between two nuclei, is mediated through chemical bonds, and can be seen up to three bonds away. Longer-range couplings can also be observed.

For a complex molecule a spectrum can be a complex distribution of chemical shifts and homonuclear and heteronuclear couplings, and can be regarded as a fingerprint of the structure. A single spectrum commonly contains sufficient information to elucidate a chemical structure. For a complex structure, a series of different spectra and other forms of analytical data are generally required.

Computer-based approaches to the analysis of complex data sets have revolutionized chemistry, physics and biology, and the application of computers to structure elucidation problems is a natural extension. A comprehensive review of contemporary computer-based structure elucidation techniques can be found in [1]. An obvious first step when determining the structure of a compound from its NMR spectra is to search available NMR databases. There are numerous databases available, both free and commercial, allowing users to search by chemical shift information and to identify consistent chemical structures. If the structure is not found among known structures, then the procedure for computer-based structure elucidation consists of two parts: generating all possible isomers and choosing the structure that best matches the experimental data. All possible isomers of a structure can be generated from its molecular formula, but the number of structures obtained in this case is usually impractically large. A better approach is to hypothesize a set of possible isomers using information from mass spectrometry data, appropriate 2D NMR correlations, user input regarding reaction materials or prior knowledge of the system under study, etc. In both cases, the user will then interpret the available information or acquire additional supporting data to whittle down the list of isomers to a final structure consistent with all of the data.

If spectral data are incomplete, whether by nature or because of the poor quality of the experimental data, as often occurs in everyday situations, more than one structure can correspond to the spectral data. The challenging problem of selecting the best structure from several candidates then arises. Generally, the candidate structures are quite similar to one another and match the spectral data about equally well. To resolve this challenge, comparison of the calculated and experimental ¹³C NMR spectra of the generated structures is the most powerful method available [1]. To determine numerically how likely a structure is, the calculated and experimental chemical shifts are compared atom by atom. The candidate isomers are then ranked by the differences between predicted and experimental shifts, and the structure with the smallest difference is considered the correct one.
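The ranking step can be illustrated with a minimal Python sketch. It assumes that each candidate's predicted shifts are already listed in the same atom order as the experimental shifts, and it uses the mean absolute deviation as the difference measure; both choices, the function names and the numbers are illustrative assumptions rather than details taken from the article.

```python
from statistics import mean


def mean_abs_deviation(predicted, experimental):
    """Mean absolute difference (ppm) between two paired lists of shifts."""
    if len(predicted) != len(experimental):
        raise ValueError("shift lists must have the same length")
    return mean(abs(p - e) for p, e in zip(predicted, experimental))


def rank_candidates(candidates, experimental):
    """Return (name, deviation) pairs sorted from best to worst match.

    `candidates` maps a candidate name to its predicted shifts, given in the
    same atom order as `experimental` (an assumption of this sketch).
    """
    scored = [
        (name, mean_abs_deviation(shifts, experimental))
        for name, shifts in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Illustrative numbers only, not measured or published data.
experimental = [14.1, 22.7, 31.9, 62.0]
candidates = {
    "isomer_A": [14.0, 23.0, 32.5, 61.5],
    "isomer_B": [18.0, 25.0, 40.0, 70.0],
}
ranking = rank_candidates(candidates, experimental)
print(ranking[0][0])  # name of the best-matching candidate
```

In practice the assignment of experimental peaks to atoms is itself uncertain, so a real implementation would need a matching step before the comparison.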

Three different classes of algorithms have been developed to predict chemical shifts. Database-based algorithms search for similar structures in a large database of assigned chemical structures. When similar structures are found it is possible to calculate the chemical shifts of the atoms of interest. The quality of prediction strongly depends on both the database size and the diversity of structures contained in the database. The Hierarchical Organization of Spherical Environments (HOSE-code [2]) is the most popular method to predict chemical shifts. It is based on atom-centered encoding of a structure. For each atom in a queried structure the algorithm describes the environment of the atom in order to find atoms with a similar environment in a database. The shift value is determined from the information recovered and is an aggregated chemical shift value, commonly an average or weighted average. An obvious disadvantage of such algorithms is their poor performance for diverse structures with no representative structural fragments within the database. These algorithms are also quite slow, taking of the order of seconds to tens of seconds for complex structures. Despite these limitations, database-based algorithms have been used as the basis of CSEARCH [3], [4], in commercial prediction software (for example ACD/Labs [5] and SpecInfo [6]) and in the open-source application NMRShiftDB [7].
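The lookup-and-average idea behind database-based prediction can be sketched as follows. This is not the HOSE code itself: the environment descriptor here is a hypothetical tuple of plain strings (central atom first, then one label per sphere), the fallback to shallower environments is an assumed strategy, and the records are invented for illustration.

```python
from statistics import mean


def build_index(records):
    """Index shifts by every prefix of each environment descriptor.

    `records` is an iterable of (environment, shift) pairs, where
    `environment` is a tuple of strings ordered from the central atom outwards.
    """
    index = {}
    for environment, shift in records:
        for depth in range(1, len(environment) + 1):
            index.setdefault(tuple(environment[:depth]), []).append(shift)
    return index


def predict_shift(index, environment):
    """Average the shifts of the deepest matching environment.

    Returns (shift, depth); (None, 0) means nothing in the database matched.
    """
    for depth in range(len(environment), 0, -1):
        matches = index.get(tuple(environment[:depth]))
        if matches:
            return mean(matches), depth
    return None, 0


# Invented records: labels and shifts are placeholders, not reference data.
records = [
    (("CH3", "C", "O"), 18.0),
    (("CH3", "C", "O"), 18.4),
    (("CH3", "C", "C"), 14.0),
]
index = build_index(records)

exact = predict_shift(index, ("CH3", "C", "O"))     # full-depth match
fallback = predict_shift(index, ("CH3", "C", "N"))  # only two levels match
missing = predict_shift(index, ("CH2", "C", "O"))   # no match at all
```

The `missing` case shows the weakness noted above: when no similar environment exists in the database, no prediction can be made.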

Rule-based or incremental algorithms also require databases of structures with associated chemical shifts in order to derive a set of empirical rules. The approach was introduced in 1964 by Grant and Paul [8] and was later extended and applied in a variety of software packages [9], [10], [11], [12], [13], [14]. A molecule can be described as a combination of either atoms or larger fragments. Fragment-based methods are generally more attractive from a chemist's point of view, but they have their own disadvantages. Specifically, the procedure for fragment identification is markedly slower than simply assigning atom types. Moreover, the description can be ambiguous in the case of overlapping fragments. As a result of increased computational power, it became possible to calculate rules not only via least-squares regression approaches [15], [16] but also using artificial neural networks [17], [18], [19], [20]. The relative advantages and disadvantages of these methods remain a subject of debate [21].
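How additive increments can be derived by least-squares regression is sketched below with NumPy. The descriptor columns, the increment values and the base shift are all made up for the demonstration, and the "observed" shifts are synthetic, generated from those made-up values so that the fit can be checked; nothing here reproduces published parameters.

```python
import numpy as np

# Hypothetical descriptors: each row is one carbon atom, each column counts
# neighbouring atoms of one type (for example alpha-C, beta-C, alpha-O).
rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(40, 3)).astype(float)

true_increments = np.array([9.0, 7.0, 45.0])  # made-up values for the demo
base_shift = -2.0                             # made-up constant term
shifts = base_shift + counts @ true_increments  # synthetic "observed" shifts

# Least-squares fit of: shift = base + sum(count_i * increment_i)
design = np.column_stack([np.ones(len(counts)), counts])
coef, *_ = np.linalg.lstsq(design, shifts, rcond=None)
fitted_base, fitted_increments = coef[0], coef[1:]


def predict(count_row):
    """Predict a shift from atom-type counts using the fitted increments."""
    return fitted_base + np.dot(count_row, fitted_increments)


print(np.round(fitted_increments, 3), round(float(fitted_base), 3))
```

Once the increments are fitted, prediction is a single dot product per atom, which is why atom-based additive schemes can be very fast.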

Chemical shifts can also be calculated using quantum mechanical approaches. This method does not rely on empirical knowledge and is the only option for hypothetical or unique structures [22]. Unfortunately, the time required for calculations, 3–24 h for one structure, is prohibitively long for routine applications.

The speed of prediction becomes a very important criterion for general application (see review [1] and refs. [23], [24]) as the number of isomers can be huge. For example, in a real-world task [25] for an unknown structure with molecular formula C₂₉H₃₂N₂O₅ and with both 1D and 2D NMR spectral data available, a complete set of possible structures (which turned out to number more than 12,000) was generated in 20 min. Predicting the ¹³C NMR spectra for all structures required 24 h, which is far too long, especially when compared with the generation time. An appropriate algorithm to predict NMR spectra needs to be both as precise as possible, to provide good structure ranking, and fast enough to process a large number of structures within a reasonable time.
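The scale of the problem can be made concrete with a short calculation. It assumes exactly 12,000 structures (the text says "more than 12,000"), 29 carbon atoms per structure as given by the molecular formula, and the rate of 10,000 shifts per second quoted in the Conclusions as an upper bound; the result is therefore only a rough estimate.

```python
structures = 12_000            # lower bound taken from the example
carbons_per_structure = 29     # from the molecular formula C29H32N2O5
reported_seconds = 24 * 3600   # 24 h needed for all predictions

# Average time spent per candidate structure in the reported run.
seconds_per_structure = reported_seconds / structures

# Total number of 13C shifts that have to be predicted.
total_shifts = structures * carbons_per_structure

# Time for the same workload at the peak rate quoted in the Conclusions.
shifts_per_second = 10_000
projected_seconds = total_shifts / shifts_per_second

print(f"{seconds_per_structure:.1f} s per structure in the reported run")
print(f"{total_shifts} shifts in total")
print(f"about {projected_seconds:.0f} s at {shifts_per_second} shifts per second")
```

Under these assumptions the prediction step would drop from a day to well under a minute, which is shorter than the 20 min needed to generate the structures.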

In recent years, numerous papers have been published on the prediction of chemical shifts for single homogeneous classes of compounds by different methods using in-house and commercial spectroscopic databases. The structures investigated included polycyclic aromatic compounds [26], steroids [27], alkanes [28] and substituted benzenes [29]. An attempt has also been made to use an artificial neural network for predicting and classifying ¹³C chemical shifts based on functional group descriptors [30]. Such methods are limited by their lack of generality: they apply to homogeneous classes of structures only and cannot be used to predict the spectra of dissimilar compounds, as the results will not be precise enough. Most scientists would likely sacrifice some accuracy for generality in order to obtain satisfactory predictions of NMR spectra for all classes of chemical compounds. Most existing “general” methods to predict NMR spectra of diverse structures are based on HOSE-code [3], [4], [5], [6], [7]. A number of attempts have been made to develop a “general” additive scheme using linear regression: the first was based on atomic increments and was developed in 1977 [9]. Another scheme, based on fragment increments, was created in 1990 [10], [11], and an atomic increments scheme using a neural network was developed in 2002 [20]. With the present availability of NMR shift data in electronic form and the computing power available in even the most modest desktop computer, the development of a new generalized additive scheme is quite reasonable.

It is obvious that the accuracy of chemical shift calculations depends on the extent of the diversity of the structures taken into account. We were interested in the development of a universal algorithm using a database of approximately 2 million chemical shifts of chemically different structures available to us. In this article we discuss the design of an atom-based regression scheme with prediction performance in both speed and quality sufficient to be utilized in modern structure elucidation software.

Section snippets

Data collection

The carbon chemical shift database used in this work is comprised of approximately 2 million ¹³C chemical shift values. Spectral assignments were taken from articles published in dozens of contemporary journals. Care was taken to avoid overlap between the datasets used for training and validation. The training dataset was compiled using experimental data published from the early 1990s up until 2004. The validation dataset contained data originally published in 2005 and consisted of 170,000 ¹³C

Results and discussion

Our goal was to provide, if possible, a global optimization of the structure description parameters in order to choose the best scheme. To avoid performing unnecessary computations, some deliberately unacceptable combinations of parameters were not tested. For example, there is no reason to check a description that takes into account the substituents in the first sphere only.

One round of computations aimed to define:

1. The number of spheres sufficient to describe an atomic environment
2. The best
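The notion of "spheres" around a central atom can be illustrated with a breadth-first search over the molecular graph, grouping atoms by their topological distance (number of bonds) from the centre. The adjacency-dictionary representation, the atom labels and the omission of hydrogens are assumptions made for this sketch, not the encoding used in the article.

```python
from collections import deque


def spheres(adjacency, center, max_sphere):
    """Group atoms by bond distance from `center`, up to `max_sphere` bonds.

    Returns a list whose element k-1 holds the atoms exactly k bonds away.
    """
    distance = {center: 0}
    queue = deque([center])
    shells = [[] for _ in range(max_sphere)]
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_sphere:
            continue  # do not expand beyond the outermost sphere
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                shells[distance[neighbor] - 1].append(neighbor)
                queue.append(neighbor)
    return shells


# Heavy-atom skeleton of ethanol, C1-C2-O3 (hydrogens omitted).
ethanol = {"C1": ["C2"], "C2": ["C1", "O3"], "O3": ["C2"]}

print(spheres(ethanol, "C1", 3))  # [['C2'], ['O3'], []]
print(spheres(ethanol, "C2", 1))  # [['C1', 'O3']]
```

Increasing `max_sphere` captures more of the environment at the cost of more descriptors per atom, which is the trade-off the first item in the list refers to.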

Conclusions

An algorithm to predict up to 10,000 ¹³C chemical shifts per second with a standard deviation of 2.76 ppm has been developed. Such a high speed is achieved by using atomic increments instead of larger fragments, and a regression scheme instead of a database-based approach. The accuracy of the method is the result of careful selection of a structure encoding method and availability of a large training set.
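The abstract names partial least squares as the regression method. The sketch below is a generic single-response PLS (PLS1) fitted with NIPALS-style deflation in NumPy, applied to synthetic data; it is a textbook formulation under our own assumptions and is not claimed to be the implementation, descriptor set or number of components used in this work.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 and return (coefficients, intercept) for y ~ X @ coef + intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centred working copies
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                        # direction of maximal covariance
        w = w / np.linalg.norm(w)
        t = Xk @ w                           # scores along that direction
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        q = (yk @ t) / tt                    # y loading
        Xk = Xk - np.outer(t, p)             # deflate X
        yk = yk - q * t                      # deflate y
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    W = np.array(weights).T
    P = np.array(loadings).T
    q_vec = np.array(y_loadings)
    coef = W @ np.linalg.solve(P.T @ W, q_vec)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic demonstration data (not chemical shifts).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([3.0, -1.0, 0.5, 2.0]) + 10.0 + 0.01 * rng.normal(size=60)

coef, intercept = pls1_fit(X, y, n_components=4)

# With as many components as features, PLS1 coincides with ordinary least squares.
design = np.column_stack([np.ones(len(X)), X])
ols, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 3), round(float(intercept), 3))
```

Using fewer components than descriptors is what distinguishes PLS from ordinary least squares in practice, since it stabilizes the fit when descriptors are numerous and correlated.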

The final structure encoding scheme takes into account atoms separated from the central

References (39)

M.E. Elyashberg et al., Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation, Prog. Nucl. Magn. Reson. Spectrosc. (2008)
W. Bremser, HOSE — novel substructure code, Anal. Chim. Acta (1978)
J.T. Clerc et al., A minicomputer program based on additivity rules for the estimation of 13C NMR chemical shifts, Anal. Chim. Acta (1977)
A. Fürst et al., A computer program for the prediction of 13C NMR chemical shifts of organic compounds, Anal. Chim. Acta (1990)
R.B. Schaller et al., New parameters for predicting 1H NMR chemical shifts of protons attached to carbon atoms, Anal. Chim. Acta (1995)
J. Kaur et al., An approach to predict the C-13 NMR chemical shifts of acrylonitrile copolymers using artificial neural network, Eur. Polym. J. (2007)
J. Meiler et al., Using neural networks for C-13 NMR chemical shift prediction—comparison with traditional methods, J. Magn. Reson. (2002)
J.S.L.T. Militao et al., Structure validation in computer-supported structure elucidation: 13C NMR shift predictions for steroids, Chemometr. Intell. Lab. Syst. (2003)
H. Wold, Path models with latent variables: the NIPALS approach
W. Robien, CSEARCH:...
W. Robien, Das CSEARCH-NMR-Datenbanksystem, Nachr. Chem. Tech. Lab. (1998)
ACD/NMR Predictors, V 11; Advanced Chemistry Development: Toronto, Canada, 2007,...
SpecInfo, Chemical Concepts (2001)
C. Steinbeck et al., NMRShiftDB: constructing a free chemical information system with open-source components, J. Chem. Inf. Comput. Sci. (2003)
D.M. Grant et al., Carbon-13 magnetic resonance. II. Chemical shift data for the alkanes, J. Am. Chem. Soc. (1964)
E. Pretsch et al., C13shift — a computer-program for the prediction of C-13 NMR-spectra based on an open set of additivity rules, J. Chem. Inf. Comput. Sci. (1992)
R.B. Schaller et al., A computer program for the automatic estimation of 1H NMR chemical shifts, Anal. Chim. Acta (1994)
Upstream Solutions GMBH, NMR Prediction Products (SpecTool):...
D.L. Clouser et al., Simulation of the 13C nuclear magnetic resonance spectra of ribonucleosides using multiple linear regression analysis and neural networks, J. Chem. Inf. Comput. Sci. (1996)
