doi:10.1016/j.knosys.2005.03.001
Copyright © 2005 Elsevier B.V. All rights reserved.
Modelling expertise for structure elucidation in organic chemistry using Bayesian networks
Michaela Hohenner, Sven Wachsmuth and Gerhard Sagerer
Applied Computer Science Group, Faculty of Technology, Bielefeld University, P.O. Box 100 131, Bielefeld, Germany
Available online 9 April 2005.
Abstract
The development of automated methods for chemical synthesis and chemical analysis has inundated chemistry with huge amounts of experimental data. To refine these data into information, the field of chemoinformatics applies techniques from artificial intelligence, pattern recognition and machine learning. A key task in organic chemistry is structure elucidation. NMR spectra can now be acquired quickly and from small samples, can be predicted with good precision, and are directly related to structural properties of the molecule. The classical approach of ranking structure candidates by comparing NMR spectra therefore works well, but because the structural space is huge, more sophisticated approaches are in demand. Bayesian networks are promising in this respect because they support reasoning in both directions: given an appropriate model, conclusions can be drawn from a spectrum about the corresponding structure or vice versa, since the same interrelations hold in both directions. This article documents the development of such a model and presents first results supporting the applicability of Bayesian networks to structure elucidation.
Keywords: Bayesian networks; Structure elucidation; NMR spectra
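The two-way reasoning described in the abstract can be illustrated with a minimal sketch of the burglary/alarm network of Fig. 2, evaluated by brute-force enumeration of the joint distribution. All probability values, as well as the names `joint` and `query`, are assumptions chosen for illustration; they are not taken from the article, and the authors' model of chemical shifts is considerably larger.

```python
from itertools import product

# Illustrative (assumed) probabilities; none of these numbers come from the article.
P_BURGLARY = 0.01
P_EARTHQUAKE = 0.02
# P(alarm = yes | burglary, earthquake)
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}
# P(radio = yes | earthquake)
P_RADIO = {True: 0.9, False: 0.0001}

VARIABLES = ("burglary", "earthquake", "alarm", "radio")


def bern(p, value):
    """Probability of a binary outcome given P(yes) = p."""
    return p if value else 1.0 - p


def joint(burglary, earthquake, alarm, radio):
    """Joint probability, factorised along the edges of the network."""
    return (bern(P_BURGLARY, burglary)
            * bern(P_EARTHQUAKE, earthquake)
            * bern(P_ALARM[(burglary, earthquake)], alarm)
            * bern(P_RADIO[earthquake], radio))


def query(target, evidence):
    """P(target = yes | evidence), summing over all consistent assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # assignment contradicts the evidence
        p = joint(*values)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


# Diagnostic direction: from observed effects back to a possible cause.
p_alarm_only = query("burglary", {"alarm": True})
p_alarm_and_radio = query("burglary", {"alarm": True, "radio": True})
# Causal direction: from a known cause forward to its effect.
p_alarm_given_burglary = query("alarm", {"burglary": True})
```

With these assumed numbers, adding the radio report lowers the probability of a burglary, because the earthquake then explains the alarm. The same joint distribution answers both the diagnostic and the causal query, which is the property the abstract relies on for relating spectra and structures. Enumeration is only practical for very small networks.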
Fig. 1. Molecular structure and 13C NMR spectrum of caffeine (C8H10N4O2).
Fig. 2. Example of a Bayesian network adapted from [10]. Let all variables be binary with states ‘yes’ and ‘no’. A BURGLARY causes the ALARM to go off, but the alarm can also be triggered by an EARTHQUAKE. Only the earthquake, however, will cause a RADIO report. Entering the diagnostic evidence e_d1 and e_d2 concerning the alarm and the radio allows the likelihood of a burglary to be computed. Causal evidence e_c can easily be integrated, e.g. by adding a variable representing that a WEALTHY NEIGHBOURHOOD is likely to increase the risk of a burglary.
Fig. 3. Exemplary compound depicting how the molecular environment is organised into spheres. The first sphere consists of all atoms one bond away from the atom in focus (direct neighbours), the second sphere consists of all atoms two bonds away.
Fig. 4. Schematic representation of benzene derivatives. In benzene itself, positions a–f are occupied by hydrogen atoms. There are n^6 combinations of n different admissible substituents (some of which are identical due to symmetry).
Fig. 5. Initial building blocks of the causal model. Aromatic carbon is the hypothesis variable (its states are shown in the box on the right), while the other variables represent the information available to draw conclusions from. The variables at the top are all binary with states ‘yes’ and ‘no’; the states of the peak position variable (shown in the left box) represent a discretisation of the spectrum's x-axis.
Fig. 6. The carbons of the benzene ring are named according to their relative positions. The carbon in focus is called the ipso carbon; its direct neighbours are called the ortho positions. As in Fig. 3, the first and second spheres are distinguished by different shades of gray.
Fig. 7. The model of the chemical shift of an aromatic carbon now contains mediating variables representing the contributions of the first and second sphere atoms to the position of the observed peak. Furthermore, the second sphere is subdivided into ipso and ortho portions. All atom combinations depend on the molecular formula, while all chemical shift increments depend on the atoms in the corresponding relative positions.
Fig. 8. Step 3: assembling the full substitution pattern. The three-membered ring fragments (box at the top) resulting from step 2 are given as input. Proceeding clockwise, matching fragments (parts that must match are shaded in gray) are added. If several fragments qualify, the one with the best probability score is chosen. Successful matches are shown by a dotted line.