Pattern Recognition Letters

Volume 45, 1 August 2014, Pages 11-16

NLP-inspired structural pattern recognition in chemical application

https://doi.org/10.1016/j.patrec.2014.02.012

Highlights

  • NLP-inspired structural pattern recognition was applied to predict chemical activity.

  • It is combinable with statistical pattern recognition with early or late fusion.

  • The new method allows searching for “structural alerts” and their combinatorial patterns.

  • A biodegradability prediction system serves to illustrate the new method.

  • The source code, data sets and the list of detected structural alerts are available from the corresponding author on request.

Abstract

In this paper we report on a new structural pattern recognition approach for in silico prediction of chemical activity. It is based on grammatical inference on strings representing chemical compounds and on the string edit distance between a chemical compound and a formal grammar generalizing an activity class. In the late 1980s Weininger published a chemical language with a very simple and natural grammar. Recently, algorithms suitable for processing this language have been developed. From modeling chemical activity with formal grammars and chemical compounds as words, a functionality is derived to search for “structural alerts”, that is, molecular substructures and their combinatorial patterns that cause a molecule to have properties of interest. A biodegradability prediction system has been constructed to serve as an example throughout the paper. The source code and various files from the experiment are available from the corresponding author on request.

Introduction

New chemicals keep appearing and require certification with respect to their activity properties. In order to substitute costly wet-lab testing, computational alternatives are sought. In addition, in silico methods are needed to guide chemical design, so that candidate chemical structures likely to exhibit unwanted properties can be discarded as cheaply and as early as possible. Biodegradability, used as the running example throughout this paper, is a measure of how readily a chemical compound can be put to use by microorganisms and decomposed in nature. It is a quality that must be taken into account in a wide range of contexts; for example, a high value is a must for modern packaging material. A good predictive model discards unsuitable candidate materials with high probability and thus cuts down the exploration space.

A fundamental concept of chemistry is that the structural characteristics of a molecule are responsible for its properties. The task of in silico prediction is as follows: given the formula or graph of a chemical compound, predict whether it is active with respect to a property of interest, for example, “Will it be toxic?” or “Will it be biodegradable?”. An early comprehensive review of the problem, with the key concepts defined and classical solutions, can be found in [1]. Twenty years of effort since then have resulted in substantial progress and distinct methodologies. Despite that, there are still calls from wet-lab practitioners for improvement of the existing predictive models [2], and active work continues in the search for good new features (“descriptors”) [3], [4].

Statistical pattern recognition is often undertaken successfully: many numeric features are extracted, for example with the DRAGON system [5], significant ones are then selected, and a classification function is trained. Statistical pattern recognition is straightforward, but may not be ideal for many realistic problems, where patterns include structural or relational information that is difficult to quantify as feature vectors.
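As a rough illustration of this statistical route (a sketch only, not the authors' pipeline; it assumes the descriptors have already been exported to a numeric matrix X with activity labels y), feature selection and classifier training could be written with scikit-learn as follows:

    # Minimal sketch of the statistical route, not the authors' pipeline.
    # X (n_samples x n_descriptors) and y are assumed to hold pre-computed
    # numeric descriptors (e.g. exported from DRAGON) and activity labels.
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def statistical_model(X, y, k=50):
        """Select the k most discriminative descriptors, then fit a classifier."""
        model = make_pipeline(SelectKBest(f_classif, k=k), SVC(probability=True))
        accuracy = cross_val_score(model, X, y, cv=5).mean()  # rough estimate
        model.fit(X, y)
        return model, accuracy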

Graphs would be an ideal representation of molecules, but unfortunately the graph space has no strong algebraic properties: it is neither a group nor a vector space. This lack of mathematical structure does not allow structural and statistical pattern recognition methods to be combined readily. Graph embedding methods bridge the expressive (and prohibitively expensive) graph-based representation to statistical machine learning models. These methods are categorized into implicit and explicit graph embedding [6].

Implicit graph embedding refers to kernel methods, e.g. [7], [8], [9]. A graph kernel is a function that can be thought of as a dot product in some implicitly existing vector space. Kernel methods handle non-linear, complex tasks in the new space, and the separating function has the form

$$f(u) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i y_i \langle \phi(u_i), \phi(u) \rangle + b \right),$$

where $u$ is a new sample for which a decision is to be output. The training set is $S = \{(u_1, y_1), \ldots, (u_l, y_l)\}$, with $(u_i, y_i) \in X \times Y$ for $i = 1, \ldots, l$, where $X$ is the inner product space and $Y = \{-1, 1\}$ is the set of class values. The solution for the hyperplane (a set of values for the $\alpha_i$) is found via quadratic programming, and the $\alpha_i$ are non-zero for support vectors only. The decision function $f$ outputs a prediction depending on which side of the hyperplane the input pattern $u$ lies. The hyperplane is

$$\sum_{i=1}^{l} \alpha_i y_i \langle \phi(u_i), \phi(u) \rangle + b = 0.$$

A good kernel maps graphs having a small edit distance to close values. A valid kernel function must be positive semi-definite.
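The decision function above can be evaluated directly once a Gram matrix for a graph kernel is available. A minimal sketch, assuming a user-supplied graph_kernel function (the graph-specific part, not shown here) and relying on scikit-learn's support for precomputed kernels:

    import numpy as np
    from sklearn.svm import SVC

    def kernel_matrix(graphs_a, graphs_b, graph_kernel):
        """Gram matrix K[i, j] = k(graphs_a[i], graphs_b[j]) for a user-supplied,
        positive semi-definite graph kernel (assumed here, not defined)."""
        return np.array([[graph_kernel(a, b) for b in graphs_b] for a in graphs_a])

    # Usage sketch, assuming train_graphs, train_labels (+1/-1) and graph_kernel exist:
    #   clf = SVC(kernel="precomputed")
    #   clf.fit(kernel_matrix(train_graphs, train_graphs, graph_kernel), train_labels)
    #   clf.predict(kernel_matrix([new_compound], train_graphs, graph_kernel))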

Implicit graph embedding is not an optimal choice if an interpretation of the decision is wanted, say by a banker or a chemist. Also, kernel expressivity is significantly lower than that of graphs, so their use implies a loss of potentially important structural information. For example, treelet kernels take into account labeled subtrees with 6 nodes, while some groups of atoms predefining typical chemical relations are bigger. Combinatorial patterns of these subtrees are also out of reach for the kernel. Sometimes a graph distance is incorporated into a kernel, e.g. in the Laplacian kernel [10], but this has complexity problems and needs to rely on an approximation.

Explicit graph embedding methods explicitly embed an input graph into a feature vector and thus enable the use of methodologies and techniques devised for vector spaces. The selection of the axes of the feature vector therefore requires a careful analysis of the dataset in order to select representative features. Explicit graph embedding can be naturally combined with statistical pattern recognition (if it needs to be separated from it at all).
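As a toy illustration of explicit embedding (the axes below are invented for the example and are not the descriptors used in this work), a molecular graph can be mapped to a fixed-length count vector, after which any vector-space method applies:

    from collections import Counter

    def embed(atoms, bonds, atom_axes=("C", "N", "O", "S", "Cl")):
        """Toy explicit embedding: map a molecular graph, given as a list of atom
        symbols and a list of bonds (pairs of atom indices), to a fixed-length
        vector of counts. The axes are illustrative, not the authors' descriptors."""
        counts = Counter(atoms)
        return [counts.get(a, 0) for a in atom_axes] + [len(bonds)]

    # Example: ethanol (CCO) has two carbons, one oxygen and two heavy-atom bonds.
    print(embed(["C", "C", "O"], [(0, 1), (1, 2)]))  # -> [2, 0, 1, 0, 0, 2]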

A conclusion from the comparison of different methods carried out in [6] is that implicit and explicit graph methods are competitive and comparable. Certainly, depending on the data some variations can appear, but they are small and should rather not be a criterion for choosing one method over the other; other considerations should be taken into account instead, such as computational complexity or how a method is intended to be used.

Our idea is a natural-language-inspired approach, in which grammars are used to create a definition of the structure of each pattern class. In the late 1980s Weininger proposed “a chemical language with a very small and natural grammar” [11], called SMILES (Simplified Molecular Input Line Entry System). Although generative grammars were not explicitly mentioned, the allusion is strong and implies that grammatical inference, in the style of the natural language processing community, might be feasible for the chemical language too. The objects under modeling, i.e. formal expressions with tree structure, have good algebraic properties [13].

Grammatical features were found to be useful for activity prediction in our earlier work [14], [15], but in both cases they were used within the explicit graph embedding scheme, alongside other features carrying key chemical information known to be highly relevant to activity, such as the graph shape of the compound and counts of the specific atomic groups predefining typical reactions.1 The method was demonstrated to be naturally combinable with statistical pattern recognition in the form of early or late fusion. Under early fusion, the edit distances are used as features together with the other graph features in the explicit scheme. Under late fusion, distinct predictive models (physicochemical and structural) compare their answers and the associated probabilities; should the models disagree, the more confident one takes over the decision.
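In outline, the two fusion schemes can be sketched as follows; the helper names are hypothetical, and the grammar distances, physicochemical features and per-model probabilities are assumed to be computed elsewhere:

    import numpy as np

    def early_fusion(grammar_distances, physicochemical_features):
        """Early fusion: edit distances to the class grammars are appended to the
        physicochemical feature vector before a single classifier is trained."""
        return np.concatenate([physicochemical_features, grammar_distances])

    def late_fusion(structural_label, structural_prob, statistical_label, statistical_prob):
        """Late fusion: if the two models disagree, the more confident one decides."""
        if structural_label == statistical_label:
            return structural_label
        return structural_label if structural_prob >= statistical_prob else statistical_label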

The contributions of this submission are:

  • an NLP-inspired structural pattern recognition approach for chemical activity prediction, and

  • a successful attempt to have expressiveness close to that of a graph without running into notorious complexity problems or poor mathematical characterizations of the objects, and

  • a functionality derivable from it to search for “structural alerts”, i.e. the substructures and their combinatorial patterns that make a molecule active.

The rest of the paper is organized as follows. Section 2 covers the chemical language SMILES. In Section 3 the proposed pattern recognition approach, based on grammatical inference and distances to grammars, is explained together with the functionality for searching for activity alerts. In Section 4 a biodegradability prediction experiment serves as an example for the previous sections.

The source code, various files used in the experiment and the list of obtained structural alerts are available from the corresponding author on request.

Section snippets

The chemical language, similar to natural language

A chemical language SMILES was designed “to represent molecular structure by a linear string of symbols, similar to natural language” [11]. The prime objectives set by the creator of the SMILES language were:

  • 1. “the graph of a chemical structure was to be uniquely described”;

  • 2. “a user-friendly structure specification was to be provided, so that all input rules could be learned quickly and naturally”.

A word in SMILES denotes a molecular structure as a graph, which is the two-dimensional picture
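As a brief illustration of the notation, a few standard SMILES strings and the graphs they encode; RDKit is used here only as a convenient, widely available parser and is not part of the original work:

    # A few standard SMILES strings. RDKit is used here merely as a widely
    # available parser to recover the molecular graph; it is not part of the
    # original work.
    from rdkit import Chem

    examples = {"ethanol": "CCO", "benzene": "c1ccccc1", "acetic acid": "CC(=O)O"}

    for name, smiles in examples.items():
        mol = Chem.MolFromSmiles(smiles)      # SMILES word -> molecular graph
        print(name, smiles, mol.GetNumAtoms(), "heavy atoms")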

Structural pattern recognition based on grammars

In order to construct a model that accounts for the characteristics of the observed objects, a formal grammar is learned from a set of observations. The process of grammar acquisition is called grammatical inference [16]. For a new observation, an edit distance measures how distinct it is from the set of other samples generalized in the grammar. In order to be able to perform classification based on distances, the distributions of distances to various recognition classes are learned from the
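A minimal sketch of the distance-based classification step: a plain Levenshtein string edit distance to class exemplars is used here as a simplified stand-in for the paper's edit distance to an inferred grammar, and the learned distance distributions are omitted:

    # Simplified sketch of distance-based classification. A plain Levenshtein
    # distance to class exemplars stands in for the edit distance to an
    # inferred grammar; the substitution is for illustration only.

    def levenshtein(a, b):
        """Standard string edit distance with unit insert/delete/substitute costs."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def classify(smiles, class_exemplars):
        """class_exemplars: dict label -> list of SMILES strings of that class.
        Predict the label whose exemplars are closest on average."""
        def mean_distance(examples):
            return sum(levenshtein(smiles, e) for e in examples) / len(examples)
        return min(class_exemplars, key=lambda label: mean_distance(class_exemplars[label]))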

An experiment

Biodegradability is expressed on a scale from 0 to 1, denoting a gradual shift in activity: from compounds that are very hard to break down to quickly decomposable ones. If the value is > 0.7 the substance is said to be “readily biodegradable”. For this study 533 compounds with their activity values were used: 164 readily biodegradable and 369 not readily biodegradable.

Regarding the categorical values, it was decided to cut the biodegradability interval [0, 1] into three intervals: [0, 0.2), [0.2, 0.7), [0.7, 1].
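This discretisation can be written down directly (a trivial sketch; the thresholds are those quoted above, while the class names are illustrative labels only):

    def biodegradability_class(value):
        """Map a biodegradability score in [0, 1] to the three intervals used in
        the experiment; the class names are illustrative labels only."""
        if value < 0.2:
            return "hard to biodegrade"        # [0, 0.2)
        if value < 0.7:
            return "intermediate"              # [0.2, 0.7)
        return "readily biodegradable"         # [0.7, 1]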

Conclusions

The contributions of this submission have been:

  • an NLP-inspired structural pattern recognition approach for chemical activity prediction,

  • a successful attempt to have expressiveness close to that of a graph without running into notorious complexity problems or poor mathematical characterizations of the objects, and

  • a functionality derivable from it to search for “structural alerts”, i.e. the substructures and their combinatorial patterns that make a molecule active.

Acknowledgements

Our appreciation goes to Gaston Gonnet for helping the linguistic methods grow stronger in the stimulating environment of CBRG, to Ricard Gavalda and Torben Hagerup for always thoughtful revisions and guidance, and to the Swiss National Science Foundation for financial support: JS and MA acknowledge grant ref. IZK0Z2-141413 (International Short Visit). Last but not least JS acknowledges the current visiting grant at the UC3M.

References (20)

  • L. Ralaivola et al., Graph kernels for chemical information, Neural Networks (2005).

  • D. Lopez et al., Error-correcting tree language inference, Pattern Recognit. Lett. (2002).

  • Z. Mihalic et al., A graph theoretical approach to structure-property relations, J. Chem. Educ. (1992).

  • T. Steger-Hartmann et al., Incorporation of in silico biodegradability screening in early drug development – a feasible approach?, J. Environ. Sci. Pollut. Res. (2011).

  • A. Koutsokas et al., From in silico target prediction to multi-target drug design: current databases, methods and applications, J. Proteomics (2011).

  • M. Sastry et al., Large-scale systematic analysis of 2D fingerprint methods and parameters to improve virtual screening enrichments, J. Chem. Inf. Model. (2010).

  • I.V. Tetko, Virtual computational chemistry laboratory – design and description, J. Comput.-Aided Mol. Des. (2005).

  • D. Conte et al., A comparison of explicit and implicit graph embedding methods for pattern recognition.

  • B. Gauzere, L. Brun, D. Villemin, M. Brun, Graph kernels based on relevant patterns and cycle information for...

  • G. Hinselman et al., Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments, Neurocomputing (2010).


This paper has been recommended for acceptance by Andrea Torsello.
