Elsevier

Bioorganic & Medicinal Chemistry

Volume 20, Issue 18, 15 September 2012, Pages 5324-5342

Early phase drug discovery: Cheminformatics and computational techniques in identifying lead series

https://doi.org/10.1016/j.bmc.2012.04.062

Abstract

Early drug discovery processes rely on hit finding procedures followed by extensive experimental confirmation in order to select high priority hit series which then undergo further scrutiny in hit-to-lead studies. The experimental cost and the risk associated with poor selection of lead series can be greatly reduced by the use of many different computational and cheminformatic techniques to sort and prioritize compounds. We describe the steps in typical hit identification and hit-to-lead programs and then describe how cheminformatic analysis assists this process. In particular, scaffold analysis, clustering and property calculations assist in the design of high-throughput screening libraries, the early analysis of hits and then organizing compounds into series for their progression from hits to leads. Additionally, these computational tools can be used in virtual screening to design hit-finding libraries and as procedures to help with early SAR exploration.

Introduction

The drug discovery and development path from novel drug target to FDA-approved drug to marketing of a novel molecular entity (NME) is long and expensive. The recent approval of Factor Xa anticoagulation drugs illustrates the potential length of time from biochemical target discovery (the 1970s) to the entry of investigational new drugs (IND) into the clinic (late 1990s) and finally approval of Factor Xa-based NMEs (2010).1 In order to increase drug discovery efficiency, new technologies such as high-throughput screening (HTS) and fragment-based drug discovery are used to identify molecules (hits) with a given biological activity. A greater understanding of new biochemical targets through genomics and chemical biology has also increased the number of novel drug targets for which biological screens are being developed. The combination of new biological screening technology and the information-rich environment created from modern computational methods combined with early in vitro ADMET tools provides an opportunity to speed the process of identifying new drug candidates. The first challenge is selecting the drug target and identifying hit compounds that have the best chance of forming the basis for chemical optimization and then ultimately translating into INDs and NMEs. High-throughput screening of diverse libraries is an often-used approach to identify new lead series. However, the fairly low bar of identifying new hit series often proves elusive, as illustrated by Macarron et al.,2 who reported that 50–84% of HTS campaigns make it to the chemical optimization stage. In this review, we will describe computer-based techniques that organize the data in a chemically intuitive fashion and which we believe will help to improve the success rate in hit-finding programs (such as virtual screening and HTS).

The fields of cheminformatics and computational chemistry are intertwined. In this issue, the editor has provided a thorough review of the definitions of the many tools that are loosely categorized under the term ‘cheminformatics’.3 In spite of the possibility of broad applications throughout early drug discovery processes, often cheminformatics has been limited to the large-scale calculation of physical properties and then their application as filters for library design and experimental results. Clearly, filtering by biological and calculated properties is now a useful part of the screening campaigns, hit selection and other early drug discovery steps. However, these filtering steps are often performed without regard to the chemical relationships between data points. Collecting and sorting biological data by core-common substructures is a natural approach in lead optimization as it is at the core of traditional structure–activity relationships (SAR) in the medicinal chemistry toolbox. Grouping compounds according to their chemical similarity, clustering and chemical scaffolding calculations can also be valuable to help design screening libraries and to analyze very crude data. When experimental data is grouped with consideration for chemical relationships, we have found that some screening campaigns identify higher quality or more attractive chemical series for lead optimization.
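
As a rough illustration of grouping compounds by chemical similarity, the sketch below clusters toy fingerprints with a simple leader algorithm. The Tanimoto coefficient, the 0.6 threshold, the leader scheme and the compound names are all assumptions chosen for illustration; the review does not prescribe a particular similarity measure or clustering method.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(
    fingerprints: Dict[str, FrozenSet[int]], threshold: float = 0.6
) -> List[List[str]]:
    """Put each compound in the first cluster whose leader is similar enough.

    A compound that matches no existing leader starts a new cluster and
    becomes its leader. The result depends on input order.
    """
    clusters: List[Tuple[FrozenSet[int], List[str]]] = []
    for name, fp in fingerprints.items():
        for leader_fp, members in clusters:
            if tanimoto(fp, leader_fp) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]


# Hypothetical fingerprints: each integer stands for one structural feature bit.
toy_fingerprints = {
    "cpd_A": frozenset({1, 2, 3, 4, 5}),
    "cpd_B": frozenset({1, 2, 3, 4, 6}),
    "cpd_C": frozenset({10, 11, 12}),
    "cpd_D": frozenset({10, 11, 12, 13}),
}

clusters = leader_cluster(toy_fingerprints, threshold=0.6)
print(clusters)
```

In practice the fingerprints would come from a cheminformatics toolkit, and the threshold would be tuned so that clusters correspond to recognizable chemical series.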

We begin this review with the description of a generic process for identifying hits and turning them into leads. We continue with an overview of useful cheminformatic techniques to design compound collections which result in an improved probability of identifying screening hits. Next, we emphasize the role of computational and cheminformatic tools that assist in the selection, validation and prioritization of screening hits into series and the SAR tools that will guide the transformation of these prioritized hit series into leads. We especially consider the integration of computed values with various experimental methods, a crucial area to de-risk the process leading all the way to lead generation. Today, the successful execution of hit identification and hit-to-lead steps results in optimizable lead series and eliminates non-druggable biology targets. Cheminformatics has become crucial to ensure that drug discovery resources are applied efficiently.

Most drug discovery programs now involve the process known as ‘hit-to-lead’ (or lead generation) to develop ‘hits’ that come from an initial screen where biological activity could be validated. The criteria for the selection of compounds and when to call them a hit or a lead vary from one organization to another. Features which impact hit and lead selection also vary based on organizational preferences. Examples of other factors which impact target product profiles are competitive environment and knowledge of the standard of care in the relevant therapeutic area.

In Figure 1 we illustrate some common steps in the hit-to-lead process, with particular emphasis on the very early steps. Often, a hit-finding library is assembled for use in many different biological screens. These libraries are assembled from various physical samples, which may be purchased or may be the results of drug discovery programs. These libraries are then used as the ‘input’ to various primary screens. The screens are typically single-concentration experiments, and those samples which meet some threshold signal (preliminary hits) are then subjected to secondary screening to validate the original measurement. These validated hits will undergo various chemical analyses to assure that each positive biological response is due to a single chemical structure. The hit selection process prioritizes and organizes the chemical structures so that the project team can perform rapid SAR exploration and refine the list of available lead series. This iterative process can be termed hit-to-lead. We use the term ‘hit’ in the phrase ‘hit-to-lead’ to describe a molecule that binds to the intended biological target and promotes the biological response above a certain threshold of desired activity in a biological screen. Hit compounds, which are intended to be valid starting points upon which additional structural modifications can be made to improve biological activity and drug properties such as off-target selectivity, pharmacodynamics and pharmacokinetic parameters, are grouped into hit series. Lead molecules are representative examples of a large series of chemical analogs developed from a common scaffold identified in the hit series. Organizations will often commit additional resources to these hit series. The terms hit and lead have many possible modifiers. Table 1 provides a summary of compound descriptions encountered in hit and hit series identification.

A preliminary hit from a primary screen will not justify significant additional resources until after validation steps. After the experimental measurement on a sample has been validated, other chemical analysis is required to confirm its identity. Even at this stage, a validated hit may require many additional studies to promote it to the status of a true ‘hit’.

The overall goal for the hit-to-lead process itself is the simultaneous selection of the many properties and characteristics that transform compounds from a hit series into a small number of compounds suitable for development and eventual use as drugs. This is a very broad description of a goal that has very specific requirements to be achieved. This generalized process involves the confirmation, expansion, selection, and transformation of the initially identified active compound series, hits from a high-throughput screen or other sources, into lead compounds possessing the chemical, biological, and physical properties suitable for lead development.4

The steps required to go from a hit to a lead compound are individualized depending on the project goals, but generally follow the sequence of: hit identification, activity validation, structure validation, clustering, hit expansion, early SAR exploration, structure–property relationship (SPR) determination and lead selection. During the process of transforming a ‘hit’, which is a known entity, to a ‘lead’, which will translate to a new drug invention, there must be a concurrent development of intellectual property. The interplay of techniques in early hit-to-lead is exemplified in Figure 1. During the process, decisions are made as data become available and compounds progress through a funnel-like process. Table 2 provides a description of the decision filters that are commonly applied during the hit-to-lead process.
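
The funnel-like progression described above can be sketched as a sequence of filters applied in order, recording how many compounds survive each stage. Everything specific here is hypothetical: the field names, the three stages and the cut-off values (50% inhibition, 10 µM, 90% purity) are placeholders for illustration and are not taken from Table 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Compound:
    name: str
    pct_inhibition: float  # single-concentration primary screen signal
    ic50_um: float         # dose-response confirmation, in micromolar
    purity_pct: float      # analytical purity of the physical sample


# A stage is a label plus a predicate that returns True to keep a compound.
Stage = Tuple[str, Callable[[Compound], bool]]


def run_funnel(
    compounds: List[Compound], stages: List[Stage]
) -> Tuple[List[Compound], List[Tuple[str, int]]]:
    """Apply each stage in order and count the survivors after every stage."""
    remaining = list(compounds)
    counts = []
    for label, keep in stages:
        remaining = [c for c in remaining if keep(c)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Hypothetical cut-offs; real projects set these per target and assay.
stages: List[Stage] = [
    ("primary screen", lambda c: c.pct_inhibition >= 50.0),
    ("activity validation", lambda c: c.ic50_um <= 10.0),
    ("structure validation", lambda c: c.purity_pct >= 90.0),
]

library = [
    Compound("h1", 82.0, 0.8, 97.0),
    Compound("h2", 35.0, 50.0, 99.0),
    Compound("h3", 65.0, 25.0, 95.0),
    Compound("h4", 71.0, 4.0, 62.0),
]

survivors, counts = run_funnel(library, stages)
for label, n in counts:
    print(f"{label}: {n} compounds remain")
```

Keeping the per-stage counts makes it easy to see where a campaign loses most of its compounds.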

Designing hit-finding libraries to facilitate hit identification

Identifying a large number of hits is the critical first step in a drug discovery program. Different organizations may have different screening technologies and chemical libraries available. A common method for identifying hits is through an HTS campaign and that is the focus of this review. However, hits are often identified from other sources, such as virtual screening, and can be combined with HTS hits. Fragment-based screening5 and natural products6 have also been excellent sources of hits

Screening metrics

The overall quality of HTS and virtual screening results should always be considered when analyzing the hit compounds. Statistical methods are commonly applied in virtual screening to evaluate the predictive power of models. These methods include the calculation of Receiver Operator Characteristic curves (ROC curves),41 various signal-to-noise calculations, and selectivity statistics to evaluate the predictive power of the virtual screen.42, 43 Most of these methods rely on the use of internal
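
One way to make the ROC statistic concrete is to compute the area under the ROC curve (AUC) for a ranked list of screened compounds. The sketch below uses the pairwise definition of AUC, which equals the area under the curve; the scores and activity labels are invented toy values, and a production analysis would normally use an established statistics library instead.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve for a virtual screen.

    Computed as the probability that a randomly chosen active (label 1)
    receives a higher score than a randomly chosen inactive (label 0),
    with ties counted as one half.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for d in inactives:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy data: higher score means the model predicts activity more strongly.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

auc = roc_auc(toy_scores, toy_labels)
print(f"ROC AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives.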

Calculated properties

Cheminformatic physical properties are usually pre-calculated and are useful to provide a profile of descriptors for each compound in a hit set. These calculations fall under three categories: (1) constitutive property descriptors, such as molecular weight, molecular formula, and heavy atom count; (2) calculated predictions of physical properties, such as solubility and log P; and (3) other properties which are not physical observables (such as polar surface area) and/or are dependent on choice of
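
The first category, constitutive descriptors, can be illustrated without any toolkit. The sketch below derives molecular weight and heavy atom count from a simple molecular formula string. It assumes formulas without brackets, charges or isotopes, and its small mass table (rounded standard atomic weights for a few common elements) is an illustrative subset only. Predicted properties such as log P or polar surface area need a structure-based model and are outside this sketch.

```python
import re
from typing import Dict

# Rounded standard atomic weights for a few common elements (illustrative subset).
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904,
}


def parse_formula(formula: str) -> Dict[str, int]:
    """Turn a flat formula such as 'C9H8O4' into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Dict[str, int] = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum of atomic masses; raises KeyError for elements not in the table."""
    return sum(ATOMIC_MASS[el] * n for el, n in parse_formula(formula).items())


def heavy_atom_count(formula: str) -> int:
    """Number of non-hydrogen atoms."""
    return sum(n for el, n in parse_formula(formula).items() if el != "H")


aspirin = "C9H8O4"
aspirin_mw = molecular_weight(aspirin)
aspirin_heavy = heavy_atom_count(aspirin)
print(f"{aspirin}: MW = {aspirin_mw:.2f}, heavy atoms = {aspirin_heavy}")
```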

Developing hits into leads: Synthetic considerations and SAR analysis tools

As a result of HTS, virtual screening, high-throughput ADMET, descriptor calculations, and clustering of experimental, cheminformatic and structural data, there is a tremendous amount of biological, physical, and cheminformatic information available to the drug discovery teams. The availability of physical compounds is an important and often overlooked aspect to drug discovery. The commercial availability of starting materials and the methods for preparing the core scaffold structure are

Other factors in hit-to-lead and series prioritization

Another large factor during the hit development process is intellectual property. While individual compounds in hit-containing libraries may be clear of patent encumbrances, an estimate must be made of how broad the available IP space needs to be in order to conduct meaningful SAR. The availability of patent space can vary greatly among biological targets. While the defining reason for drug discovery is to relieve human pain, suffering and death, the research and development endeavor must also

Summary

Proceeding from a large number of hits to a few good starting points for hit-to-lead is an inherently risky process with many opportunities for poor choices. Published examples readily demonstrate the importance of careful library, assay, and cheminformatic planning before the HTS even begins. At each stage of drug discovery, the cost of experimentation increases; therefore, the use of cheminformatic, computational, and database systems is a significant aid in organizing possible lead

Acknowledgments

The authors would like to thank Bruce Molino and Robb Lewis for their input and suggestions and the referees for their review and valuable comments.

References and notes (127)

  • G. Keserü et al., Drug Discovery Today (2006)
  • R. Macarron, Drug Discovery Today (2006)
  • P. Gribbon et al., Drug Discovery Today (2005)
  • J.H. Zhang et al., J. Biomol. Screen. (1999)
  • L. Di et al., Drug Discovery Today (2006)
  • T. Hesterkamp et al., Curr. Opin. Chem. Biol. (2008)
  • C.G. Wermuth, Drug Discovery Today (2006)
  • L.M. Mayr et al., Curr. Opin. Pharmacol. (2009)
  • M. Habig et al., J. Biomol. Screen. (2009)
  • B.K. Shoichet, Drug Discovery Today (2006)
  • C. McInnes, Curr. Opin. Chem. Biol. (2007)
  • W.J. Jorgensen et al., Adv. Drug Discov. Rev. (2002)
  • C.A. Lipinski et al., Adv. Drug Delivery Rev. (1997)
  • T.J. Ritchie et al., Drug Discovery Today (2009)
  • C. Manly et al., Drug Discovery Today (2008)
  • T.J. Ritchie et al., Drug Discovery Today (2011)
  • A.L. Hopkins et al., Drug Discovery Today (2004)
  • T. Ryckmans et al., Bioorg. Med. Chem. Lett. (2009)
  • C. Abad-Zapatero et al., Drug Discovery Today (2005)
  • Weitz, J. Nat. Rev. Drug Disc. http://www.nature.com/nrd/posters/warfarin/warfarin_poster.pdf (accessed January 3,...
  • R. Macarron et al., Nat. Rev. Drug Disc. (2011)
  • Vogt, M.; Bajorath, J. Bioorg. Med. Chem., in press....
  • K.H. Bleicher et al., Nat. Rev. Drug Disc. (2003)
  • M. Congreve et al., J. Med. Chem. (2008)
  • D.J. Newman et al., J. Nat. Prod. (2003)
  • A. Smith, Nature (2002)
  • N. Malo et al., Nat. Biotechnol. (2006)
  • J.B. Baell et al., J. Med. Chem. (2010)
  • O. Roche et al., J. Med. Chem. (2002)
  • D.R. Goode et al., J. Med. Chem. (2008)
  • S.L. McGovern et al., J. Med. Chem. (2002)
  • J.A. Grant et al., J. Chem. Inf. Model. (2006)
  • A. Simeonov et al., J. Med. Chem. (2008)
  • J. Hochlowski et al., Comb. Chem. (2003)
  • B.C. Pearce et al., J. Chem. Inf. Model. (2006)
  • M.C. Wenlock et al., J. Med. Chem. (2003)
  • T.I. Oprea et al., J. Chem. Inf. Model. Sci. (2001)
  • A.M. Clark, J. Chem. Inf. Model. (2010)
  • P. Ertl et al., Cheminformatics (2009)
  • I. McFayden et al., Enhancing Hit Quality and Diversity within Assay Throughput Constraints
  • P. Willett et al., J. Chem. Inf. Comput. Sci. (1998)
  • A. Schuffenhauer et al., J. Chem. Inf. Model. (2007)
  • S.J. Wilkens et al., J. Med. Chem. (2005)
  • R. Nilakantan et al., Comb. Chem. High Throughput Screening (2002)
  • B.A. Posner et al., J. Chem. Inf. Model. (2009)
  • Y.C. Martin et al., J. Med. Chem. (2002)
  • M.A. Johnson et al., Concepts and Applications of Molecular Similarity (1990)
  • G.M. Makara, J. Med. Chem. (2007)
  • P.J. Hajduk, J. Med. Chem. (2006)
  • R. Brenk et al., ChemMedChem (2008)