Chapter 12 - PubChem: Integrated Platform of Small Molecules and Biological Activities
INTRODUCTION
PubChem [1], an open repository for experimental data identifying the biological activities of small molecules, is a part of the Molecular Libraries and Imaging (MLI) component of the National Institutes of Health (NIH) Roadmap for Medical Research initiative [2]. This program includes the Molecular Libraries Screening Center Network (MLSCN), grant-supported experimental laboratories, and a shared compound repository, referred to as the Molecular Libraries Small Molecule Repository (MLSMR)
DESCRIPTION
PubChem is organized as three distinct databases: PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem Substance contains descriptions of chemical samples, provided by data depositors, and links to information on their biological activities. The description includes PubChem Compound identifiers in cases where the chemical structures of compounds in the sample are known. Links providing information on biological activity include those to PubMed [8] citations, protein 3-D structures
DATA RELATIONSHIPS
The fundamental relationships between the three PubChem databases are straightforward. PubChem Substance identifiers (SIDs) relate to PubChem Compound identifiers (CIDs) through chemical structure standardization. Each substance, if it standardizes, will have a corresponding CID that is the main “standardized” form of that substance, representing the whole structure. There may also be “component form” CIDs that include unique covalently bonded units, when the substance is a mixture, or an
INTERFACE
The primary interface to PubChem data is through the NCBI search engine, Entrez. This web-based interface is simple, yet powerful, with many features not immediately apparent to those unfamiliar with the Entrez system. This section is intended as both an introduction and a guide to the more advanced Entrez features, and the types of Entrez PubChem queries that can be performed.
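As a minimal sketch of how an Entrez PubChem query can be expressed outside the browser, the Python snippet below builds a search URL for NCBI's E-utilities ESearch service. Note the assumptions: the text above describes the web interface only, so the use of E-utilities here is an illustration rather than the chapter's own method; the helper name `build_esearch_url` and the short keys `substance`, `compound`, and `bioassay` are hypothetical; and the Entrez database names `pcsubstance`, `pccompound`, and `pcassay` are taken to correspond to the three PubChem databases. The function only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed mapping from short, hypothetical keys to Entrez database names
# for the three PubChem databases.
PUBCHEM_ENTREZ_DBS = {
    "substance": "pcsubstance",
    "compound": "pccompound",
    "bioassay": "pcassay",
}


def build_esearch_url(database, term, retmax=20):
    """Return an ESearch URL for a text query against one PubChem database.

    `database` is one of the keys in PUBCHEM_ENTREZ_DBS; `term` is an Entrez
    query string; `retmax` caps the number of identifiers returned.
    """
    if database not in PUBCHEM_ENTREZ_DBS:
        raise ValueError(f"unknown PubChem database: {database!r}")
    params = {
        "db": PUBCHEM_ENTREZ_DBS[database],
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    # urlencode percent-escapes the query so spaces and brackets are safe.
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("compound", "aspirin", retmax=5))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) should return a list of matching record identifiers, which for the compound database are CIDs. Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.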
TOOLS
We have described how PubChem databases are integrated into Entrez, enabling detailed and flexible searches across the PubChem data. Entrez, however, is essentially a text search engine and is not suited to more detailed analysis of chemical and bioassay data; such analysis must be handled by specialized applications. As the PubChem data content grows, there is an ever-increasing need for efficient methods of large-scale data management and analysis.
Researchers require the ability to
PROGRAMMATIC TOOLS
Although interactive web-based interfaces give access to all available PubChem data and functionality, they are not particularly well suited to highly repetitive or automated tasks. Without programmatic tools, tasks such as performing specific data lookups for a large number of chemical structures would be tedious, if not impossible, and a software tool that integrates with PubChem services and data would be difficult to create and maintain. With programmatic access to PubChem, data can be
DEPOSITION SYSTEM
PubChem is an open repository. Organizations may contribute information about small molecules and integrate their public resource with PubChem, in part by providing URLs back to and from their website to PubChem. The types of PubChem depositors are greatly varied with contributors from government organizations, academic groups, chemical reagent and screening library suppliers, scientific journals, scientific data publishers, physical property databases, and more. To handle this quantity and
FUTURE DIRECTIONS
Expansion and enrichment of the bioassay data are ongoing, with annotations for small molecules and drugs being added from publicly available information, such as that provided by the National Library of Medicine (NLM) or the Food and Drug Administration (FDA). With efforts from the scientific community, bioassay data are becoming better annotated by linking targets to protein classification resources or molecular pathway information. With further integration with NCBI resources such as PubMed and the
ACKNOWLEDGEMENTS
This research was supported, in part, by the Intramural Research Program of the NIH, National Library of Medicine.
REFERENCES (28)
- et al. Distributed Structure-Searchable Toxicity (DSSTox) public database network: A proposal. Mutat Res. (2002)
- et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Del. Rev. (2001)
- et al. GenBank. Nucleic Acids Res. (2007)
- et al. The Swiss-Prot Protein Knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. (2003)
- et al. The Protein Data Bank. Nucleic Acids Res. (2000)
- MMDB: Entrez's 3D-structure database. Nucleic Acids Res.
- An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier