Supporting data for "Annotation of the Giardia proteome through structure-based homology and machine learning"
Dataset type: Software
Data released on November 19, 2018
Ansell BRE; Pope BJ; Georgeson P; Emery-Corbin SJ; Jex AR (2018): Supporting data for "Annotation of the Giardia proteome through structure-based homology and machine learning" GigaScience Database. https://doi.org/10.5524/100534
Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination, with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence, and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.
We aimed to predict protein structures across the proteome of a neglected human pathogen, Giardia duodenalis, and to stratify the predicted structures into high- and lower-confidence categories using a variety of metrics, both in isolation and in combination.
We used the I-TASSER suite to predict structural models for ~5000 proteins encoded in Giardia duodenalis and to identify their closest empirically determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching PFAM domains in query and reference peptides. Metrics output by the suite, and metrics derived from them, were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier.
We identified 1095 high-confidence models, including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (AUC = 0.977) and identified a subset of 305 "high confidence-like" models, which correspond to its false positive predictions (lower-confidence models that the classifier scored as high confidence). High-confidence models exhibited higher transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high confidence-like proteins yielded substantial new insight into mechanisms of redox balance in Giardia duodenalis, a system central to the efficacy of the limited available anti-giardial drugs.
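The stratification step and the single-metric evaluation described above can be illustrated with a minimal sketch. It is not the authors' code: the rule "query and reference share at least one Pfam domain" is an assumed reading of "matching PFAM domains", the function names `is_high_confidence` and `auc` are hypothetical, and the toy records are invented for illustration rather than taken from the dataset. The AUC is computed with the rank-sum formulation (the probability that a high-confidence model scores above a lower-confidence one), here applied to amino acid identity as the single predictor.

```python
from typing import Iterable, Sequence


def is_high_confidence(query_domains: Iterable[str],
                       reference_domains: Iterable[str]) -> bool:
    """Assumed rule: high confidence when query and reference share a Pfam domain."""
    return bool(set(query_domains) & set(reference_domains))


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy records, invented for illustration only:
# (query Pfam domains, reference Pfam domains, amino acid identity)
toy_models = [
    ({"PF00069"}, {"PF00069"}, 0.42),
    ({"PF00069", "PF07714"}, {"PF07714"}, 0.31),
    ({"PF00085"}, {"PF00462"}, 0.12),
    (set(), {"PF00069"}, 0.35),
    ({"PF00011"}, {"PF00012"}, 0.08),
]

labels = [is_high_confidence(q, r) for q, r, _ in toy_models]
identity = [ident for _, _, ident in toy_models]
identity_auc = auc(identity, labels)  # how well identity alone separates the classes
```

A combined classifier, such as the random forest mentioned in the abstract, would be evaluated with the same AUC measure using its predicted probabilities in place of a single metric.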
Additional details
Read the peer-reviewed publication(s):
- Ansell, B. R. E., Pope, B. J., Georgeson, P., Emery-Corbin, S. J., & Jex, A. R. (2018). Annotation of the Giardia proteome through structure-based homology and machine learning. GigaScience, 8(1). https://doi.org/10.1093/gigascience/giy150 (PubMed:30520990)
Additional information:
http://predictein.org/giardia_duodenalis
File Name | Description | Sample ID | Data Type | File Format | Size | Release Date | File Attributes | Download |
---|---|---|---|---|---|---|---|---|
 | | | Readme | TEXT | 5.08 kB | 2018-11-13 | MD5 checksum: 548f7af012d8917a439b9ba81a029d38 | |
 | Archival copy of the GitHub repository https://github.com/bansell/structurehomology, downloaded 8 Nov 2018. The repository contains the scripts to reproduce figures and tables in the GigaScience manuscript | | GitHub archive | archive | 23.51 MB | 2018-11-13 | MD5 checksum: 6b49722301899534e77f82403a551cbc | |
 | Protein structures predicted for 4901 Giardia duodenalis (assemblage A; strain WB) proteins, with associated ligand binding site predictions, ligand-protein complexes, closest empirically determined structural homologues (RCSB PDB reference structures) and gene ontology predictions. File hosted by FigShare, DOI: 10.26188/5bd78e3f49e3f | | External link | TAR | 9.13 GB | 2018-11-13 | | |
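The MD5 checksums listed in the file table can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is an assumption about where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with a published checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, assuming the GitHub archive was saved locally as `structurehomology-master.zip`, `matches_checksum("structurehomology-master.zip", "6b49722301899534e77f82403a551cbc")` should return `True` for an uncorrupted copy of the file as it was when the checksum was recorded.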
Funding body | Awardee | Award ID | Comments |
---|---|---|---|
Australian Research Council | AR Jex | LP120200122 | |
Jack Brockhoff Foundation (AU) | SJ Emery-Corbin | 4184 | |
Date | Action |
---|---|
November 19, 2018 | Dataset publish |
December 3, 2018 | Manuscript Link added : 10.1093/gigascience/giy150 |
November 11, 2022 | Manuscript Link updated : 10.1093/gigascience/giy150 |
November 15, 2022 | File structurehomology-master.zip updated |
November 15, 2022 | Data type for File structurehomology-master.zip updated |