Keywords
comparative genomics, Course-based Undergraduate Research Experience, CURE, structural gene annotation
This article is included in the Genomics and Genetics gateway.
This article is included in the Bioinformatics gateway.
comparative genomics, Course-based Undergraduate Research Experience, CURE, structural gene annotation
We have made minor changes to the figure to include the number of gene models rather than the number of transcripts predicted. We have also updated the language in the figure caption to include this change.
See the authors' detailed response to the review by Madeline Crosby
See the authors' detailed response to the review by Terence D. Murphy
Genome annotation requires assessing and integrating multiple lines of computational and empirical evidence. Several computational pipelines have been developed (e.g., BRAKER and MAKER) for constructing an initial set of structural gene annotations for eukaryotic genomes.1,2 Accuracy of the gene models produced by gene prediction algorithms depends on multiple biological (e.g., genome size, ploidy, repeat density, and complexity of the transcriptome) and technical factors (e.g., quality of the genome assembly, evolutionary distance from the reference species, and availability of transcriptome data).3 These factors can contribute to differing numbers of gene predictions in closely related species (Figure 1).
Evidence-based gene prediction algorithms (e.g., Gnomon7) use extrinsic evidence, such as RNA sequencing (RNA-Seq) data derived from the target species, and sequence similarity to proteins from a reference species, to predict genes within the target assembly. The advent of high-throughput RNA sequencing technologies have led to substantial improvement in the quality of gene annotations,8,9 particularly for species that lack high-quality gene annotations from a closely related reference species; however, the efficacy of assembling transcripts from RNA-Seq data depends on transcript expression levels in the specific developmental stages and tissue types that are sampled.10 Long-read RNA sequencing technologies (e.g., Iso-Seq by Pacific Biosciences and Direct RNA sequencing by Oxford Nanopore) can produce reads that span the entire transcript, which facilitates identification of alternative splicing patterns and characterization of different gene isoforms. There are, however, several challenges for producing high quality long-read data including the robustness of RNA extraction methods, bias towards short transcripts, low sequencing throughput, and low read accuracy (reviewed in Ref. 11). Additionally, past studies have shown that transcriptomes constructed from long-read RNA-Seq data have high sensitivity but low precision.12
Consequently, despite recent advances in gene prediction algorithms and the increasing availability of RNA-Seq data, gene predictions produced by computational algorithms can still benefit from manual review and refinement.13,14 This article describes a protocol, developed by the Genomics Education Partnership (GEP; https://thegep.org), to engage undergraduate students in the comparative annotation of protein-coding genes involved in the insulin signaling pathway (ISP) across 27 species of Drosophila, using D. melanogaster as the reference species (Table 1). The species/assemblies described in Table 1 will be updated periodically to reflect the scientific goals of the project, and to conform to the latest NCBI assemblies for those species.
The GEP is a collaborative effort involving 200+ institutions across the nation. It aims at incorporating active learning into the undergraduate curriculum by implementing Course-based Undergraduate Research Experiences (CUREs) focused on bioinformatics and genomics. The primary objectives of the GEP are: (1) to introduce bioinformatics and genomics into undergraduate education, (2) to offer undergraduates research opportunities, and (3) to foster an inclusive and open partnership. Since its establishment in 2006, the GEP has made significant contributions to the field, advancing our understanding of genome evolution and effective teaching practices in STEM. Through its research initiatives, over 1,100 students have actively participated as co-authors in published scientific research papers.15–27
As of January 2023, GEP students from 79 institutions have used this annotation protocol to construct 2,394 gene models across 31 Drosophila species (number of species/gene combinations that have been constructed can be found in Supplement 7). Despite differing in instructional settings and teaching modalities, this protocol ensures that GEP students use a uniform standard to construct gene models that are best supported by the available evidence. As an additional level of quality control, each gene is annotated by at least two students working independently, and the submitted gene models are then reconciled by an experienced student using the Apollo genome annotation editor.28 Reconciled gene models will typically be described in microPublication articles29 and submitted to the NCBI Third Party Annotation (TPA) database.30 Researchers can utilize the high-quality, manually curated gene models constructed by GEP students to investigate the evolution of genes and genomes (Figure 2).
Drosophila researchers and curators at FlyBase have produced high-quality, comprehensive gene annotations for D. melanogaster based on a large amount of genetic and sequencing data.31 Our protocol utilizes the gene annotations from D. melanogaster (reference species) to facilitate annotation of the protein-coding sequence (CDS) of orthologous genes in other Drosophila species (target species). In the absence of compelling evidence (e.g., RNA-Seq data) indicating significant differences in the gene model, the proposed gene model in the target species minimizes the number of changes compared to the ortholog in the reference species (i.e., construct the most parsimonious gene model assuming evolutionary conservation).
In order to generate manual annotations for the CDS of a gene in the target species, we need to (1) identify the ortholog of that gene from D. melanogaster in the target species using sequence similarity and local synteny, (2) determine the structure and approximate coordinates of each isoform and their coding exons, and then (3) refine those coordinates for each isoform. The key analysis steps are also summarized in the Results section and in the Annotation Workflow for the Pathways Project.32 A walkthrough illustrating each step of the annotation protocol from the perspective of a naive student annotator on an example gene is both available on the Pathways Project page of the GEP website (https://thegep.org/projects/pathways/) and in Ref. 33. Here we highlight the essential conceptual steps the student annotator will follow.
The annotation and reconciliation protocols described below utilize multiple bioinformatics tools that are briefly summarized in the “Data (and Software) Availability” section.
CDS annotation procedure
Project claiming
GEP faculty members select at least one D. melanogaster gene involved in the ISP (i.e., target gene in the reference genome) and one or more of the 27 Drosophila species (i.e., target species) for their students to annotate. Each gene project includes an estimated difficulty level based on the number of isoforms, number of coding exons, and evolutionary distance from the reference genome. Faculty members can take the estimated difficulty of the gene projects into consideration when selecting projects that best suit the pedagogical goals of their courses, the amount of class time devoted to the annotation project, the academic levels of their students, and specific interests in the biological function of a gene. For example, faculty members might select the same gene in multiple Drosophila species for their students to annotate (working individually or in groups) in order to teach students about conservation relative to divergence time.
Identify the ortholog
Ortholog assignment in the target species is based on the analysis of protein sequence similarity and local synteny (i.e., relative gene order and orientation within a syntenic chromosomal region) compared to the reference species.34,35 The key analysis steps for identifying the ortholog are also summarized in the extended data.36
To identify the ortholog of the target gene, the student annotator examines the genomic neighborhood surrounding the gene in both the reference species and the target species using the GEP UCSC Genome Browser (further described in the Software section below; https://gander.wustl.edu). This analysis includes identifying the nearest two upstream and downstream genes and their orientations relative to the target gene. The local synteny analysis also includes any nested genes in the locus surrounding the putative ortholog in the target species.
Locating the putative ortholog requires the student to obtain the protein sequence for the target gene in the reference species from the Gene Record Finder, and use it as the query to perform a tblastn search against the genome assembly of the target species via NCBI Web BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi). This protocol uses the tblastn program to compare the protein sequence query against a nucleotide database because amino acid sequences show higher sequence conservation than nucleotide sequences across evolutionary time. In addition, due to the degeneracy of the genetic code, sequence similarity searches at the amino acid level are more sensitive than searches at the nucleotide level.37,38 There are three possible outcomes of their tblastn search: (1) zero matches, (2) one match, or (3) more than one match. If there is a single high-quality match, that match is a good candidate for being the ortholog.
The tblastn search may report zero significant matches even if the ortholog exists in the target species due to gaps or misassemblies in the genome assembly or the lack of sequence conservation between orthologs. In the latter case, the two genes upstream and downstream of the target gene in the reference species will be used to infer the location of the target gene in the target species (i.e., local synteny). This is done for Ilp3 in D. grimshawi as shown in Figure 3A. Here, part of the Ilp3 CDS erroneously maps to the end of the Ilp1 CDS, and does not map to the location assessed as correct by conforming to the conservation of local synteny. Figure 3B also shows the change in duplication of Ilp2 in the local neighborhood of D. grimshawi (as indicated by XM_001983645 and XM_001983644 for Ilp2-p and Ilp2-d respectively). The student first identifies the orthologs to the genes flanking the target gene in the target species. If there is a locally syntenic region in the target species, then there will likely be an additional feature located between the flanking orthologs in the target species. This additional feature can then be considered as the likely ortholog to the target gene in the target species based on local synteny. Orthology should be confirmed using a blastp search of the predicted protein sequence of the feature in the target species against a database of annotated proteins for the reference species. If the location and candidate for the ortholog cannot be established using this strategy, then the ortholog may be absent from the genome assembly for the target species.
If the tblastn search reports two or more significant matches, then the ortholog assignment will be based on parsimony with the reference species. The match with the lowest E-value, highest percent identity, and highest alignment coverage to the target gene will be assigned as the putative ortholog in the target species. The student then confirms the putative ortholog assignment derived from the best tblastn match via local synteny by comparing the genes surrounding the target gene in the target and reference species. If there is no match for neighboring genes in an evolutionarily diverged species but a single match exists for the target gene, then the ortholog assignment will be determined by BLAST searches using the Reciprocal Best Hits (RBH) strategy.39
Figure 4 shows an example of how the student can use local synteny to establish the ortholog of the target gene Ilp2. The coding region of the same two genes are located upstream (i.e., Zasp67 and Ilp1) and downstream (i.e., Ilp3 and Ilp4) of Ilp2 in D. melanogaster and D. yakuba. In addition, Ilp2, Ilp3, and Ilp4 are all nested within the coding span of CG32052 in both species, which further supports the hypothesis that the genomic neighborhood is conserved, and thus, the ortholog of Ilp2 has likely been identified.
Identify the approximate coordinates of each coding exon
Once the student identifies the putative ortholog, they begin constructing the gene model by separately mapping each coding exon to determine their approximate locations and their reading frames in the target genome using the “Align two or more sequences” (bl2seq) feature provided by NCBI Web BLAST.
For each isoform of the target gene in D. melanogaster (reference genome), they perform tblastn searches of each coding exon of the isoform (query) in D. melanogaster against the scaffold which contains the putative ortholog of the target gene in the target species (subject). The amino acid sequence for each coding exon in the D. melanogaster isoform is obtained from the Gene Record Finder (further described in the Software section below). The scaffold in the target species, which contains the putative ortholog of the target gene, is identified by the “Identify the ortholog” step above. For scaffolds larger than 10 Megabases (Mb), the “From” and “To” fields under “Subject subrange” in the NCBI tblastn search interface are used to limit the size of the search region to the approximate location of the target gene (inferred from results of the “Identify the ortholog” step).
The default tblastn search parameters for NCBI Web BLAST are used in this search except for the following parameters:
1. Select the “Align two or more sequences” checkbox
2. Specify a subject subrange that corresponds to the approximate location of the target gene estimated at the “Identify the ortholog” step
3. Select “No adjustment” under “Compositional adjustments”
4. Uncheck the “Low complexity regions” filter (under “Filters and Masking”)
At the end of this process, the student usually identifies a collinear set of coordinates for most of the coding exons of the isoform. Results of the tblastn searches provide further supporting evidence for the ortholog assignment and provide anchors from which to define the search regions for small or weakly conserved coding exons.
Refine coding exon coordinates
Since tblastn only aligns to complete codons, the BLAST alignments will not include partial codons adjacent to the splice junctions. In addition, tblastn does not take the locations of potential splice sites into account when it generates the alignment. Consequently, other lines of evidence (e.g., computational gene predictions, RNA-Seq data) must be used to refine the start and end coordinates of each coding exon. The student manually refines the collinear coding exon coordinates identified above through visual inspection of the region in the Genome Browser utilizing the RNA-Seq track, other homology-based alignment algorithms, and gene predictions, as well as comparing the sequence to other Drosophila species based on whole genome multiple sequence alignments (more details about the tracks used can be found in the extended data).4
If available, the RNA-Seq data can provide empirical support for the proposed gene model in the target species. RNA-Seq data for the target species were obtained from the NCBI Sequence Read Archive (Table 140). The RNA-Seq reads were mapped against the target genome using HISAT2,41 and putative introns were inferred from the alignments of spliced RNA-Seq reads using the junctions extract command provided by RegTools.42
When defining the intron coordinates, canonical splice sites (i.e., GT donor and AG acceptor) are adhered to unless there is good evidence to the contrary, such as spliced RNA-Seq reads from the target species and conservation of non-canonical splice sites across the clade. The student also needs to ensure the donor and acceptor sites have compatible phases (i.e., the sum of the phases of the donor and acceptor sites of an intron is either zero or three) in order to maintain the open reading frame after splicing. A more detailed version of the workflow for refining the coding exon coordinates, from the perspective of a naive student, can be found within extended data.43 Figure 5 gives an example coordinate refinement to account for the complete exon boundaries considering RNA-Seq data and splice site compatibility, and Table 2 shows the comparison of the approximate coding exon coordinates determined by the tblastn searches with their refined counterparts for the Rheb gene in D. yakuba.
Verify gene model
The student uses the Gene Model Checker to verify that the refined coordinates for the gene model in the target species satisfy the biological constraints for protein-coding genes in most eukaryotes and reflect the gene structure of the D. melanogaster ortholog. The dot plot and protein alignment identify differences between the proposed gene model and the D. melanogaster ortholog, and they help to verify that their proposed gene model is the most parsimonious compared to the D. melanogaster ortholog. For more information on the Gene Model Checker, refer to the “Data (and Software) Availability” section.
The student repeats the “Identify the approximate coordinates of each coding exon” and “Refine coding exon coordinates” steps to construct gene models for each unique protein-coding isoform of their target gene and then verifies them with the Gene Model Checker.
Final submission
The student submits a file containing the coordinates of the coding exons for all isoforms in the Generic Feature Format (GFF), a file containing the transcript sequence for the coding region of all isoforms in FASTA format (FNA), and a file containing the peptide sequence for all isoforms in FASTA format (FAA). These files are generated for each isoform by the Gene Model Checker and then concatenated by the Annotation Files Merger tool described below. They also submit an Annotation Report44 to document the evidence supporting their proposed gene models.
Exceptions to the standard annotation workflow
While the standard annotation workflow provides a good starting point for the annotation of target genes in the Pathways Project, additional tools and strategies are needed to address challenges with the annotations of a subset of genes. Resolving these challenges typically require the integration of multiple lines of empirical and computational evidence. Below we describe the process for resolving two common challenges—non-canonical splice sites and assembly errors—and provide a list of other potential challenges annotators may encounter.
Non-canonical splice sites
The most common (canonical) sequence for the splice donor site is GT (GU in the pre-mRNA), and the most common sequence for the splice acceptor site is AG. Variant splice sites are termed non-canonical. For example, the GC splice donor site appears in ~0.8% (603/71,922) of the unique introns in D. melanogaster (FlyBase release 6.43). The use of a non-canonical splice site in a gene model will typically be supported by splice junction predictions derived from spliced RNA-Seq reads. The presence of a non-canonical splice site in the orthologous intron in the reference species (D. melanogaster), or in multiple Drosophila species closely related to the target species, can also be used as supporting evidence for the annotation of a non-canonical splice site.
Assembly errors
Each sequencing platform (e.g., Sanger, Illumina, PacBio, and Nanopore) has a distinct error profile45,46 that could introduce errors (e.g., base substitutions, insertions, and deletions) into the consensus sequence of an assembly. Transposons and other repetitive sequences (e.g., tandem repeats) in eukaryotic genomes can also lead to gaps and misassemblies.47 Gene annotation challenges caused by assembly errors include partial or missing genes due to gaps in the assembly, apparent frameshifts within CDSs due to extra or missing nucleotides in the consensus sequence, and errors in ortholog/paralog assignments (e.g., due to “duplications” caused by misassemblies).
The publicly available genome assemblies utilized by the Pathways Project were constructed using different sequencing technologies and assembly protocols. Since these genome assemblies have not been manually improved, they might contain assembly errors that could interfere with coding region annotations. Consequently, in cases where proposed gene models for the target species includes changes in gene structure compared to the D. melanogaster ortholog (e.g., novel/missing isoforms, exons, and/or introns), further investigations are needed to ascertain if the difference is caused by an assembly error or reflects true divergence. As part of this assessment, the student evaluates multiple lines of evidence including: (1) sequence conservation with other Drosophila species besides D. melanogaster, (2) consistency with Illumina genomic reads and RNA-Seq reads in the NCBI Sequence Read Archive (SRA), and (3) consistency with other genome assemblies for the same species.48
Other exceptions
Past studies have shown that the number of genes involved in the ISP varied in the different Drosophila species due to gene duplications and pseudogenization,49 which leads to challenges in ortholog assignments. Other annotation challenges are caused by changes in gene structure (e.g., gain or loss of coding exons and isoforms) compared to the target gene in the reference species.
Another set of challenges pertain to gene models that do not conform to the typical characteristics of protein-coding genes, including genes with a non-canonical start codon (e.g., Akt), stop codon readthrough (e.g., jim), or trans-splicing (e.g., mod (mdg4)).
Common issues in gene models prior to submission
The Gene Model Checker sometimes reports multiple “fails” for a proposed gene model because it deviates from the expected biological characteristics of most protein-coding genes (e.g., gene model with multiple in-frame stop codons). Typically, the multiple failures can be attributed to an error in the upstream regions of the proposed gene model that propagates downstream.
For example, one common cause of multiple failures is frameshifts caused by selecting incompatible donor and acceptor splice sites. When a coding exon ends in an incomplete codon, the 3’ end of that coding exon and the 5’ beginning of the next coding exon must include nucleotides that form a complete codon once the intron has been spliced out. The number of nucleotides between the end of the last complete codon and the splice donor site is defined as the phase of the splice donor site. Similarly, the number of nucleotides between the splice acceptor site and the start of the first complete codon is defined as the phase of the splice acceptor site. In order to maintain the open reading frame (ORF) after the intron has been spliced out, the sum of the donor and acceptor phases for adjacent coding exons must either be zero (i.e., no extra codon) or three (i.e., one extra codon). Selecting incompatible donor and acceptor splice sites causes a frameshift that changes the reading frames of the coding exons downstream of that splice junction. Using the incorrect reading frame to translate the downstream coding exons will likely introduce stop codons in the translation, thereby triggering multiple failures in the Gene Model Checker.
To resolve gene models with multiple fails, the student starts troubleshooting at the beginning of the gene. In many cases, correcting errors in the upstream portion of the gene model resolves the fails reported downstream.
While most of the gene models produced by GEP students using the annotation protocol described above are congruent with each other, incongruent models require further examination by a student reconciler. Reconciliation is carried out by experienced students who have received additional training under the guidance of a GEP faculty, and/or senior-scientist, mentor.
Each target gene in a target species is annotated by at least two students working independently. This quality control step is predicated on the assumption that it is relatively common for one student to make a single error but relatively rare for multiple students working independently to make the same error.
Student reconcilers look for differences in the submitted gene models, paying special attention to the three most common errors that might invalidate a model (described below), and investigate any large-scale anomalies (e.g., proposed novel isoform and missing specific exons or isoforms).
Reconciliation process
Reconciliation is performed using Apollo,28 a web-based collaborative genome annotation editor that allows reconcilers to view student-generated models alongside the evidence tracks (Figure 6) used for annotating those models.
Reconcilers evaluate the available student annotations for each isoform in conjunction with the other evidence tracks (e.g., sequence similarity, RNA-Seq data, and gene predictions) to construct the final gene model for the protein-coding isoform(s) that is best supported by the available evidence. Reconcilers then draft a microPublication describing the supporting evidence for the final gene model.29 The student annotators and their faculty mentors review and approve the article draft. Reconciled models are also used in downstream meta-analyses and deposited into the NCBI Third Party Annotation (TPA) database. Track Hubs are currently hosted on GEP servers and all the models that have been reconciled to date have Track Hub links listed in Supplement 8. The assemblies with Track Hubs can be found at: https://thegep.org/projects/pathways/track-hubs/.
Identify the ortholog
Identification of the ortholog is done using BLAST and local synteny analysis of the genomic neighborhood of the target gene. For example, to locate the Ilp2 gene (target gene) in D. yakuba (target species), a tblastn search was used to compare the protein sequence for D. melanogaster Ilp2-PA (query) against the D. yakuba DyakCAF1 genome assembly (subject). The two collinear alignment blocks that correspond to the best match to the D. melanogaster Ilp2-PA protein (137aa) are located in the 9,766,395-9,766,887 region of the D. yakuba scaffold CM000159.2 (chr3L) (Figure 7). The alignment block for the first 54aa of Ilp2-PA is located at 9,766,395-9,766,556 in frame +3, with a normalized score of 171 bits and 94% identity. The alignment block for residues 56-137 of Ilp2-PA is located at 9,766,642-9,766,887 in frame +1, with a normalized score of 261 bits and 94% identity. The joint E-value for the two collinear alignment blocks is 3e-114. The two alignment blocks account for 136aa out of 137aa of D. melanogaster Ilp2-PA (residue 55 of the protein is not covered by the two alignment blocks). Local synteny analysis shows that the genomic neighborhood surrounding Ilp2 is conserved between D. melanogaster and D. yakuba (Figure 4).
Collectively, the available evidence supports the hypothesis that the 9,766,395-9,766,887 region of the D. yakuba scaffold CM000159.2 contains the putative ortholog of Ilp2.
Identify the coordinates of each coding exon
The approximate coding exon coordinates for the target gene in the target species are defined by tblastn searches of the coding exons of the target gene in D. melanogaster against the genomic scaffold in the target species determined by the “Identify the ortholog” step. The approximate coding exon coordinates are then refined by examining the evidence tracks in the GEP UCSC Genome Browser. For example, Figure 5A shows the approximate placement of the second coding exon of the Rheb gene in the D. yakuba DyakCAF1 assembly based on the results of the tblastn search (i.e., scaffold CM000160.2 (chr3R) at 17,358,844-17,358,912 in frame +1). In Figure 5B, the refined start coordinate for the second coding exon of Rheb (i.e., at 17,358,842) was determined by RNA-Seq read coverage, splice junction predictions, gene predictions, and protein alignments evidence tracks on the GEP UCSC Genome Browser. Table 2 shows the refinement of all CDS exons of Rheb in D. yakuba. The “Difference” column indicates the necessity of manually refining BLAST derived coordinates.
Verify gene model
The final step prior to submission to the GEP is for the student to confirm the proposed gene model using GEP’s Gene Model Checker. This tool can help the annotator verify that the proposed gene model satisfies the biological constraints of most protein-coding genes and identify differences between the proposed gene model and the D. melanogaster ortholog. In Figure 8, we can see the checklist for the model of Rheb-PA in D. yakuba passed the Gene Model Checker (i.e., the proposed gene model begins with a start codon, ends with a stop codon, and the five coding exons use the canonical splice donor and acceptor sites). It is, however, possible to “pass”’ all the checks in the tool and still have an entirely incorrect model since the tool does not specifically test for protein sequence conservation or orthology.
Most common annotation errors
Reconciliation consists of reviewing two or more gene models created by student annotators working independently. The main advantage of manually curated gene models relative to computational predictions is the ability for the curator (in this case the student annotator) to evaluate and integrate across non-conforming pieces of evidence. Reconcilers provide further quality control measures as they closely scrutinize each idiosyncrasy in the gene models proposed by student annotators. The most common idiosyncrasies/errors are listed below.
Selection of incorrect splice sites
One of the most common annotation errors is the selection of splice donor and acceptor sites that are not the most parsimonious candidates compared to the target gene in the reference genome. Another type of splice site error is caused by the annotators placing too high of a priority on the use of canonical sequences for splice donor sites (GT) and splice acceptor sites (AG). While non-canonical splice sites are rare in Drosophila (i.e., found in < 1% of introns), they should be used if the non-canonical splice sites are supported by RNA-Seq data and conservation in other Drosophila species.50–52
Multiple failures in the Gene Model Checker checklist (as shown in Figure 9) are typically caused by the selection of incompatible splice sites during the “Refine coding exon coordinates” stage of the analysis. Most of these errors can be resolved by scrutiny of the coordinates for the checklist item where the first error is reported by the Gene Model Checker. Figure 10 shows the incorrect assignment of a splice site in D. eugracilis indicated by the first “fail” in the Gene Model Checker in Figure 9. The proposed splice donor site is one nucleotide away from the correct splice donor site with the canonical sequence of GT. Reconcilers are well equipped to identify the proper splice site when the student model has an error.
Missing/extra exons
Another common error is gene models with missing or extra coding exons compared to the D. melanogaster ortholog. Sometimes these proposed changes in gene structure are well supported by the data (e.g., a novel intron supported by splice junction predictions), but often the mismatch in the number of coding exons is due to the student annotator failing to fully account for all lines of evidence.
Incorrect ortholog assignment
Correct assignment of an ortholog depends on the proper configuration of the BLAST search parameters, proper interpretation of the BLAST search result, and proper consideration of local synteny compared to the reference species. If a student annotates a gene that is not the ortholog, then their gene model cannot be used to establish the final ortholog model in the target species. In instances where the student annotator has failed to annotate the correct ortholog for a given gene, the reconciler asks another member of the reconciliation team to generate a new annotation for the offending model, thus the team member creates another independent annotation to give us at least two reasonable models. This additional model is then incorporated into the standard reconciliation pipeline to identify congruence with another already submitted model.
Missing/Mislabeled Isoforms
The annotation protocol assumes that all isoforms for a given gene from D. melanogaster will also be present in the target species unless there is evidence to the contrary. Orthologous isoform names are assigned based on sharing strong protein sequence conservation and a similar coding exon structure to the D. melanogaster isoform. If a student fails to annotate an isoform model that is viable in the target species genome, the student’s model is categorized as missing an isoform. Evidence that an orthologous isoform is not present in the target species might include that there are no viable splice sites to generate the orthologous isoform and/or an exon unique to that isoform is not present. If relevant, such evidence should be documented in the student’s report form and the subsequent microPublication of the reconciled model.
Non-canonical splice sites
While non-canonical splice sites are rare in Drosophila (i.e., found in < 1% of introns), they should be included if they are supported by RNA-Seq read coverage, splice junction predictions, and conservation in other species.50–52 An example of a gene with a non-canonical spice site is sgg-RN in D. melanogaster, which has a GC splice donor site rather than the GT donor site between exons sgg:15 and sgg:18 (Figure 11). The presence of a non-canonical splice site in a proposed model will be indicated by a “Warning” label in the Gene Model Checker. If a non-canonical splice site is substantiated by multiple lines of evidence (e.g., RNA-Seq data, conservation across multiple species), the student may conclude that the non-canonical splice site is also present in the target species.
Assembly errors
Assembly errors (e.g., gaps) can affect the coding region gene annotations. The Chained Alignments between the D. melanogaster dm6 assembly and the D. pseudoobscura DpseGB3 assembly (GCA_000001765.2) show that the chico gene is split across two different scaffolds in the D. pseudoobscura DpseGB3 assembly—the first coding exon of chico is located in the D. pseudoobscura scaffold CH673091 and the rest of the gene is located in scaffold CH475478 (Figure 12). These assembly errors will be provided to the TPA database as a BankIt submission that contains a VCF (Variant Call Format) file that contains the suggested changes. The reason behind these changes will be detailed in the corresponding microPublication.
The primary goal of reconciliation is to elucidate any inconsistencies or idiosyncrasies that might occur with two independent students generating the models. The majority of student models received for the Pathways Project (n=310) agree with the final reconciled model (83%; Figure 13A). Those with errors include mislabeled or missing isoforms (38%), incorrect splice sites (41%), extra or missing exon(s) (19%), and failing to identify the proper ortholog (2%) as seen in Figure 13B.
After student models are reconciled and manuscripts are drafted, they are sent for review and publication via the microPublication journal (https://www.micropublication.org). The goal of this journal is to publish brief and novel findings whose results may lack a broader scientific narrative. The gene models generated and reconciled by students fit neatly within this description (e.g., Ref. 53).
This annotation protocol has been used by GEP faculty to engage undergraduate students in the comparative annotation of ISP genes in 27 Drosophila species. Similar protocols have previously been used by the GEP on other scientific projects.18,21,25,54 Those protocols were used as the basis to develop the protocol described here.
To ensure gene models produced by GEP students are of high quality, each gene is annotated by at least two students working independently and then reconciled by a more experienced student. This manual curation process ensures the availability of an accurate set of gene models from multiple Drosophila species for the comparative study of network architecture and the evolution of genes involved in signaling pathways (e.g., the ISP).
High-throughput computational gene annotators are extremely valuable at the scale of whole genomes. However, the annotation effort of the student annotators is of value because the students can outperform computational gene predictors at the level of an individual gene, or a small set of genes. For example, GEP students were able to annotate a conserved isoform not predicted by the NCBI RefSeq pipeline for Akt in D. arizonae (Figure 5). This isoform was not predicted computationally due to a non-canonical ACG start codon that is conserved in Akt in D. melanogaster.
Annotation and reconciliation protocols similar to the one described in this article are also currently used to investigate the expansion of the Muller F element in four Drosophila species and the evolution of venom proteins in four parasitoid wasp species. Faculty can also use this protocol to create new Course-based Undergraduate Research Experiences (CUREs) that engage students in the comparative analysis of genes involved in other metabolic and signaling pathways.
The protocol described here is focused only on the annotation of coding regions. Once additional experimental data (e.g., CAGE, RAMPAGE, and long-read RNA-Seq data) becomes available in more species, future analyses will focus on the annotation of untranslated regions and transcription start sites.
Genomics is a rapidly growing field and the research opportunities and careers in genomics are likely to increase. Therefore, the opportunity to provide students high-quality educations in genomics is required. Hands-on approaches to teaching about genomics both enhance student learning and provides us with additional high-quality datasets.16,17 The GEP provides the central support structure to aid in genomics education that enhances the ability of faculty bring genomics research into their curriculum.20 Lopatto et al. show that students benefit from learning through formative frustration and iteration within the GEP -CURE framework.26,27
Repository: NCBI
The GenBank accession numbers for the assemblies and BioProject accession numbers for the RNA-Seq datasets are available on NCBI under the accession numbers listed in Table 1. Data is available in the public domain through NCBI (https://www.ncbi.nlm.nih.gov).
Repository: FlyBase
FlyBase data used for tool creation mentioned below is publicly available through the FlyBase FTP site (https://ftp.flybase.net/releases/).
Data under license by a third party
The UCSC Genome Browser is developed by the Genomics Institute at the University of California Santa Cruz. The source code for the UCSC Genome Browser is covered by five different licenses. Most of the source code is available under the MIT License, and all the source code is freely available to non-commercial entities. The source code is available on GitHub (https://github.com/ucscGenomeBrowser/kent), and it can also be obtained through the UCSC Genome Browser Store (https://genome-store.ucsc.edu/).
Large data
The datasets displayed in the GEP UCSC Genome Browser are too large to be feasibly hosted by an F1000Research-approved repository, such as RNA-Seq read alignments and Whole-Genome Multiple Sequence Alignments. The data is available through the GEP UCSC Genome Browser and the UCSC Table Browser (https://gander.wustl.edu). Details on the datasets and tools used to construct each evidence track are available through the settings page for each track in the GEP UCSC Genome Browser. As stated under the “Repository-hosted Data” section, the genome assemblies and the RNA-Seq datasets used to construct the evidence tracks on the GEP UCSC Genome Browser are in the public domain, and they can be obtained through NCBI (https://www.ncbi.nlm.nih.gov).
Repository: GitHub
• Annotation Files Merger (https://github.com/wilsonleung-gep/annotation-files-merger)
• Dot Plot Viewer (https://github.com/wilsonleung-gep/dot-plot-viewer)
• Gene Record Finder (https://github.com/wilsonleung-gep/gene-record-finder)
• Sequence Updater (https://github.com/wilsonleung-gep/sequence-updater)
• Small Exons Finder (https://github.com/wilsonleung-gep/small-exons-finder)
All the source code in the above repositories is available under the MIT License.
• Gene Model Checker (https://github.com/wilsonleung-gep/gene-model-checker)
The Gene Model Checker is available under the GNU General Public License v3.0.
All custom software and tools generated by the GEP can be accessed from the GEP website (https://thegep.org). The annotation tools are publicly available web-based applications, thus requiring no installation on the part of the user. Versions of the tools can be found in Table 3. All the GEP annotation tools are synchronized to the same FlyBase release. The FlyBase D. melanogaster gene annotations used by the GEP annotation tools are updated twice a year (i.e., just before the start of the Fall and the Spring semesters) in order to mitigate the impact of FlyBase updates to the D. melanogaster reference gene annotations during the semester.
GEP UCSC Genome Browser
Since GEP materials are catered towards undergraduates that may have limited or no experience with programming and command-line interfaces, we endeavor to make all data easily accessible, thereby reducing the barrier to engagement in genomics research. The GEP maintains a mirror of the UCSC Genome Browser with Drosophila genomes (https://gander.wustl.edu). We created these Genome Browsers19 with multiple evidence tracks using data generated by various algorithms4 and experimental data obtained from the NCBI SRA.
Gene Record Finder
The Gene Record Finder (https://gander.wustl.edu/~wilson/dmelgenerecord/index.html) summarizes the FlyBase annotations for protein-coding genes in D. melanogaster. It provides information about the structure of each D. melanogaster gene, such as the number of isoforms (and number of isoforms with unique coding regions), as well as the amino acid sequence for each coding exon and the nucleotide sequence for each exon. The Gene Record Finder also provides exon usage maps that demarcate the exons used by each isoform. This information is used in the “Identify the approximate coordinates of each coding exon” step of the annotation protocol. While this information can be directly obtained from FlyBase, annotators can more easily retrieve the gene structure information from a single page instead of through multiple FlyBase Transcript and Polypeptide Reports.
Gene Model Checker
The Gene Model Checker (https://gander.wustl.edu/~wilson/genechecker/index.html) provides a way for annotators to check their own work when constructing gene models. It verifies that the proposed gene model satisfies basic biological constraints (e.g., maintains an ORF, uses canonical start and stop codons, and canonical splice donor and acceptor sites). It also indicates whether the number of coding exons for the proposed gene model in the target species matches that in the orthologous isoform in D. melanogaster. If there is empirical evidence indicating the gene has unusual characteristics (e.g., the use of a non-canonical start codon, stop codon read-through, or polycistronic transcripts31), students provide the supporting evidence for their claims in the Annotation Report form.
The “Dot Plot” section of the Gene Model Checker output compares the proposed gene model against the putative ortholog in D. melanogaster using protein alignment algorithms. Large gaps in the dot plot or protein alignment might indicate the selection of an incorrect splice site, missing exons, or extra exons in the proposed gene model. Note that this tool does not compare the proposed gene model against other lines of evidence, such as RNA-Seq data or computational gene predictions. The Gene Model Checker also produces the three annotation files for the proposed gene model that are required for project submission (a GTF file to define genomic feature coordinates, a transcript sequence file in FASTA format (FNA), and a peptide sequence file in FASTA format (FAA)).
The Gene Model Checker requires input of the species, assembly, and the scaffold of the putative ortholog. In addition, the Gene Model Checker requires the name of the ortholog in D. melanogaster, the set of coding exon coordinates for the gene in the target species, orientation of the gene, noting whether or not the untranslated regions are included, whether the gene model is complete (i.e., encompasses all CDSs/UTRs), and whether the genomic region containing the gene in the target species has consensus errors (e.g., Figure 8). Presently, this tool only supports the 28 Drosophila species that are currently in the GEP UCSC Genome Browser.
Small Exons Finder
The typical use case for the Small Exons Finder (https://gander.wustl.edu/%7ewilson/smallexonfinder/index.html) tool is identifying CDSs that are too small or too weakly conserved to be detected by BLAST. The tool is designed to look for ORFs that satisfy a set of biological constraints including the type of CDS (i.e., initial, internal, or terminal CDS), the phase of the donor or acceptor splice site, and the expected CDS size according to the D. melanogaster model. The Small Exons Finder then looks for ORFs in the provided sequence that conform to the aforementioned constraints. Compared to the ORF-FINDER tool developed by NCBI,55 the Small Exons Finder allows users to search for ORFs that are less than 30bp in size and can search for initial, internal, and terminal coding exons with constraints on the phases of the donor and acceptor sites.
Sequence Updater
The Sequence Updater (https://gander.wustl.edu/~wilson/sequence_updater/index.html) tool is primarily used to create a Variant Call Format (VCF) file56 to correct errors (i.e., base substitutions, insertions, and deletions) in the consensus sequence of a genome assembly. The VCF file produced by the Sequence Updater can be used with the Gene Model Checker to validate a gene model with consensus errors.
Annotation Files Merger
The Gene Model Checker produces GFF, FNA, and FAA files for each isoform. The submission pipeline requires a single GFF, FNA, and FAA file that includes all the unique isoforms in a project. For each type of annotation file, the Annotation Files Merger (https://gander.wustl.edu/%7ewilson/submissionhelper/index.php) is used to combine the annotations for all the unique isoforms in a project into a single file. The Annotation Files Merger also enables the user to view the combined GFF file as a custom track on the GEP UCSC Genome Browser.
BEDTools
BEDTools57 was used to perform genomic arithmetic and compare locations of genomic features in multiple tracks of the GEP UCSC Genome Browser.
NCBI BLAST+
The local sequence alignments produced by the tools in NCBI BLAST+ suite were used for multiple aspects of the protocol including, but not limited to, sequence annotation and local synteny assignment.
Figshare. Supplement 1.pdf, DOI: https://doi.org/10.6084/m9.figshare.21235341.32
Figshare. Supplement 2.pdf, DOI: https://doi.org/10.6084/m9.figshare.21235345.33
Figshare. Supplement 3.docx, DOI: https://doi.org/10.6084/m9.figshare.21235376.36
Figshare. Supplement 4.docx, DOI: https://doi.org/10.6084/m9.figshare.21235367.4
Figshare. Supplement 5.pdf, DOI: https://doi.org/10.6084/m9.figshare.21235343.43
Figshare. Supplement 6.docx, DOI: https://doi.org/10.6084/m9.figshare.21235380.44
Figshare. Supplement 7, DOI: https://doi.org/10.6084/m9.figshare.23600556.v1.58
Figshare. Supplement 8, DOI: https://doi.org/10.6084/m9.figshare.23600694.59
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Work done by GEP faculty and their students generated gene models that allowed us to refine our protocol. We would specifically like to acknowledge Dr. Alexa Sawa and her students at College of the Desert for providing input on initial drafts of the protocol and the reconcilers who generated models that can be used for downstream analyses.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expertise in manual and automated gene annotation techniques and supporting datasets
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expertise in manual and automated gene annotation techniques and supporting datasets
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Relevant area of expertise: gene model annotation in Drosophila melanogaster.
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Relevant area of expertise: gene model annotation in Drosophila melanogaster.
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expertise in manual and automated gene annotation techniques and supporting datasets
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 3 (revision) 13 Oct 23 |
read | |
Version 2 (revision) 31 Jul 23 |
read | read |
Version 1 23 Dec 22 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)