Abstract

Orthology, refining the concept of homology, is the cornerstone of evolutionary comparative studies. With the ever-increasing availability of genomic data, inference of orthology has become instrumental for generating hypotheses about gene functions crucial to many studies. This update of the OrthoDB hierarchical catalog of orthologs (http://www.orthodb.org) covers 3027 complete genomes, including the most comprehensive set of 87 arthropods, 61 vertebrates, 227 fungi and 2627 bacteria (sampling the most complete and representative genomes from over 11,000 available). In addition to the most extensive integration of functional annotations from UniProt, InterPro, GO, OMIM, model organism phenotypes and COG functional categories, OrthoDB uniquely provides evolutionary annotations including rates of ortholog sequence divergence, copy-number profiles, sibling groups and gene architectures. We re-designed the entirety of the OrthoDB website from the underlying technology to the user interface, enabling the user to specify species of interest and to select the relevant orthology level by the NCBI taxonomy. The text searches allow use of complex logic with various identifiers of genes, proteins, domains, ontologies or annotation keywords and phrases. Gene copy-number profiles can also be queried. This release comes with the freely available underlying ortholog clustering pipeline (http://www.orthodb.org/software).

INTRODUCTION

Orthology is the cornerstone of comparative genomics and gene function prediction. The availability of gene sequence data from a large variety of species is growing quickly, and the gap between such sequence data and the experimental functional data is widening. The evolutionary relatedness of genes, termed homology, can be asserted by sequence analysis, providing the means to formulate working hypotheses on gene functions from experimentation on model organisms. In turn, homologs referencing a particular ancestor have been termed orthologs (1–3). Such genes originating by speciation from an ancestral gene are most likely to retain the ancestral function (4), making orthology the most precise way to link gene functional knowledge to a much wider genomics space. Assessment of gene orthology is also instrumental for interpretation of whole-genome shotgun metagenomics (5) that is reshaping microbiology with a direct impact on future medicine (6).

The term ‘orthology’ was initially coined for a pair of species having just one common ancestor (1). Expanding this concept to a group of species (2–3,7), OrthoDB aims to identify groups of orthologous genes that descended from a single gene of the last common ancestor (LCA) of all the species considered. Such generalization includes not only genes descended by speciation from the LCA, but also all their subsequent duplications after the radiation from the LCA, i.e. co-orthologs. Applying this concept to the hierarchy of LCAs along the species phylogeny results in multiple ‘levels of orthology’ with varying granularity of orthologous groups. While it is possible to obtain more finely resolved orthologous relations for some pairs of species that radiated after the clade's LCA (i.e. referring to a younger LCA), generalization over more than two species brings greater power for integrating sparse experimental functional data.

The central role of the orthology concept prompted the development of numerous approaches and resources (8). Due to the challenges of inferring orthology and scalability of the methods, however, there are few resources (9,10) that match OrthoDB in scope (Table 1) and only a small set of available orthology delineation software (discussed below), prompting the wide use of an oversimplified approach (11) that selects only one out of possibly multiple co-orthologs.

Organism coverage of the major resources providing orthology

Table 1.
Organism coverage of the major resources providing orthology
Number of genomes
DatabaseTotalBacteriaEukaryotesOrthology levelsAvailability
OrthoDB.v830282627401270GUI, data, software
eggNOG.v436862031a238107GUI, data
KEGG-OC30982675256n.a.GUI
Number of genomes
DatabaseTotalBacteriaEukaryotesOrthology levelsAvailability
OrthoDB.v830282627401270GUI, data, software
eggNOG.v436862031a238107GUI, data
KEGG-OC30982675256n.a.GUI

aUsed to define orthologous groups.

Table 1.
Organism coverage of the major resources providing orthology
Number of genomes
DatabaseTotalBacteriaEukaryotesOrthology levelsAvailability
OrthoDB.v830282627401270GUI, data, software
eggNOG.v436862031a238107GUI, data
KEGG-OC30982675256n.a.GUI
Number of genomes
DatabaseTotalBacteriaEukaryotesOrthology levelsAvailability
OrthoDB.v830282627401270GUI, data, software
eggNOG.v436862031a238107GUI, data
KEGG-OC30982675256n.a.GUI

aUsed to define orthologous groups.

OrthoDB is one of the largest resources of orthologs in terms of number of genomes covered, and has promoted the concept of hierarchical orthology since its conception (12). In this release we re-implemented the OrthoDB website and the graphical user interface (GUI) (Figure 1). Similar to some other resources, OrthoDB provides tentative functional annotations of orthologous groups and mapping to functional categories. Notably, OrthoDB provides the most extensive collection of functional annotations of the underlying genes linked to their original sources. Gene annotation is a complicated process that is hardly feasible without automation, which in turn can introduce errors. Although in many cases OrthoDB makes such errors in the collated annotation data apparent, search results with particularly discordant annotations should be considered with caution. The evolutionary annotations of the orthologs and statistics of gene architectures remain the distinguishing features of OrthoDB.

Figure 1.

OrthoDB web user interface. The orthologous group centric results panel is on the left and the query-building panel is on the right.

COVERAGE OF EUKARYOTIC AND PROKARYOTIC GENOMES

The current update brings OrthoDB to the same level as the leading orthology resources (outlined in Table 1), covering 2627 bacterial, 227 fungal, 61 vertebrate, 25 basal metazoan genomes and the most comprehensive set of 87 arthropod genomes. Of the total of almost 10 million bacterial and over 5 million eukaryotic protein-coding genes analysed, 91% and 89% of them respectively were classified into orthologous groups. Evolutionary annotations were computed for all groups, where about 80% of the groups have functional annotations sourced from specialized resources. Since orthology is relative to the LCA, we identify orthologous groups at the major radiations within each lineage comprising 28 animal, 40 fungal and 202 bacterial levels of orthology. OrthoDB now uses the NCBI taxonomy (13) to define levels of orthology.

Protein-coding gene annotations for vertebrates were retrieved from Ensembl (14) (Release 75, February 2014). Arthropod data were retrieved from AgripestBase, AphidBase (15), BeetleBase (16), DiamondBackMoth-DB (17), Ensembl Metazoa (18), FlyBase (19), Hymenoptera Genome Database (20), NCBI (13), SilkDB (21), VectorBase (22), wFleaBase (23), as well as the i5k pilot project (24) and several other genome consortia (July 2014). Gene sets for the additional basal metazoan species were retrieved from Ensembl Metazoa (18) and the Joint Genome Institute (25) (July 2014). The fungal gene sets were sourced from UniProt (26), (February 2014 release). We retrieved over 11,000 bacterial genomes from Ensembl Bacteria (Release 22, May 2014), and selected 2627 with the most complete annotations and the best sampling of the genetic diversity using a set of universal single-copy genes and our BUSCOs pipeline (Simao et. al., submitted).

THE ALGORITHM AND SOFTWARE

With this update we provide the suite of programs for delineation of orthologous genes that was developed for, and is the basis of, the OrthoDB hierarchical catalog of orthologs. The suite includes an efficient clustering procedure scalable to thousands of genomes as well as a multi-step pipeline to handle the complete data analysis flow. The package is distributed under the BSD License from http://www.orthodb.org/software.

OrthoDB ortholog delineation is a multi-step procedure. First, best reciprocal hits (BRH) of genes between genomes are identified (which represent the shortest path through the speciation node between these genes on a distance-based gene tree). Second, matches within each genome that are more similar than the best reciprocal matches between genomes are identified (these represent gene duplications after this speciation point, i.e. co-orthologs). The third and final step involves triangulating and clustering all BRHs and in-paralogs into groups of orthologous genes. Such clusters, called orthologous groups, represent all descendants of a presumably single-gene of the LCA of all the species considered. As in previous releases, this update considers only the longest isoform per gene. Technically, the OrthoDB software suite contains two packages: (i) a collection of Bash and Python scripts that implement the multi-step data analysis pipeline and (ii) an efficient rule-based clustering of the BRHs into groups of orthologous genes written in C++. The data analysis pipeline with pluggable external software currently employs SWIPE (27), implementing full Smith–Waterman pair-wise sequence alignment algorithm, and CD-HIT (28) for identification of very similar gene copies.

BENCHMARKING

Not many methods (29–31) are available for gene orthology delineation that can scale to hundreds of genomes (Table 2, Supplementary Table S1). Usually, there is a trade-off between the precision (‘getting only the right ones’) and recall (‘getting all the right ones’), and different objectives may favour a particular compromise. For example, having insufficient precision may result in propagation of erroneous annotation or diminishing phylogenetic signal, while insufficient recall will give only a fragmented view inappropriate for comparative genomics studies as well as diminished possibilities for annotation propagation. Moreover, since orthology is defined evolutionarily and the true gene and species histories are not known, there is no clear baseline for comparison of alternative orthology predictions. One approach to benchmark alternative orthology predictors is to compare the results against a human-curated classification (32). We refer to this reference classification below as RefOGs. Although such curated orthologs comprise only a small subset of organisms and gene families (prompting discussions to what extent such a subset of challenging cases is representative of all gene families in complete genomes) and can include uncertain expert decisions, this benchmarking approach remains the most appropriate option in our view. There are four other alternatives. The first is to compare concordance of predicted orthologs with available functional annotations. However, this only describes the evolution of gene's functions rather than the evolution of genes themselves, and this measure can be variable among gene families and functions. The second is to compare concordance of gene genomic arrangements (of slowly shuffled genomes, e.g. mammals, or of gene arrangements under selection, e.g. operons in bacteria). Yet this only provides evidence of orthology, not of non-orthology. The third approach is to compare concordance with InterPro domains, or the Gene Ontology (GO) annotations frequently inferred from them, which only provide evidence of incorrect orthologous group fusions (e.g. by erroneously fused gene model predictions) since more broadly defined homologs are compared to more narrowly defined homologs (arisen only after a particular LCA). The fourth approach is to compare concordance among different methods, which is biased by technical similarities. A common benchmarking fault is to compare orthologs predicted for different sets of organisms referring to different LCAs; these are inherently different by definition. An example would be to compare orthologous groups to pair-wise orthology that does not span the most ancient radiation in the group.

Comparative performance of available orthology calling methods versus RefOGs (32)

Table 2.
Comparative performance of available orthology calling methods versus RefOGs (32)
RefOGs
MethodNum. of OGs (RefOGs=67)RefOGs with F1 ≥85%RefOGs with Presicion ≥85%RefOGs with Recall ≥85%Sum: Exact, AkinSum: Fused(events), Split(events)Sum: Complex, Missed
OrthoDB v8 (2014)11251674643: 30, 1345: 0(0), 20(45)4: 4, 0
OrthoDB v5* (2010)15642673433: 24, 989: 0(0), 30(89)4: 4, 0
OrthoMCL (2.0.8)12445644940: 30, 1051: 2(1), 20(58)5: 4, 1
COGsoft (4.2.3)16429661919: 12, 764: 0(0), 28(64)20: 19, 1
OMA (0.99t)22420661312: 8, 4134: 0(0), 31(134)24: 23, 1
RefOGs
MethodNum. of OGs (RefOGs=67)RefOGs with F1 ≥85%RefOGs with Presicion ≥85%RefOGs with Recall ≥85%Sum: Exact, AkinSum: Fused(events), Split(events)Sum: Complex, Missed
OrthoDB v8 (2014)11251674643: 30, 1345: 0(0), 20(45)4: 4, 0
OrthoDB v5* (2010)15642673433: 24, 989: 0(0), 30(89)4: 4, 0
OrthoMCL (2.0.8)12445644940: 30, 1051: 2(1), 20(58)5: 4, 1
COGsoft (4.2.3)16429661919: 12, 764: 0(0), 28(64)20: 19, 1
OMA (0.99t)22420661312: 8, 4134: 0(0), 31(134)24: 23, 1

* Used in prior benchmarking (32).

F1 is a harmonic mean of precision and recall (http://en.wikipedia.org/wiki/Sensitivity_and_specificity). RefOG events are defined as follows: ‘Exact’–having 100% of both precision and recall; ‘Akin’–having precision and recall >85% (i.e. up to 1 ‘wrong’ gene for 37% of RefOGs and up to 2 ‘wrong’ genes for another 20% of RefOGs); ‘Fused’–counting fusing events when more than one RefOG represented one method cluster with RefOG recall >85% and summed method cluster precision >85%; ‘Split’–defined symmetrically to Fused when one RefOG is represented by more than one method cluster; ‘Complex’–when the matches can not be classified into another category; ‘Missed’–when a RefOG recall <50%.

Table 2.
Comparative performance of available orthology calling methods versus RefOGs (32)
RefOGs
MethodNum. of OGs (RefOGs=67)RefOGs with F1 ≥85%RefOGs with Presicion ≥85%RefOGs with Recall ≥85%Sum: Exact, AkinSum: Fused(events), Split(events)Sum: Complex, Missed
OrthoDB v8 (2014)11251674643: 30, 1345: 0(0), 20(45)4: 4, 0
OrthoDB v5* (2010)15642673433: 24, 989: 0(0), 30(89)4: 4, 0
OrthoMCL (2.0.8)12445644940: 30, 1051: 2(1), 20(58)5: 4, 1
COGsoft (4.2.3)16429661919: 12, 764: 0(0), 28(64)20: 19, 1
OMA (0.99t)22420661312: 8, 4134: 0(0), 31(134)24: 23, 1
RefOGs
MethodNum. of OGs (RefOGs=67)RefOGs with F1 ≥85%RefOGs with Presicion ≥85%RefOGs with Recall ≥85%Sum: Exact, AkinSum: Fused(events), Split(events)Sum: Complex, Missed
OrthoDB v8 (2014)11251674643: 30, 1345: 0(0), 20(45)4: 4, 0
OrthoDB v5* (2010)15642673433: 24, 989: 0(0), 30(89)4: 4, 0
OrthoMCL (2.0.8)12445644940: 30, 1051: 2(1), 20(58)5: 4, 1
COGsoft (4.2.3)16429661919: 12, 764: 0(0), 28(64)20: 19, 1
OMA (0.99t)22420661312: 8, 4134: 0(0), 31(134)24: 23, 1

* Used in prior benchmarking (32).

F1 is a harmonic mean of precision and recall (http://en.wikipedia.org/wiki/Sensitivity_and_specificity). RefOG events are defined as follows: ‘Exact’–having 100% of both precision and recall; ‘Akin’–having precision and recall >85% (i.e. up to 1 ‘wrong’ gene for 37% of RefOGs and up to 2 ‘wrong’ genes for another 20% of RefOGs); ‘Fused’–counting fusing events when more than one RefOG represented one method cluster with RefOG recall >85% and summed method cluster precision >85%; ‘Split’–defined symmetrically to Fused when one RefOG is represented by more than one method cluster; ‘Complex’–when the matches can not be classified into another category; ‘Missed’–when a RefOG recall <50%.

The most direct comparison to RefOGs is to apply methods to only the data that was used for curating the RefOGs (i.e. the same sequences as seen by the curator) and then compare the obtained grouping to RefOGs. The comparison of alternative clustering (grouping) can be considered in a few respects: (i) as the number of special cases of group-level fusions and splits considering only matches with high precision that are less undesirable than complex matches or misses (Table 2), (ii) as the fraction of RefOGs that matched better than a certain degree of precision or recall (Table 2) and (iii) as pair-wise metrics of overall concordance between alternative clustering methods (Table 3). These results are surprisingly consistent with less direct comparison of predictions made on complete current gene sets and the mapped RefOGs to the current data (Supplementary Table S2), even though only 93% of RefOG sequences could be unambiguously mapped to current gene annotations.

Concordance on ‘Variation of Information’ between the methods and RefOGs (lower values indicate more similar classifications)

Table 3.
Concordance on ‘Variation of Information’ between the methods and RefOGs (lower values indicate more similar classifications)
ReferenceOrthoDB.v8OrthoDB.v5OrthoMCLCOGsoftOMA
Reference07.712.510.317.320.6
OrthoDB.v87.7067.712.115.4
OrthoDB.v512.5607.59.111.5
OrthoMCL10.37.77.509.913.9
COGsoft17.312.109.9010.4
OMA20.615.49.913.910.40
ReferenceOrthoDB.v8OrthoDB.v5OrthoMCLCOGsoftOMA
Reference07.712.510.317.320.6
OrthoDB.v87.7067.712.115.4
OrthoDB.v512.5607.59.111.5
OrthoMCL10.37.77.509.913.9
COGsoft17.312.109.9010.4
OMA20.615.49.913.910.40
Table 3.
Concordance on ‘Variation of Information’ between the methods and RefOGs (lower values indicate more similar classifications)
ReferenceOrthoDB.v8OrthoDB.v5OrthoMCLCOGsoftOMA
Reference07.712.510.317.320.6
OrthoDB.v87.7067.712.115.4
OrthoDB.v512.5607.59.111.5
OrthoMCL10.37.77.509.913.9
COGsoft17.312.109.9010.4
OMA20.615.49.913.910.40
ReferenceOrthoDB.v8OrthoDB.v5OrthoMCLCOGsoftOMA
Reference07.712.510.317.320.6
OrthoDB.v87.7067.712.115.4
OrthoDB.v512.5607.59.111.5
OrthoMCL10.37.77.509.913.9
COGsoft17.312.109.9010.4
OMA20.615.49.913.910.40

FUNCTIONAL AND EVOLUTIONARY ANNOTATIONS

Functional annotations are arguably the most sought-after information. The extent and detail of available functional gene annotations varies considerably and are mostly only available for genes in model organisms. OrthoDB has provided such annotations for each gene since the first release, avoiding any automated propagation of potentially spurious annotations. The practical utility of annotations at the level of orthologous groups is hard to define, however, and thus we provide some automatic annotations with a disclaimer that they should be treated as explicitly tentative without an expert validation.

SUCCINCT FUNCTIONAL DESCRIPTORS of orthologous groups are derived by summarizing frequently occurring annotation terms or phrases mapped to individual member genes.

FUNCTIONAL COG CATEGORIES were assigned to each orthologous group, whenever possible, by mapping of the GO terms (33) to manually curated COG (7) functional categories (from http://geneontology.org/external2go/cog2go). Such high-level functional descriptors are informative for creating concise functional profiles for comparative genomic and metagenomic studies (GO slim and subset guide; http://geneontology.org/page/go-slim-and-subset-guide).

GO TERMS AND INTERPRO DOMAINS are summarized over the member gene annotations. GO terms (33) for molecular function, biological process and cellular component were mapped from UniProt (26) and InterPro (34) protein domain signatures were sourced from the UniProt Archive (UniParc) and computed for new eukaryotic species that are not yet in UniParc.

DOMAIN ARCHITECTURES are presented as sequentially ordered InterPro domains from the N- to C-terminus for each member gene. This enables searches for specific domain combinations as well as facilitates visual inspection of the conservation of protein domain architectures across all members of the orthologous group.

GENE SYNONYMS AND PHENOTYPES are highlighted, whenever available, in the results table of orthologs with direct links to their respective source databases (Figure 1). The data were retrieved for selected model species from each of the major lineages: Caenorhabditis elegans from WormBase (35), Danio rerio from the Zebrafish Model Organism Database (36), Drosophila melanogaster from FlyBase (19), Mus musculus from the Mouse Genome Database (37) and Saccharomyces cerevisiae from the Saccharomyces Genome Database (38).

HUMAN DISEASES associated with particular genes from the online Mendelian inheritance in man (OMIM®) (39) resource are also mapped and linked to the original records.

ESSENTIAL GENES from 16 bacteria were retrieved from the Database of Essential Genes (40) and include Escherichia coli, Haemophilus influenza and Mycobacterium tuberculosis, and additionally from EcoGene (41) for E. coli genes.

Evolutionary annotations are computed from available genomics data. Gene families evolve under varying levels of constraint on their sequence identity and gene copy-number (42) that is presumably indicative of their rates of possible changes in functional load, and consequently the confidence of extrapolating hypotheses of gene function from experimentally studied genes. For example, functional inferences are more confident for conservative orthologous groups that show near-universal single-copy distributions and relatively low sequence divergence than for dynamic orthologous groups with patchy phylogenetic distribution, or numerous duplications, or high sequence divergence. OrthoDB provides quantifications of the following evolutionary traits:

PHYLETIC PROFILE reflecting universality and duplicability. Universality refers to the ortholog phyletic profile, i.e. genes present in all, most or only a few species. Duplicability refers to retention of gene duplicates that independently happened in all, most, few or no species. The profiles (Figure 1) therefore quantify the maintenance of orthologs across the phylogeny as well as their propensity for gene duplication throughout the evolutionary history since their LCA.

EVOLUTIONARY RATE reflecting constraints on protein sequence identity. Quantification of relative sequence conservation among orthologous genes is computed by averaging over all inter-species protein sequence identities normalized by the average identity of all BRHs for each species pair. This correlates with other evolutionary traits, e.g. sequence evolution of single-copy orthologs is more constrained than that of multi-copy orthologs, and with functional traits, e.g. orthologous groups with essential genes usually exhibit more conservative sequence evolution than those without (42).

GENE ARCHITECTURE reflecting observed variations of protein lengths and exon counts. Summary of median and standard deviation values for protein lengths (in all lineages) and exon counts (in metazoa) for all genes in each orthologous group provide ‘canonical’ gene architectures of each group. Comparing protein lengths and exon counts of each member gene to the canonical architecture can highlight deviations indicative of inaccurate gene models or dynamic intron evolution.

SIBLING GROUPS reflecting the sequence uniqueness of the orthologs. Orthologous groups presumably represent the gene content of a particular ancestral lineage, some of which may have originated by earlier gene duplications (i.e. paralogous genes) or they can share only a duplicated fragment, e.g. evolutionarily mobile domain. Such homology relations among ‘sibling’ orthologous groups are noted in this OrthoDB release by shared content of InterPro domains, requiring at least two organisms from each group to have a shared domain. Orthologous groups with no or very few siblings are unique or rare in the gene universe, while those with many siblings belong to large gene superfamilies where orthology delineation can be the most challenging.

THE WEB INTERFACE

With this release we re-implemented the OrthoDB web interface to be sustainable with data growth. The main organization has remained similar, with the query-building panel now on the right and orthologous group centric results panel on the left (Figure 1). The query-building panel allows:

TEXT SEARCHES by protein, gene, InterPro, GO identifiers, UniProt accession numbers, etc., as well as names, synonyms and functional terms or phrases (quoted). Gene annotations were sourced from UniProt and supplemented with data from specific resources for representative model organisms. The text search also allows the use of logical operator syntax to build complex queries; e.g. to optionally include variations of a term, or to exclude terms. In addition, specific protein domain architectures may be queried with a comma-separated N- to C-terminus ordered list of InterPro identifiers.

COPY-NUMBER PROFILE SEARCHES by predefined lists of specific gene copy-number phyletic profiles, such as ‘all single-copy’ or ‘all multi-copy’ orthologs.

SEQUENCE SEARCHES by BLAST homology search of user provided protein sequence in FASTA format against gene sequences used to build OrthoDB. If significant matches are found, the corresponding orthologous group closest to the root-level is returned.

ORTHOLOGY HIERARCHY LEVEL, in addition to the search options outlined above, can be specified by the user by entering/searching for species of interest or by checking/unchecking radiation nodes of the depicted NCBI species classification. As noted above, orthology is relative to a particular LCA. Therefore, results will contain broader groups of genes when ancient radiations (nodes closer to the root) are selected, and narrower gene correspondences for more closely related species. To enable the most precise comparative studies, OrthoDB has always promoted this concept of hierarchical orthologous groups by computing orthology at different phylogeny radiations.

The results panel (left) is orthologous group centric, i.e. if OrthoDB is queried using a gene identifier, the orthologous group containing this gene is returned as the result. Each orthologous group has a unique identifier (in the style of EOG8xxx and POG8xxx in this v8 release). First, when available, the collated functional annotations are displayed, including InterPro and GO terms, followed by the computed evolutionary annotations outlined above. This is followed by a list of the corresponding orthologous genes with their original annotations. The results can be printed or viewed as tab-delimited text, and the gene sequences can be viewed in FASTA format.

DATA ACCESS

As for the previous versions of OrthoDB, in addition to the web interface we provide data files for bulk download, one file per level of orthology; as well as the underlying gene sequences in FASTA format, and mapping of the genes to UniProt, NCBI and RefOGs. All data are distributed under the Creative Commons Attribution 3.0 License from http://www.orthodb.org/. Users can also navigate to OrthoDB records by following links from FlyBase ‘Orthologs’ section, UniProt ‘Phylogenomic databases’ section or NCBI ‘Additional links/ Gene LinkOut’ section.

CONCLUSIONS AND PERSPECTIVES

The rapidly growing number of sequenced genomes increases the power of comparative analyses, but also brings new challenges for the scalability of methods and the data presentation to end-users. The most significant highlights of this update are: (i) the technical revamp of the OrthoDB web interface, (ii) the introduced selection of the input data by its completeness and uniqueness, where less complete genomes or accumulating well-represented transcriptomes will only be mapped to the previously defined orthologous groups and (iii) release of the underlying orthology delineation software to enable the research community to perform more specific/private projects or to analyse organisms not yet covered. OthoDB will continue, to the best of available resources, to provide comprehensive coverage of publicly available genomes and to refine the accuracy of ortholog delineations.

The authors thank all members of the Zdobnov group for useful discussions and suggestions. Part of the computations were performed at Vital-IT (http://www.vital-it.ch).

FUNDING

Swiss Institute of Bioinformatics SER funding; Schmidheiny Foundation; Swiss National Science Foundation [31003A_143936 and PDFMP3_137064]. Funding for open access charge: University of Geneva.

Conflict of interest statement. None declared.

REFERENCES

1.
Fitch
W.M.
Distinguishing homologous from analogous proteins
Syst. Zool.
1970
, vol. 
19
 (pg. 
99
-
113
)
2.
Koonin
E.V.
Orthologs, paralogs, and evolutionary genomics
Annu. Rev. Genet.
2005
, vol. 
39
 (pg. 
309
-
338
)
3.
Sonnhammer
E.L.
Koonin
E.V.
Orthology, paralogy and proposed classification for paralog subtypes
Trends Genet.
2002
, vol. 
18
 (pg. 
619
-
620
)
4.
Gabaldon
T.
Koonin
E.V.
Functional and evolutionary implications of gene orthology
Nat. Rev. Genet.
2013
, vol. 
14
 (pg. 
360
-
366
)
5.
Arumugam
M.
Raes
J.
Pelletier
E.
Le Paslier
D.
Yamada
T.
Mende
D.R.
Fernandes
G.R.
Tap
J.
Bruls
T.
Batto
J.M.
, et al. 
Enterotypes of the human gut microbiome
Nature
2011
, vol. 
473
 (pg. 
174
-
180
)
6.
Petty
T.J.
Cordey
S.
Padioleau
I.
Docquier
M.
Turin
L.
Preynat-Seauve
O.
Zdobnov
E.M.
Kaiser
L.
Comprehensive human virus screening using high-throughput sequencing with a user-friendly representation of bioinformatics analysis: a pilot study
J. Clin. Microbiol.
2014
, vol. 
52
 (pg. 
3351
-
3361
)
7.
Tatusov
R.L.
Koonin
E.V.
Lipman
D.J.
A genomic perspective on protein families
Science
1997
, vol. 
278
 (pg. 
631
-
637
)
8.
Sonnhammer
E.L.
Gabaldon
T.
Sousa da Silva
A.W.
Martin
M.
Robinson-Rechavi
M.
Boeckmann
B.
Thomas
P.D.
Dessimoz
C.
Big data and other challenges in the quest for orthologs
Bioinformatics
2014
, vol. 
30
 (pg. 
2993
-
2998
)
9.
Nakaya
A.
Katayama
T.
Itoh
M.
Hiranuka
K.
Kawashima
S.
Moriya
Y.
Okuda
S.
Tanaka
M.
Tokimatsu
T.
Yamanishi
Y.
, et al. 
KEGG OC: a large-scale automatic construction of taxonomy-based ortholog clusters
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D353
-
D357
)
10.
Powell
S.
Forslund
K.
Szklarczyk
D.
Trachana
K.
Roth
A.
Huerta-Cepas
J.
Gabaldon
T.
Rattei
T.
Creevey
C.
Kuhn
M.
, et al. 
eggNOG v4.0: nested orthology inference across 3686 organisms
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D231
-
D239
)
11.
Ward
N.
Moreno-Hagelsieb
G.
Quickly finding orthologs as reciprocal best hits with BLAT, LAST, and UBLAST: how much do we miss
PLoS One
2014
, vol. 
9
 pg. 
e101850
 
12.
Kriventseva
E.V.
Rahman
N.
Espinosa
O.
Zdobnov
E.M.
OrthoDB: the hierarchical catalog of eukaryotic orthologs
Nucleic Acids Res.
2008
, vol. 
36
 (pg. 
D271
-
D275
)
13.
NCBI Resource Coordinators
Database resources of the National Center for Biotechnology Information
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D7
-
D17
)
14.
Flicek
P.
Amode
M.R.
Barrell
D.
Beal
K.
Billis
K.
Brent
S.
Carvalho-Silva
D.
Clapham
P.
Coates
G.
Fitzgerald
S.
, et al. 
Ensembl 2014
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D749
-
D755
)
15.
Legeai
F.
Shigenobu
S.
Gauthier
J.
Colbourne
J.
Rispe
C.
Collin
O.
Richards
S.
Wilson
A.
Murphy
T.
Tagu
D.
AphidBase: a centralized bioinformatic resource for annotation of the pea aphid genome
Insect. Mol. Biol.
2010
, vol. 
19
 
Suppl. 2
(pg. 
5
-
12
)
16.
Kim
H.
Murphy
T.
Xia
J.
Caragea
D.
Park
Y.
Beeman
R.
Lorenzen
M.
Butcher
S.
Manak
J.
Brown
S.
BeetleBase in 2010: revisions to provide comprehensive genomic information for Tribolium castaneum
Nucleic Acids Res.
2010
, vol. 
38
 (pg. 
D437
-
D442
)
17.
Tang
W.
Yu
L.
He
W.
Yang
G.
Ke
F.
Baxter
S.W.
You
S.
Douglas
C.J.
You
M.
DBM-DB: the diamondback moth genome database
Database (Oxford)
2014
pg. 
bat087
 
18.
Kersey
P.J.
Allen
J.E.
Christensen
M.
Davis
P.
Falin
L.J.
Grabmueller
C.
Hughes
D.S.
Humphrey
J.
Kerhornou
A.
Khobova
J.
, et al. 
Ensembl Genomes 2013: scaling up access to genome-wide data
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D546
-
D552
)
19.
St Pierre
S.E.
Ponting
L.
Stefancsik
R.
McQuilton
P.
Consortium
F.
FlyBase 102–advanced approaches to interrogating FlyBase
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D780
-
D788
)
20.
Munoz-Torres
M.C.
Reese
J.T.
Childers
C.P.
Bennett
A.K.
Sundaram
J.P.
Childs
K.L.
Anzola
J.M.
Milshina
N.
Elsik
C.G.
Hymenoptera Genome Database: integrated community resources for insect species of the order Hymenoptera
Nucleic Acids Res.
2011
, vol. 
39
 (pg. 
D658
-
D662
)
21.
Duan
J.
Li
R.
Cheng
D.
Fan
W.
Zha
X.
Cheng
T.
Wu
Y.
Wang
J.
Mita
K.
Xiang
Z.
, et al. 
SilkDB v2.0: a platform for silkworm (Bombyx mori) genome biology
Nucleic Acids Res.
2010
, vol. 
38
 (pg. 
D453
-
D456
)
22.
Megy
K.
Emrich
S.J.
Lawson
D.
Campbell
D.
Dialynas
E.
Hughes
D.S.
Koscielny
G.
Louis
C.
Maccallum
R.M.
Redmond
S.N.
, et al. 
VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics
Nucleic Acids Res.
2012
, vol. 
40
 (pg. 
D729
-
D734
)
23.
Colbourne
J.
Singan
V.
Gilbert
D.
wFleaBase: the Daphnia genome database
BMC Bioinformatics
2005
, vol. 
6
 pg. 
45
 
24.
i5K Consortium
The i5K Initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment
J. Hered.
2013
, vol. 
104
 (pg. 
595
-
600
)
25.
Nordberg
H.
Cantor
M.
Dusheyko
S.
Hua
S.
Poliakov
A.
Shabalov
I.
Smirnova
T.
Grigoriev
I.V.
Dubchak
I.
The genome portal of the Department of Energy Joint Genome Institute: 2014 updates
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D26
-
D31
)
26.
UniProt Consortium
Activities at the Universal Protein Resource (UniProt)
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D191
-
D198
)
27.
Rognes
T.
Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation
BMC Bioinformatics
2011
, vol. 
12
 pg. 
221
 
28.
Li
W.
Godzik
A.
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Bioinformatics
2006
, vol. 
22
 (pg. 
1658
-
1659
)
29.
Roth
A.C.
Gonnet
G.H.
Dessimoz
C.
Algorithm of OMA for large-scale orthology inference
BMC Bioinformatics
2008
, vol. 
9
 pg. 
518
 
30.
Kristensen
D.M.
Kannan
L.
Coleman
M.K.
Wolf
Y.I.
Sorokin
A.
Koonin
E.V.
Mushegian
A.
A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches
Bioinformatics
2010
, vol. 
26
 (pg. 
1481
-
1487
)
31.
Fischer
S.
Brunk
B.P.
Chen
F.
Gao
X.
Harb
O.S.
Iodice
J.B.
Shanmugam
D.
Roos
D.S.
Stoeckert
C.J.
Jr
Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups
Curr. Protoc. Bioinformatics
2011
 
doi:10.1002/0471250953.bi0612s35
32.
Trachana
K.
Larsson
T.A.
Powell
S.
Chen
W.H.
Doerks
T.
Muller
J.
Bork
P.
Orthology prediction methods: a quality assessment using curated protein families
Bioessays
2011
, vol. 
33
 (pg. 
769
-
780
)
33.
Ashburner
M.
Ball
C.A.
Blake
J.A.
Botstein
D.
Butler
H.
Cherry
J.M.
Davis
A.P.
Dolinski
K.
Dwight
S.S.
Eppig
J.T.
, et al. 
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium
Nat. Genet.
2000
, vol. 
25
 (pg. 
25
-
29
)
34.
Hunter
S.
Jones
P.
Mitchell
A.
Apweiler
R.
Attwood
T.K.
Bateman
A.
Bernard
T.
Binns
D.
Bork
P.
Burge
S.
, et al. 
InterPro in 2011: new developments in the family and domain prediction database
Nucleic Acids Res.
2012
, vol. 
40
 (pg. 
D306
-
D312
)
35.
Harris
T.W.
Baran
J.
Bieri
T.
Cabunoc
A.
Chan
J.
Chen
W.J.
Davis
P.
Done
J.
Grove
C.
Howe
K.
, et al. 
WormBase 2014: new views of curated biology
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D789
-
D793
)
36.
Howe
D.G.
Bradford
Y.M.
Conlin
T.
Eagle
A.E.
Fashena
D.
Frazer
K.
Knight
J.
Mani
P.
Martin
R.
Moxon
S.A.
, et al. 
ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D854
-
D860
)
37.
Blake
J.A.
Bult
C.J.
Eppig
J.T.
Kadin
J.A.
Richardson
J.E.
Group
M.G.D.
The Mouse Genome Database: integration of and access to knowledge about the laboratory mouse
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D810
-
D817
)
38.
Costanzo
M.C.
Engel
S.R.
Wong
E.D.
Lloyd
P.
Karra
K.
Chan
E.T.
Weng
S.
Paskov
K.M.
Roe
G.R.
Binkley
G.
, et al. 
Saccharomyces genome database provides new regulation data
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D717
-
D725
)
39.
Amberger
J.
Bocchini
C.A.
Scott
A.F.
Hamosh
A.
McKusick's Online Mendelian Inheritance in Man (OMIM)
Nucleic Acids Res.
2009
, vol. 
37
 (pg. 
D793
-
D796
)
40.
Luo
H.
Lin
Y.
Gao
F.
Zhang
C.T.
Zhang
R.
DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements
Nucleic Acids Res.
2014
, vol. 
42
 (pg. 
D574
-
D580
)
41.
Zhou
J.
Rudd
K.E.
EcoGene 3.0
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D613
-
D624
)
42.
Waterhouse
R.M.
Zdobnov
E.M.
Kriventseva
E.V.
Correlating traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi
Genome Biol. Evol.
2011
, vol. 
3
 (pg. 
75
-
86
)
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.