A benchmark for evaluation of phylogeny reconstruction programs
- Published
- Accepted
- Subject Areas
- Bioinformatics, Computational Biology, Evolutionary Studies
- Keywords
- Phylogeny reconstruction, Benchmark, Evolutionary protein domain
- Copyright
- © 2016 Spirin
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. A benchmark for evaluation of phylogeny reconstruction programs. PeerJ Preprints 4:e2628v1 https://doi.org/10.7287/peerj.preprints.2628v1
Abstract
Many algorithms and programs reconstruct the phylogeny of a set of proteins from a multiple sequence alignment. Many of these programs let users choose a number of parameters, for example the model used by maximum likelihood programs. Different programs and different parameter choices often produce different results. However, all benchmarks published so far for evaluating the relative accuracy of programs or parameter choices are based on simulated sequences. The aim of the present work is to create a benchmark that allows phylogenetic programs to be compared on large sets of alignments of natural protein sequences.
Author Comment
This is a preprint submission to PeerJ Preprints. A poster describing this work was presented at the CHARME / EMBnet / NETTAB 2016 Workshop in Rome.
Supplemental Information
Reference trees
Reference trees for three sets of organisms
Alignments
The archive contains alignments of orthologous series of protein domains. In each alignment, the sequences are orthologs from different organisms. Sequence names are the UniProt mnemonics of the corresponding organisms; see the tables in the file Organisms.xlsx for the full names of the organisms. File names are Pfam accessions (ACs) with a number (1, 2, etc.) added to distinguish different orthologous series from the same Pfam family. The folders Metazoa25, Fungi45 and Proteobacteria45 contain alignments of the full-size orthologous series; the other folders contain random selections of 10 and 15 (for Metazoa) or 15 and 30 (for Fungi and Proteobacteria) sequences from each orthologous series.
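The random selections described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the archive: it assumes the alignments are stored in FASTA format (the archive description does not state the format), and the function names `read_fasta` and `subsample_alignment`, the `seed` parameter and the toy sequences are hypothetical, introduced only for illustration.

```python
import random


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs.

    Assumption: alignments are FASTA files whose header lines start with
    the organism mnemonic, e.g. ">HUMAN".
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def subsample_alignment(records, k, seed=None):
    """Return k randomly chosen records, preserving their original order."""
    if k > len(records):
        raise ValueError("cannot select more sequences than the alignment holds")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    chosen = sorted(rng.sample(range(len(records)), k))
    return [records[i] for i in chosen]


# Toy aligned sequences, not taken from the archive.
toy = ">HUMAN\nMK-LV\n>MOUSE\nMKALV\n>DROME\nM-ALV\n>CAEEL\nMKAL-\n"
full = read_fasta(toy)
subset = subsample_alignment(full, k=2, seed=0)
print([name for name, _ in subset])
```

Because the rows are copied unchanged, the subset remains a valid alignment of the selected organisms, although some columns may then consist entirely of gaps.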
Tables of organisms
The file contains tables of the organisms from which the protein domains were taken. There are three tables: 25 Metazoa, 45 Fungi and 45 Proteobacteria. For each organism, its UniProt mnemonic is given; these mnemonics are used in the alignments and trees. For Metazoa, the taxonomic divisions listed allow a binary (i.e. fully resolved) tree to be constructed. For Fungi and Proteobacteria, some taxonomic divisions are also listed, simply to make the data easier to navigate.