Trends in Genetics
Volume 22, Issue 4, April 2006, Pages 225-231

Phylogenomics: the beginning of incongruence?

https://doi.org/10.1016/j.tig.2006.02.003

Until recently, molecular phylogenies based on a single or few orthologous genes often yielded contradictory results. Using multiple genes in a large concatenation was proposed to end these incongruences. Here we show that single-gene phylogenies often produce incongruences, albeit ones lacking statistically significant support. By contrast, the use of different tree reconstruction methods on different partitions of the concatenated supergene leads to well-resolved, but real (i.e. statistically significant) incongruences. Gathering a large amount of data is not sufficient to produce reliable trees, given the current limitation of tree reconstruction methods, especially when the quality of data is poor. We propose that selecting only data that contain minimal nonphylogenetic signals takes full advantage of phylogenomics and markedly reduces incongruence.

Introduction

Molecular characters, primarily DNA and derived protein sequences, provide a wealth of new information that sheds light on many parts of the ‘Tree of Life’. However, molecular phylogenies based on single genes often lead to apparently conflicting results. To overcome this limitation, it is tempting to apply a genome-scale approach to phylogenetic inference (phylogenomics) by combining many genes. The number of published yeast genomes offers the opportunity to test this proposition. Indeed, using 106 genes from yeast genomes, a fully supported phylogeny has been obtained by the analysis of their concatenation [1,2]. Following from this, it has been anticipated that using large amounts of genomic data will mark the end of incongruence in phylogenetics [3].

The incongruence between two phylogenies can be the result of: (i) violations of the orthology assumption generated by mechanisms such as gene duplication, horizontal gene transfer or lineage sorting [4]; (ii) stochastic error related to the length of the genes; and (iii) systematic error leading to tree reconstruction artifacts generated by the presence of a nonphylogenetic signal in the data. Adopting a genome-scale approach theoretically overcomes incongruence arising from the first two causes: nonorthologous comparisons are gene-specific and will probably be buffered in a multigene analysis; and stochastic error naturally vanishes as more and more genes are considered. By contrast, systematic error is not expected to disappear with the addition of data [5]. Systematic error results from nonphylogenetic signals being present in the data, such as heterogeneity of nucleotide compositions among species (compositional signal), rate variation across lineages (rate signal) and also within-site rate variation (heterotachous signal) [6]. The bias causing systematic error creates a signal because, contrary to stochastic noise, it does not average out over several sites. If a bias is strong enough, it can dominate the true phylogenetic signal, causing the tree reconstruction method to be inconsistent and to yield an incorrect, but highly supported tree [5,7]. Therefore, phylogenomics, instead of ending incongruence, might open an era of real, statistically significant incongruence resulting from the use of different methods, different taxon samplings, or different character partitions of the same data set.
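The contrast between stochastic and systematic error can be illustrated with a toy numerical analogy. The sketch below is not a phylogenetic analysis: it assumes each site contributes a hypothetical per-site bias plus Gaussian noise, and the function name `mean_site_signal` and all parameter values are illustrative assumptions. It only shows that averaging over more sites shrinks the noise while leaving the bias untouched.

```python
import random


def mean_site_signal(n_sites: int, bias: float, noise_sd: float = 1.0, seed: int = 0) -> float:
    """Average per-site signal when every site carries the same bias plus random noise.

    Toy model (an assumption, not the authors' method): the true value is 0,
    so the returned average is the total estimation error.
    """
    rng = random.Random(seed)
    total = sum(bias + rng.gauss(0.0, noise_sd) for _ in range(n_sites))
    return total / n_sites


if __name__ == "__main__":
    # As the number of sites grows, the unbiased average approaches 0,
    # whereas the biased average approaches the bias (here 0.05).
    for n in (100, 10_000, 100_000):
        print(n, round(mean_site_signal(n, 0.0), 4), round(mean_site_signal(n, 0.05), 4))
```

In this analogy, adding sites plays the role of adding genes: the random component decays roughly as one over the square root of the number of sites, while the biased component stays constant and eventually dominates.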

To illustrate this paradox, we used the large data set of 106 genes (120 762 nucleotides) from 14 yeast species assembled by Rokas and Carroll [1]. Phylogenetic trees were inferred by maximum parsimony (MP) from nucleotide sequences, as in Ref. [1], and alternatively by probabilistic methods – Bayesian inference (BI) [8] or maximum likelihood (ML) [9–11] – because these methods are generally considered the most accurate [12,13]. In addition, because the diversification of these yeasts is ancient (>250 Mya [14]) and amino acid sequences evolve more slowly than nucleotide sequences, the translated protein sequences were also used to construct trees. Phylogenies were inferred from each of the 106 genes and from their concatenation, using two different methods (MP and BI) and two types of characters (nucleotides and amino acids), yielding a total of 428 trees. We estimated the level of incongruence as the number of bipartitions (or splits, i.e. groups of species defined by a branch of a phylogenetic tree), supported by more than a given bootstrap value, that are different between two trees. Our aim was to compare the level of among-gene incongruence for a given tree reconstruction method with the level of among-method incongruence for a given data set.
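The incongruence measure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each tree is represented as a dictionary mapping a bipartition (one side, as a frozenset of taxon names) to its bootstrap support, and a supported bipartition is counted as "different" when it is incompatible with a supported bipartition of the other tree. That reading of "different", the function names and the five-taxon example data are all hypothetical.

```python
from typing import Dict, FrozenSet, Set

Split = FrozenSet[str]
Tree = Dict[Split, float]  # one side of each bipartition -> bootstrap support (%)


def canonical(side: Split, taxa: FrozenSet[str]) -> Split:
    """Return the side of the bipartition that excludes the reference taxon."""
    ref = min(taxa)  # arbitrary but fixed reference taxon
    return taxa - side if ref in side else side


def supported_splits(tree: Tree, taxa: FrozenSet[str], threshold: float) -> Set[Split]:
    """Canonical bipartitions whose support exceeds the threshold."""
    return {canonical(s, taxa) for s, support in tree.items() if support > threshold}


def compatible(x: Split, y: Split) -> bool:
    """Two canonical splits can coexist in one tree iff they are nested or disjoint."""
    return x.isdisjoint(y) or x <= y or y <= x


def conflicting_splits(a: Tree, b: Tree, taxa: FrozenSet[str], threshold: float) -> Set[Split]:
    """Supported splits of either tree that contradict a supported split of the other."""
    sup_a = supported_splits(a, taxa, threshold)
    sup_b = supported_splits(b, taxa, threshold)
    conf_a = {x for x in sup_a if any(not compatible(x, y) for y in sup_b)}
    conf_b = {y for y in sup_b if any(not compatible(y, x) for x in sup_a)}
    return conf_a | conf_b


# Hypothetical example: ((A,B),C,(D,E)) versus ((A,C),B,(D,E)).
taxa = frozenset("ABCDE")
tree_a: Tree = {frozenset("AB"): 98.0, frozenset("DE"): 99.0}
tree_b: Tree = {frozenset("AC"): 96.0, frozenset("DE"): 60.0}

print(len(conflicting_splits(tree_a, tree_b, taxa, 95.0)))
print(len(conflicting_splits(tree_a, tree_b, taxa, 97.0)))
```

Raising the threshold from 95 to 97 removes the A+C grouping from the supported set, so the conflict disappears. This mirrors the distinction drawn in the text between apparent incongruence and statistically supported incongruence.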


Congruence among phylogenetic markers

The trees inferred from each of the 106 genes are all different (data not shown), yielding an apparent high level of incongruence. However, there are 3×10^11 possible binary trees connecting 14 taxa and it is possible that the different genes recover different, but very similar trees. Without taking statistical support into account, there are 25.9% and 24.6% different bipartitions when trees are inferred by either MP at the nucleotide level (MPnt) or by BI at the amino acid level (BIaa),
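The figure of roughly 3×10^11 trees for 14 taxa can be checked with the standard count of unrooted, fully resolved trees on n labelled tips, (2n−5)!!. The text does not say whether rooted or unrooted trees are meant. The unrooted count is used here because it matches the quoted order of magnitude, and the function name is illustrative.

```python
def unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary trees on n labelled taxa: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


print(unrooted_binary_trees(14))  # 316234143225, i.e. about 3.2 x 10^11
```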

Strong incongruence when different tree reconstruction methods are used

By contrast, a non-negligible statistically significant incongruence exists because of the use of different tree reconstruction methods, and the use of nucleotide versus amino acid sequences. On average, 14.2% (23.2%) of the bipartitions are different at the 95% (70%) bootstrap confidence level between the MPnt tree and the BIaa tree, albeit inferred from the same genes.

Does the phylogenomic approach avoid this incongruence? The answer is no: when phylogenies are inferred from the concatenation

Nucleotide compositional bias causes most of the incongruence

To better understand the source of this exceptionally high level of incongruence, trees inferred from the concatenation by MPnt, BInt, MPaa and BIaa were compared (Figure 1a–d). This enables separation of the impact of the type of characters considered from the impact of the reconstruction method used. The topology within the clade containing the five Saccharomyces species, Naumovia castellii and Candida glabrata was identical in all four cases. In addition, Debaryomyces hansenii invariantly

Saturation as an indicator of incongruence

Tree reconstruction artifacts are the result of the accumulation of multiple substitutions at the same position over time: convergences and reversions erase the genuine phylogenetic signal. When multiple substitutions dominate, the data set is said to be mutationally saturated. Without any bias, a highly saturated data set will produce an unresolved starlike phylogeny. However, when sequences have been generated by a heterogeneous evolutionary process, saturation will ultimately lead to

Conclusions and recommendations

We do not dispute the use of numerous genes for phylogenetic inference [1,2,21], because it is generally required to solve difficult phylogenetic questions [22,23]. However, contrary to some current opinions [1,2], obtaining a highly supported tree from the analysis of a concatenation of multiple genes does not guarantee that ‘it accurately represents the historical relationships’ [2]. Highly supported groupings can prove to be incorrect because of the inconsistency of the tree reconstruction

Acknowledgements

We thank Antonis Rokas for providing his alignment, and Franz Lang, Nicolas Lartillot, Nicolas Rodrigue and Naiara Rodriguez-Ezpeleta for critical readings of the manuscript. This work was supported by operating funds from Génome Québec. H.P. is a member of the Program in Evolutionary Biology of the Canadian Institute for Advanced Research (CIAR), which is acknowledged for salary and interaction support. H.P. is also grateful to the Canada Research Chairs Program and the Canadian Foundation for

References (35)

  • H. Philippe. Phylogenomics. Annu Rev Ecol Evol Syst (2005)
  • M.J. Phillips. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. (2004)
  • F. Ronquist et al. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003)
  • D.L. Swofford. PAUP*: Phylogenetic Analysis Using Parsimony and Other Methods, 4b10 edn,... (2000)
  • S. Guindon et al. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003)
  • G. Jobb. TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol. Biol. (2004)
  • J. Felsenstein. Inferring Phylogenies (2004)