Exploring the utility of “next-generation” sequence data on inferring the phylogeny of the South American Valeriana (Valerianaceae)

https://doi.org/10.1016/j.ympev.2018.02.014

Highlights

  • We assessed the utility of GBS for resolving a radiation of a South American clade of Valeriana.

  • We recovered over 3000 unique loci, with 140 loci being shared by all taxa sampled.

  • Different phylogenetic methods inferred similar topologies with varying support.

  • The supermatrix approach recovered the most well-resolved and well-supported phylogeny.

Abstract

This study aimed to investigate the phylogenetic utility of genotyping-by-sequencing (GBS) data in the southern South American subclade of Valerianaceae (Dipsacales). The variety of forms that has arisen in this clade, presumably over the past 5–10 million years, has all the signatures of an adaptive and rapid radiation. While the phylogeny of Valerianaceae has received a great deal of attention in the last decade, species relationships have been hard to resolve using traditional phylogenetic markers. Here, we collected high-throughput genomic sequence data from reduced-representation libraries obtained through GBS protocols. Putative orthologs were identified by within- and among-sample clustering in the software pyRAD. We recovered over 3000 loci for 14 species of southern South American Valeriana, with 140 loci present across all samples. We analyzed a set of phylogenetic trees generated from each locus using maximum likelihood methods, as well as multispecies coalescent (*BEAST) methods. For comparative purposes, we also used a supermatrix approach to infer the phylogeny for these taxa. Across different methods and data sets, we recovered consistent relationships among the sampled southern South American valerians, although with varying degrees of support.

Introduction

The phylogeny of Valerianaceae (Dipsacales) has received a fair amount of attention over the past 10 years, with recent studies recovering strong support for relationships among the major lineages within the group (Bell and Donoghue, 2005, Bell et al., 2012, Bell et al., 2015). Molecular phylogenetic studies suggest that, following an introduction into South America, the group radiated and diversified, primarily in high Andean habitats. Previous studies also find support for two South American subclades: one consisting of species in the north (primarily found in páramo and puna high-elevation habitats) and another made up of species in the southern Andes (Bell et al., 2012, Bell et al., 2015). This southern Andean clade is the focus of our study and consists of 40 described species that are distributed across a wide elevational and ecological gradient (Kutschker and Morrone, 2011). They occur on the east and west sides of the Andean Cordillera and at low and high elevations, encompassing many different habitat types. The group radiated recently and rapidly (Bell et al., 2012) and many of its species occur in one of the world’s biodiversity hotspots in central Chile (Myers et al., 2000). As such, the Valerianaceae represents a powerful model for studying how biogeography, ecology and genetics drive diversification, and the implications for conservation. To conduct further studies, a well-supported, well-resolved phylogeny is essential. However, recent molecular phylogenetic studies (Bell et al., 2012, Bell et al., 2015) based on traditional molecular markers have had little success in resolving relationships within this subclade with any confidence.

Over the past decade, sequencing technologies have made significant progress, most recently with high-throughput sequencing (Mardis, 2008, Kircher and Kelso, 2010, Godden et al., 2013). These “next-generation” sequencing (NGS) methods produce large amounts of genomic sequence data quickly and more cost-effectively than traditional Sanger sequencing. Phylogeneticists have begun to take advantage of reduced-representation genomic approaches, such as restriction-site associated DNA sequencing (RADseq; Baird et al., 2008) and genotyping-by-sequencing (GBS; Elshire et al., 2011), which produce datasets of many short sequences from across the genome, located at restriction enzyme cut sites (McCormack and Faircloth, 2013, Eaton and Ree, 2013, Jones et al., 2013, Wagner et al., 2013, Escudero et al., 2014, Hipp et al., 2014, Eaton et al., 2017, Hipp et al., 2018, Hauser et al., 2017). These “reduced-representation genome sequencing” methods are particularly useful for phylogenetic studies because they produce many loci that may be phylogenetically informative and can be applied to non-model organisms, or taxa lacking a reference genome. Reduced-representation methods have shown promise for phylogenetic studies, especially among lineages that are <60 million years old (Rubin et al., 2012, Emerson et al., 2010, Cariou et al., 2013, Eaton et al., 2017). There are some drawbacks to these methods, however, including short sequence reads (50–100 bp), no distinction between orthologs and paralogs, locus dropout due to sampling error, disruption of restriction sites due to mutation, and the intensive bioinformatics needed to analyze such data. These drawbacks can limit the utility of reduced-representation methods for deeper timescale studies. Despite these limitations, the methods have successfully produced robust phylogenies for several different genera of plants (e.g., Eaton and Ree, 2013, Hipp et al., 2014, Escudero et al., 2014, Cavender-Bares et al., 2015, Boucher et al., 2016, Eaton et al., 2017, Fernandez-Mazuecos et al., 2017, Hauser et al., 2017).
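
To make concrete why these data sets are tied to restriction sites, and why a single mutation can cause a locus to drop out, the following minimal Python sketch locates PstI recognition sites (CTGCAG, the enzyme used in this study) in a sequence and extracts the bases that a short read anchored at each site would cover. The toy sequences, the function names, and the read length are assumptions made for illustration only; the sketch is not part of the GBS protocol or of pyRAD.

```python
import re

PSTI_SITE = "CTGCAG"  # PstI recognition sequence (palindromic)


def find_sites(seq, site=PSTI_SITE):
    """Return 0-based start positions of every occurrence of the site."""
    # A lookahead pattern reports overlapping matches as well.
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def downstream_tags(seq, read_len=64, site=PSTI_SITE):
    """Return up to read_len bases beginning at each recognition site."""
    seq = seq.upper()
    return [seq[pos:pos + read_len] for pos in find_sites(seq, site)]


# Hypothetical toy sequences: the second carries one substitution
# inside the first recognition site, so that locus is no longer recovered.
intact = "AAACTGCAGTTTTGGGCTGCAGCC"
mutated = "AAACTGCAATTTTGGGCTGCAGCC"

print(find_sites(intact), downstream_tags(intact, read_len=8))
print(find_sites(mutated), downstream_tags(mutated, read_len=8))
```

In the mutated sequence only one of the two sites is found, which is the kind of restriction-site disruption that leads to missing loci in more divergent taxa.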

This ability to obtain large numbers of sequences, from multiple individuals per species, has led phylogeneticists to start using multilocus, and especially multispecies coalescent-based tree inference methods (e.g., BEST Liu, 2008; STEM Kubatko et al., 2009; *BEAST Heled and Drummond, 2010; ASTRAL Mirarab et al., 2014; SVDquartets Chifman and Kubatko 2014). Using a concatenated approach with multiple genes can result in a well-supported, but incorrect, phylogeny (Kubatko and Degnan, 2007, Edwards et al., 2007). However, multispecies coalescent-based approaches have had success in overcoming these challenges by taking into account gene history variation (Delsuc et al., 2005, Rannala and Yang, 2008, Kumar et al., 2012). This becomes exceedingly important for lineages that have diversified rapidly, as they are more likely to retain ancestral polymorphisms due to the limited time to achieve reciprocal monophyly (Sanders et al., 2013, Eaton and Ree, 2013).
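
The link between rapid diversification and gene tree discordance can be illustrated with the standard three-taxon result from coalescent theory: for a species tree ((A,B),C) whose internal branch has length T in coalescent units, a gene tree matches the species tree with probability 1 − (2/3)e^(−T), and each of the two discordant topologies occurs with probability (1/3)e^(−T). The short Python sketch below evaluates these probabilities. The function name and the branch lengths are illustrative assumptions, and the calculation is not taken from the analyses in this study.

```python
import math


def gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units. Only incomplete
    lineage sorting is modeled; gene flow and estimation error are ignored.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant_each = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant_each,  # matches species tree
        "((A,C),B)": discordant_each,
        "((B,C),A)": discordant_each,
    }


# Shorter internal branches (more rapid radiations) give more discordance.
for t in (0.1, 1.0, 5.0):
    probs = gene_tree_probs(t)
    print(f"T = {t}: P(concordant) = {probs['((A,B),C)']:.3f}")
```

As T approaches zero, all three topologies become equally likely, which is why individual loci carry little signal about the branching order in a rapid radiation.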

In this study, we examine the phylogenetic utility of GBS data for inferring the phylogeny of the southern Andean Valeriana L. clade. To gather phylogenetic data that spanned the clade, we sampled 14 of the 40 recognized species from this area. To infer a species tree for our sampled taxa, we used the hierarchical Bayesian model implemented in *BEAST (Heled and Drummond, 2010), because it specifically models the discordance between gene trees and the species tree that arises from incomplete lineage sorting. In addition, we analyzed a concatenated GBS data set with traditional maximum likelihood (ML) methods. Although we included only a subset of the species in this subclade, this work serves as a starting point for assessing whether these data and methods can confidently resolve relationships, and whether further efforts will be valuable for understanding the evolutionary history of Valerianaceae.
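
As a rough illustration of what a concatenated data set involves, the Python sketch below selects loci by the number of taxa in which they were recovered and joins per-locus alignments into a single supermatrix, padding taxa that lack a locus with N. The data structure, function names, and toy sequences are hypothetical and chosen for illustration; this is not the pyRAD output format or the pipeline used in this study.

```python
def shared_loci(loci, min_taxa):
    """Keep only loci recovered in at least min_taxa samples."""
    return {name: aln for name, aln in loci.items() if len(aln) >= min_taxa}


def concatenate(loci, taxa, missing="N"):
    """Join aligned loci into one sequence per taxon (a supermatrix).

    loci maps locus name -> {taxon: aligned sequence}. Taxa absent from a
    locus are padded with the missing-data symbol.
    """
    matrix = {taxon: [] for taxon in taxa}
    for name in sorted(loci):
        aln = loci[name]
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"locus {name} is not aligned")
        for taxon in taxa:
            matrix[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(parts) for taxon, parts in matrix.items()}


# Hypothetical toy data: locus2 was not recovered for sp2.
taxa = ["sp1", "sp2", "sp3"]
loci = {
    "locus1": {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTT"},
    "locus2": {"sp1": "GG", "sp3": "GA"},
}

complete = shared_loci(loci, min_taxa=len(taxa))  # loci present in all taxa
supermatrix = concatenate(loci, taxa)             # all loci, with padding
print(sorted(complete))
print(supermatrix)
```

Raising or lowering min_taxa trades the number of loci retained against the amount of missing data in the matrix, which is the same trade-off behind the difference between the full set of loci and the subset shared by all samples.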

Section snippets

Sampling & sequencing

For this study, we originally sampled 31 species of southern Andean valerians, with 48 total samples. We extracted genomic DNA from silica dried plant tissues using CTAB methods (Doyle and Doyle, 1987, Cullings, 1992). We prepared the GBS libraries using the protocol outlined in Elshire et al. (2011). We used the restriction enzyme PstI (CTGCAG) to digest the extracted genomic DNA from each individual, and then ligated the resulting fragments to a barcode adaptor and a common adaptor with the

Sequences

Illumina sequencing returned 283,325,239 total reads made up of 13,339 Mbases. We chose to leave out some of the samples due to low coverage, possibly due to low quality of the extracted DNA, and ended up with 14 species, for a total of 18 samples (Table 1). Raw reads are available through NCBI Sequence Read Archive (BioProject ID: PRJNA295150).

Clustering of consensus sequences with our previously mentioned parameters in pyRAD revealed 8323 unique clusters, or loci, across all samples with 140

Phylogenetic studies and biological implications

Due to the limited taxon sampling for these analyses, direct comparisons to previous results are not always possible. However, resulting phylogenies from this study support several previous hypotheses concerning the evolution of the southern South American valerians. Based on molecular sequence data, Bell et al. (2012) inferred an initial appearance of Valerianaceae in the southern Andes during the Miocene (13.7 million years ago), a time that corresponds to the development of open

Acknowledgements

We thank C. Moreau and B. Rubin (Pritzker DNA Laboratory, Field Museum) for assistance in GBS library preparation. We also thank D. Eaton, R. Ree, and A. Hipp for help and advice on running pyRAD, and L. Coghill for additional help with assembly programs. Many of the specimens used in this study were kindly provided by S. Liede-Schumann (University of Bayreuth). This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (57)

  • Catchen, J.M., Amores, A., Hohenlohe, P., Cresko, W., Postlethwait, J.H., 2011. Stacks: building and genotyping loci de...
  • J. Cavender-Bares et al.

    Phylogeny and biogeography of the American live oaks (Quercus subsection Virentes): A genomic and population genetics approach

    Mol. Ecol.

    (2015)
  • J. Chifman et al.

    Quartet inference from SNP data under the coalescent model

    Bioinformatics

    (2014)
  • K.W. Cullings

    Design and testing of a plant-specific PCR primer for ecological and evolutionary studies

    Mol. Ecol.

    (1992)
  • F. Delsuc et al.

    Phylogenomics and the reconstruction of the tree of life

    Nat. Rev. Genet.

    (2005)
  • J.J. Doyle et al.

    A rapid DNA isolation procedure for small quantities of fresh leaf tissue

    Phytochem. Bull.

    (1987)
  • A. Drummond et al.

    BEAST: Bayesian evolutionary analysis by sampling trees

    BMC Evol. Biol.

    (2007)
  • D.A.R. Eaton

    PyRAD: Assembly of de novo RADseq loci for phylogenetic analyses

    Bioinformatics

    (2014)
  • D.A.R. Eaton et al.

    Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae)

    Syst. Biol.

    (2013)
  • D.A.R. Eaton et al.

    Misconceptions of missing data in RAD-seq phylogenetics with a deep-scale example from flowering plants

    Syst. Biol.

    (2017)
  • R.C. Edgar

    Search and clustering orders of magnitude faster than BLAST

    Bioinformatics

    (2010)
  • S.V. Edwards et al.

    High-resolution species tree without concatenation

    Proc. Natl. Acad. Sci. U.S.A.

    (2007)
  • R.J. Elshire et al.

    A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species

    PLoS One

    (2011)
  • K.J. Emerson et al.

    Resolving postglacial phylogeography using high-throughput sequencing

    Proc. Natl. Acad. Sci. U.S.A.

    (2010)
  • P. Etter et al.

    Local de novo assembly of RAD paired-end contigs using short sequencing reads

    PLoS One

    (2011)
  • M. Fernandez-Mazuecos et al.

    Resolving recent plant radiations: power and robustness of genotyping-by-sequencing

    Syst. Biol.

    (2017)
  • G.T. Godden et al.

    Making next-generation sequencing work for you: approaches and practical considerations for marker development and phylogenetics

    Plant Ecol. Div.

    (2013)
  • M.G. Harvey et al.

    Sequence capture versus restriction site associated DNA sequencing for shallow systematics

    Syst. Biol.

    (2016)