Introduction

The genetic architecture of oaks at different levels has been the subject of countless studies for a long time. Molecular genetic approaches are generally used to resolve taxonomic issues, infer phylogeny, population genetic structures, and adaptation of oaks (Conte et al. 2007; Simeone et al. 2016; Sork et al. 2016; Carabeo et al. 2017; Hipp et al. 2020). The intensive development of genetic approaches is opening new horizons in the study of oaks’ taxonomy and adaptive evolution. The advanced reduced representation approaches based on next-generation sequencing (e.g., RAD-seq) allow for the effective discovery of genome-wide single nucleotide polymorphisms (SNPs) obtaining high-resolution population genetic data, including non-model organisms such as forest trees (Davey and Blaxter 2010; Parchman et al. 2018). Through properly selected restriction enzyme or enzyme combinations, the restriction site-associated DNA sequencing (RAD-seq) method delivers thousands of SNP markers evenly covering the genome under study during genotypes of large numbers of individuals across many populations (Miller et al. 2007; Baird et al. 2008). Double digest restriction site-associated DNA sequencing (ddRAD-seq) is a two-restriction enzyme approach of RAD-seq, increasingly applied for the population genetic, phylogenetic, and candidate genes studies of oaks (Konar et al. 2017; López de Heredia et al. 2020; Tóth et al. 2021, 2023).

According to most recent predictions, climate change significantly affects the European climate. While the amplitude and duration of extreme events are increasing, forest site conditions get more arid characters (IPCC 2022). As part of the active adaptation efforts, forestry management is looking towards alternative tree species and their selected provenances that can cope with changing conditions, thereby increasing the biodiversity and resilience of forest ecosystems.

Turkey oak (Quercus cerris L.) is one of the most significant members of Quercus section Cerris, distributing from the Apennine Peninsula throughout the Balkans to Asia Minor. The north part of the area reaches the northern regions of the Carpathian basin, near 49° N; thus, Turkey oak is currently the northernmost penetrating species of the section and the whole subgenus Cerris (Old World oaks) (Menitsky 2005; Simeone et al. 2019). Recent dendrochronological studies suggest a differing drought adaptation strategy for Turkey oak and a higher tolerance against drought events than the most widespread white oak species (Quercus petraea (Matt.) Liebl.) in Hungary (Móricz et al. 2021; Mészáros et al. 2022). According to climate model predictions, Turkey oak habitats are expected to expand in response to climate change during this century (Thurm et al. 2018; Illés and Móricz 2022). Due to the drought-tolerant behaviour and the expected expansion, Turkey oak has great potential in future forestry adaptation efforts, not just in its native range but also in Western Europe. Therefore, deepening our knowledge of the species’ genetic makeup and adaptability is indispensable. According to the findings of available studies Turkey oak has a prominently high phenotypic and genetic variability with an increasing NW to SE gradient throughout its distribution range (Özer 2014; Bagnoli et al. 2016). However, currently, no high-resolution genome-wide level information is available for the species which would allow for revealing deep population structures and include functional genomic regions for association studies. The present study establishes a de novo and a reference-mapped genome-wide SNP dataset of eight Central and Southeast European Turkey oak populations to allow a detailed survey of the genetic variability and adaptability of the species in the investigated regions. We assume that through the thousands of SNP markers discovered and mapped to cork oak’s (Quercus suber L.) reference genome, gene flow, demography, and local adaptation processes of Turkey oak can be studied in more detail than ever before.

Our highly variable genome-wide SNP dataset through more than 17 K de novo and 26 K reference mapped SNPs allows deep insight into the genetic structure of eight Central and Southeast European Turkey oak populations for the first time. Combining reference-mapped SNPs with environmental conditions also provides opportunities for genome-wide association studies on the adaptation processes connected to changing climatic parameters and other related studies. In addition, this dataset was developed following the same methodology as the recently published SNP dataset of sessile oak (Tóth et al. 2021). The analyzed populations were collected from overlapping geographic regions; thus, there is a unique opportunity for comparative studies on the co-occurring white oak and a Cerris oak species.

Material and methods

Plant material

During the sampling work, we collected plant material from five populations in Hungary, two in Bulgaria, and one from Kosovo. The broader sampling regions were carefully selected in relation to findings concerning biogeography, the existing spatial genetic structure of the species under study, and the co-occurring white oaks (Bordács et al. 2002; Petit et al. 2002; Özer 2014; Bagnoli et al. 2016; Gömöry et al. 2020). To ensure complex evaluability and comparability, the exact sampling locations were selected from the sessile oak sampling sites in Tóth et al. (2021), in the Balkans, from within the Turkey oak – sessile oak dendrochronology sites in Móricz et al. (2021), and from the experimental site of the Síkfőkút Project in Hungary (Mészáros et al. 2022). Finally, a total of eight even-aged Turkey oak stands with assumed local origin were selected, two from Bulgaria (Balgari and Botevgrad), one from Kosovo (Pristina), and five from Hungary (Zselickisfalud, Iharos, Vállus, Szántód and Síkfőkút), respectively (Fig. 1, Table 1).

Fig. 1
figure 1

Geographic distribution of sampling locations (the distribution range of Turkey oak in light green (Caudullo et al. 2022))

Table 1 Locality of the sampled Q. cerris populations

During the sampling process, we collected fresh leaves from 88 mature trees from the eight populations and stored them by freezing (− 20 °C) or drying them with silica gel until processing. The spatial coordinates (GPS) of the population in every sampling location were recorded and a minimum of 30 m isolation distance was maintained to avoid sampling-related individuals.

Library preparation and ddRAD-sequencing

During lab processing, the total genomic DNA was extracted from seven leaf disks (5 mm in diameter) per sampled individual following the method of Dumolin et al. (1995), except for the HU3 population where the QIAGEN DNeasy Plant Mini Kit (QIAGEN, Hilden, Germany) was used. Next, extracted DNA samples were quantified by a Qubit 3.0 Fluorometer using the Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA). For the double-digestion process of the DNA samples, the combination of Pst I and Msp I restriction enzymes was selected according to the findings of Cumer et al. (2021). In advance restriction sites were analyzed in the CLC Genomic Workbench software version 12.0 (QIAGEN Bioinformatics, Hilden, Germany). Double-digestion was performed on 50 ng of DNA per sample at 37 °C for two hours using 0.1 µl of each enzyme Pst I and Msp I respectively (FastDigest restriction enzymes; Thermo Fisher Scientific, Waltham, MA, USA). Fragments were double-sided size selected by KAPA PureBeads with a 0.55–0.80X solution/bead ratio (Roche, Basel, Switzerland) to isolate 300–600 bp fragments of digested DNA. Then the selected fragments were ligated to oligonucleotide adapters (Table 2) using T4 DNA Ligase, according to the producer protocol (Thermo Fisher Scientific, Waltham, MA, USA). Ligated fragments were purified using 0.8 KAPA PureBeads (Roche, Basel, Switzerland), and then amplified PCR using NEBNext Oligos for Illumina (Dual Index Set 1; New England Biolabs, Ipswich, MA, USA) and KAPA HiFi Hotstart Ready Mix (Roche, Basel, Switzerland). During each amplification reaction, i5 and i7 indexed primers were used in the same 0.5 µl amount. The thermal conditions of the PCR cycles were set in the following order: initial denaturation for three minutes at 95 °C, denaturation for 30 s at 95 °C in 17 repetitions, annealing for 30 s at 55 °C, extension for 30 s at 72 °C, and a final extension for five minutes at 72 °C. After amplification, the quantity and quality of the created amplicon library were checked using a High Sensitivity DNA1000 ScreenTape system with 2200 Tapestation (Agilent Technologies, Santa Clara, CA, USA) and dsDNA HS Assay Kit with Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). Before the paired-end (2 × 300-bp) sequencing, the equimolarily pooled libraries were diluted to 10 pM using the 600-cycle sequencing kit v3.1 (Illumina, San Diego, CA, USA). Finally, next-generation sequencing of the prepared libraries was performed on a MiSeq Sequencing System (Illumina, San Diego, CA, USA) according to the producer protocol. Raw sequence data were uploaded to the NCBI Sequence Read Archive (SRA) and are available under the BioProject ID: PRJNA927246.

Table 2 Oligonucleotides sequences of adapters used for double-digestion

Bioinformatic processes of raw sequences

All bioinformatics processing work was performed on the Silicon Computers (SGI) HPC server of the University of Sopron Forest Research Institute (Sopron, Hungary) with 40 cores (80 threads) and 38 GB RAM. In the first step using MiSeq Control Software (Illumina, San Diego, CA, USA), raw reads of Illumina sequencing were demultiplexed, and then the adapter sequences from the ends of reads were trimmed. In the next step, sequences at the 3´ and 5´ end with a quality score lower than 30 were trimmed. Reads with a mean quality score lower than 30 or a length shorter than 200 bp were also filtered out by using FastQ Toolkit. The purified raw data were further computationally processed using Stacks 2.0 software (Catchen et al. 2013; Rochette et al. 2019). This processing phase was initiated with additional filtering and trimming on the 37 701 348 raw sequences retained in the previous step, implemented in the program “process_radtags”. With the sliding-window method (the 15% of the given read length), every read that has a quality score lower than 90% (raw Phred score of 10) was filtered out (Catchen et al. 2011). In addition, to allow proper assembly and omit the low-quality sequences at the end of reads, all reads were uniformly truncated to 200 bp (Catchen et al. 2011). After filtering out 384 290 low-quality and 6 143 450 adapter-containing reads, a total of 31 173 608 sequences were retained for further analysis. Thus, purified and uniformed reads were now suitable to the de novo assembly of RAD loci and the SNP calling.

De novo assembly and reference mapping

To allow multipurpose usage of our dataset we performed both de novo and reference mapped assembly. For de novo variant calling we implemented the de novo pipeline using the function “denovo_map.pl” of STACKS. This program is composed of three successive functions, “ustacks”, “cstacks” and “sstacks”, respectively (Catchen et al. 2011; Rochette et al. 2019). Before running the de novo pipeline, command parameters had to be optimized for the given dataset. These required parameters are M, the maximum number of mismatches allowed between two alleles; m, the minimum number of reads to form an allele; n, the maximum number of mismatches permitted between two individual loci to consider them as homologous (Mastretta-Yanes et al. 2015; Paris et al. 2017). Optimization was performed following the “r80” method, which maximizes the number of SNPs found in 80% of the individuals (Paris et al. 2017). To find the optimal M and n values, nine test runs of denovo_map.pl were done following the M = n rule in each run by setting its values from two to ten. The optimal value of M and n parameters is the one after which increasing the parameters no longer leads to the detection of more new SNPs (Fig. S1a). According to this approach, the value of M and n parameters was chosen to be three (Paris et al. 2017). For the value of the third parameter m the minimum number of reads to form an allele, the default value, 3 was selected. In the next step, a final run was executed with the above-selected parameters. After using the command “gstacks”, paired-end contigs were assembled, and SNPs were recalled (Rochette et al. 2019).

As a result, assembly and alignment of paired-end reads were conducted for 57 285 loci, from which 559 had paired-end reads that could not be assembled into a contig (1.0%). The assembly of a paired-end contig was successful for the remaining 56 726 loci (99.0%), with an average contig size of 319.3 bp. The mean of effective per-sample coverage was given as 16.1 × with 2.7 × standard deviation, 7.0 × minimum and 30.2 × maximum values. Table E S1 summarizes the per-population values; for a graphical visualization, see Fig. SI1b.

Reference mapping was carried out on the cork oak reference genome ( GCA_002906115.4 [haploid version]; https://www.corkoakdb.org; Ramos et al. 2018), as this species is the closest relative of Turkey oak among the complete genome sequenced oak species. To map our raw reads against the reference genome, we used the BWA-MEM v0.7.17 software (Li 2013), previously proving its effectiveness during the reference mapping of several oak species (e.g. Tóth et al. (2023)). During the mapping process, we used the software’s default parameters and discarded the unassigned scaffolds of the reference genome. After mapping, we sorted the resulting SAM files containing the sequences of individual samples by using SAMtools v1.10 software. In the next step, SAM files were converted to BAM files, and then reads with a mapping quality lower than five (MAPQ >  = 5) were removed from the dataset. Finally, we calculated individual- and population-level summary statistics by using ‘flagstat’ function in SAMtools (Fig. 2, Table S1, Table S1).

Fig. 2
figure 2

Success level of reference mapping

After reference mapping, variant calling was also carried out in STACKS software by using ‘qstacks’ program. During this, loci were reconstructed and a SNP catalogue was created by using sliding window algorithm. In addition, unpaired reads which would support SNPs aligned to only one strand (strand-bias error) were removed.

The final step of the variant calling process was the same for both the de novo and the reference-mapped datasets. This final filtering was implemented in the “populations” program of STACKS and its goal was to discard insecure SNPs (Catchen et al. 2011). During this, every SNP loci with a minor allele frequency lower than 0.05, a rate of missing individuals greater than 0.2 or a significant deviation from Hardy–Weinberg equilibrium (HWE, P < 0.001) was removed from the datasets (Catchen et al. 2011). In addition, the “minimum number of populations” value was set to default 1, which means that SNPs that occurred in only one population were also retained. This allows future users to further filter the datasets according to their study requirements.

For the de novo dataset, after implementing the filtering thresholds, 51 861 of the 56 725 loci did not pass the sample/population, and minor allele frequency constraints and 4206 sites were discarded because they were significantly outside the Hardy–Weinberg equilibrium. The final de novo dataset, thus, comprises 1 662 692 sites in total, containing 17 607 variant sites (SNPs) at 4865 loci.

In some cases, reference mapping could result in greater efficiency of loci discovery (Fitz-Gibbon et al. 2017; Shafer et al. 2017; Ulaszewski et al. 2021; Vaux et al. 2023). During our reference mapping, we also achieved higher success in loci assembly than de novo pipeline. As a result, we obtained 7184 loci with 2 908 293 sites, of which 53 923 were filtered out, and 26 059 variant sites were retained. This means a roughly 50% increase in the number of discovered SNPs.

Access dataset

The final SNP dataset was deposited at the ZENODO repository and is available via the following link: https://doi.org/10.5281/zenodo.8091252. Both de novo and reference mapped datasets inside the repository are in two PLINK formatted, UTF-8 encoded, tab-delimited ped, and map files. By using the freely available genomic analyzing software PLINK, the uploaded dataset can be easily manipulated, converted to several other genomic formats, or directly utilized in genetic tests (Purcell et al. 2007). The first column inside the ped extension file comprises the population IDs from 1 to 8, while the second column contains the individual IDs. The data in the third to sixth columns is undefined (mother ID, father ID, sex, and phenotype data). Data in the seventh column onwards include nucleotide sequences (coded as A, T, G, C), where every pair of two columns correspond to a diploid SNP locus. Missing data were coded as “0 0” according to standard PLINK formatting (see https://zzz.bwh.harvard.edu/plink/). In the first column of the map file, the number of chromosomes was coded to undefined in the de novo dataset. As for the reference mapped dataset, the first column of the map file contains the scaffold ID where the given SNP is located since cork oak’s genome was assembled only into scaffolds, not chromosomes. The second column contains the SNP IDs in the case of both de novo and reference-mapped files. The third column has non-defined values in both cases (genetic position of a given SNP), and the fourth column includes the physical positions of SNPs on RAD loci or the given scaffold by reference mapped data in bp. Elevation and GPS coordinates are available for every population in Table 1.

Quality control of raw sequences

The quality check of raw short-read data for every individual was performed using the FastQC v0.11.9 program (Andrews 2010) in three successive steps. In the first step, per-base and per-sequence quality was checked directly after sequencing; in the second step, after trimming bad quality bases from the 3´ and 5´ end; and in the third step, at sequence processing (whole read filtering). In each step, the 176 produced reports were combined and collectively evaluated with the MultiQC v1.14 program (Ewels et al. 2016), which allows for the effective simultaneous evaluation of large numbers of FastQC reports (Fig. SI2). Since the raw data was consistent with standard Illumina run in every quality control step, its quality was high enough for further processing (Kircher et al. 2011).

Applications

Although reduced representation technologies (such as RAD-seq) are becoming widely available, high-resolution datasets are still considered rare, particularly for forest tree species without reference genomes. The number of studies based on genome-wide SNP data is increasing in oak genetics, though mainly for white oaks. This dataset delivers highly polymorphic SNP markers for Turkey oak belonging to the group of Old World oaks for the first time. In addition, the sampling area affected the genetically highly diverse but scarcely studied Pannonian and Balkan regions. Since this dataset was generated in the same way as the recently published SNP dataset of sessile oak (Tóth et al. 2021) covering the same geographic regions, there is a unique opportunity for the comparative study of the two genetically relatively distant oak species.

For a preliminary population genetic overview basic descriptive indices were computed by using packages “adegenet” (Jombart and Ahmed 2011), “hierfstat” (Goudet and Jombart 2022), and “poppr” (Kamvar et al. 2014) in the statistical software R (R Core Team 2022). Table 3 summarizes the within-population values of observed heterozygosity (Ho), gene diversity (Hs), and the inbreeding coefficient (Fis) following Nei (1987). The calculated indices were compared with the same values from the sessile oak dataset (Tóth et al. 2021), as no such data are currently available for Turkey oak. According to the comparison, indices of Turkey oak populations were quite similar to the values of sessile oak populations except for the Fis values, which are higher in this case. According to the calculated 95% confidence intervals for mean Fis, the inbreeding coefficient is significantly greater than zero in each population. However, note only preliminary filtering was carried out on this dataset to ensure more flexible use. Accordingly, careful filtering may needed before population genetic analysis in line with the objectives of the given study.

Table 3 Population genetic indices by populations for the de novo and reference mapped datasets (where, Ho = observed heterozygosity, Hs = gene diversity, Fis = inbreeding coefficient by Nei (1987), and CI = 95% confidence interval for mean Fis)

Using the more than 17 K de novo and 26 K reference mapped highly polymorphic SNP markers creates several opportunities to study population genetic and genomic aspects of Turkey oak. The gene flow and spatial genetic structure of populations can be investigated in depth by considering geographic locations. In addition, high-resolution SNPs enable the fine-scale study of kinship networks inside populations and hybridization. On a larger temporal scale, discovering phylogeography and mapping glacial refugia and postglacial migration routes tracking the consequences of ancient hybridization and population migration (contraction and re-expansion) events become possible. Combining the reference-mapped dataset with climatic data also gives opportunities for genome-wide association studies.

Nevertheless, we draw the attention of future users that, as with every reduced representation method generated dataset, population genetic inference taken from this dataset should also be treated with particular care since these methods (such as ddRAD-seq) often introduce biases (Arnold et al. 2013; Cariou et al. 2016). Although RAD-seq generated data may have more adverse attributes concerning phylogeny estimation (short loci, low coverage and SNP density, non-random sampling because of the polymorphisms of restriction sites), with appropriate data care, the method is suitable for phylogenetic studies as well (Cariou et al. 2013; Hühn et al. 2022).