Background & Summary

Birds occupy nearly every habitat and ecoregion on Earth; however, many of these habitats experience large seasonal shifts in key ecological attributes such as length of day1, temperature2, rainfall3,4, and associated food and nesting material availability5. This has necessitated the adaptive evolution of complex strategies to maximise survival through seasonal migrations between breeding and wintering ranges. Migrations are carefully timed events, scheduled in such a manner that birds can optimise hours of daylight6, nighttime visibility7,8, and time spent at stop-over sites9 along their migration route to ensure timely arrivals for optimal habitat use. While most of these ecological attributes play some role in the timing of migration, one of the best studied attributes that serve as a trigger to initiate migration is the length of day, or photoperiod. The photoperiod is primarily responsible for daily oscillations within the regulatory feedback loops of the circadian clock, which differentially expresses genes during light or dark phases to maintain sleep-wake cycles in most organisms10.

One conundrum regarding migration in birds is how differential migration patterns are established and maintained within single species, even in the absence of extrinsic environmental triggers. For example, several species within the order Coraciiformes have distinct populations that are either year-round residents, with minimal altitudinal movement, or long-distance migrants. These include the Lilac-breasted roller11 (Coracias caudatus) and Woodland kingfisher12 (Halcyon senegalensis), both having subspecies that are delineated by differential migration, as well as the European bee-eater13 (Merops apiaster), which is considered monotypic but has a distinct resident population in Southern Africa. Understanding how differential migration is established and maintained within such species is key to assessing connectivity14, speciation at a subspecies level15, and potential population fitness16. This is particularly pertinent with regard to plasticity, or the ability to switch between behaviours17,18, should environmental conditions change considerably due to climate change19,20,21 or anthropogenic activity22,23,24,25.

Several studies have explored the possible genetic components that affect intrinsic time-keeping mechanisms and migration. Although various methods have been used, including genomic26, epigenetic27, and transcriptomic approaches28, most studies sought to identify genes or gene regions that show variation in either the sequence itself or the gene expression that can be correlated to divergent migratory behaviour. The key, however, is identifying variation that is linked to processes that interface with annual life events. This means variation that is connected either to the endocrine or metabolic changes29 that occur in preparation for migration and breeding, or to intrinsic time-keeping mechanisms such as the rhythmic expression of circadian genes, particularly those that interface with environmental changes that may serve as cues, such as photoperiod, temperature, lunar cycles, and food availability30. This is needed to exclude variants that co-vary with migration phenotypes but are not actively involved in shaping them. It is therefore no surprise that many candidate gene studies have explored variation within the network of genes of the circadian clock. Several associated candidate genes have been suggested, with length polymorphisms within short repeats of the Clock and Adcyap1 genes being the focus of many studies31,32,33.

To clarify the role of these genes in migratory phenotypes, a systematic review (Fig. 1) was conducted to identify, synthesise, and provide a reappraisal of the available evidence34. Structured searches of the literature with an optimised Boolean search string were done in five scientific databases. Search results were exported in formats compatible with citation network analysis software35. After duplicate entries were removed, citation network analyses were used for the automated screening of database results to identify the central literature on the topic. Publications identified from the citation network analyses were subjected to manual screening of the title, abstract, and key words to assess the potential eligibility for inclusion in the review. The final list of most eligible publications was sought for full text retrieval. A total of 66 studies were included in the final review, of which 34 were candidate gene studies and 32 were other, migration-related, studies. These included latitude/longitude/spatial analyses, timing of migration, and timing of egg laying/breeding. Most of the studies using a candidate gene approach were used for data retrieval. For these studies, datasets were retrieved as either diploid allele data of individuals or allele frequencies. Data sources included the main text of articles, supplementary materials, databases such as Dryad (https://datadryad.org/) or Figshare (https://figshare.com/), data extracted from images, or data received directly from authors. Unpublished data for an additional 12 species were also included. The dataset included individual level allele data from 52 species, with data available for 46 species for the Clock gene and 43 species for the Adcyap1 gene.
This dataset represents the largest collection of cross species allele data for two candidate genes used to test a putative association between clock gene polymorphisms and divergent migration in birds, which enables the testing for patterns of inheritance, evolutionary selection, relation to divergence times, and associations across a globally distributed dataset.

Fig. 1

PRISMA statement for the systematic approach used to identify studies that measured clock gene polymorphisms in relation to annual synchronicity of life events such as breeding and migration in birds. Further details are also provided for the retrieval of allele data for individual studies from various sources as well as reasons for exclusion of studies. (image edited in BioRender.com).

This data descriptor concisely summarises the methodology used to screen the literature and to compile the data, and presents the resulting data used in prior analyses in an easy-to-understand format. At present, none of the scientific databases that collect genetic variation data is suitable for the deposit of this specific type of data. The Barcode of Life Data System (BOLD, https://boldsystems.org/), which does accept length polymorphism data from microsatellite markers, currently only accepts data for markers used in barcoding or population assignment experiments and does not specifically store data for markers used in behavioural or phenotype-associated studies. The European Variation Archive (EVA, https://www.ebi.ac.uk/eva/), which also accepts variant data that includes length polymorphisms, currently only accepts data for species with reference genomes, which are still unavailable for most avian species. To overcome this, we have endeavoured to create a central compilation of the available data in two standard formats, which is archived in parallel to this data descriptor, with an additional online version on GitHub36 (https://github.com/LSLeClercq/AvianClocksData) that will be maintained and updated over time as more data is made available. This may greatly facilitate the reuse of the data where it may be applicable to other forms of analyses within migration genetics and beyond.

Methods

Literature search and automated screening

Literature was searched using systematic review methods, in line with PRISMA Ecology and Evolution guidelines37, to identify and synthesize relevant sources. The overall approach is depicted in the PRISMA statement38 in Fig. 1 that was supplemented with further information on the data retrieval and screening process. Literature was searched between January and September of 2022 on five databases: Scopus (N = 52, www.scopus.com), ScienceDirect (N = 1814, www.sciencedirect.com), Web of Science (N = 140, https://clarivate.com/), PubMed (N = 157, https://pubmed.ncbi.nlm.nih.gov/), and Dimensions (N = 2746, www.dimensions.ai). Databases were searched using an optimized Boolean search string derived from the PICO terms for the aim and objectives of the review. The final search string was as follows: (“Birds” OR “Avian”) AND (“Clock genes” OR “Clock” OR “Adcyap1” OR “Candidate gene”) AND (“Migration” OR “Flying”). As needed, this was complemented by ancillary ‘free term’ searches based on citations in articles or to include other relevant aspects such as “Breeding”, “Moult”, “Genomics”, “Transcriptomics” or “Photoperiod”. For the Scopus and Dimensions database searches, the results were exported in the comma separated value (CSV) format, while the results from the ScienceDirect, Web of Science, and PubMed database search were exported in the research information systems (RIS) format.
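The structure of the search string can be illustrated with a short Python sketch. It assembles the quoted string from its three term groups and adds up the per-database record counts reported above. The variable and function names are illustrative only, and straight quotation marks are used in place of the typographic ones, which is an assumption about what database search fields accept.

```python
# Term groups taken from the search string quoted above; straight quotes
# replace the typographic quotes of the article (an assumption).
TERM_GROUPS = [
    ["Birds", "Avian"],
    ["Clock genes", "Clock", "Adcyap1", "Candidate gene"],
    ["Migration", "Flying"],
]


def build_query(groups):
    """Join synonyms with OR inside a group and join groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Records returned per database, as reported in the text.
HITS = {
    "Scopus": 52,
    "ScienceDirect": 1814,
    "Web of Science": 140,
    "PubMed": 157,
    "Dimensions": 2746,
}

query = build_query(TERM_GROUPS)
total_hits = sum(HITS.values())  # raw total, before duplicates are removed
print(query)
print(total_hits)
```

The total is a raw count across databases and therefore still contains the duplicate entries that were removed before screening.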

Automated screening for inclusion was done through citation network analyses. For the Scopus database, the results were merged and reformatted with the R package ‘Scopus2CitNet 0.1.0.0’ (https://github.com/MichaelBoireau/Scopus2CitNet) in RStudio 1.4.110639, running R 4.0.540. The results were subsequently visualized by year in CitNetExplorer 1.0.0, keeping only those papers that overlapped in terms of references cited and the largest connected set (Fig. 2a). The results from the search on the Dimensions and ScienceDirect databases were visualized in VOSviewer 1.6.1635 by group as well as by year, keeping only those papers that are connected by citations and reference lists (Fig. 2b). The size of bubbles corresponds to citations and the number of cross-links between studies.
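The idea of keeping only the largest connected set of papers can be sketched in plain Python. This is not the algorithm implemented in CitNetExplorer or VOSviewer; it is a minimal breadth-first search over hypothetical citation links, with placeholder paper labels, meant only to show what the filtering step retains and discards.

```python
from collections import defaultdict, deque

# Hypothetical citation links (citing paper, cited paper); the labels are
# placeholders, not records from the actual searches.
CITATIONS = [
    ("paper_A", "paper_B"),
    ("paper_B", "paper_C"),
    ("paper_D", "paper_C"),
    ("paper_E", "paper_F"),
]
ALL_PAPERS = {"paper_A", "paper_B", "paper_C", "paper_D",
              "paper_E", "paper_F", "paper_G"}


def largest_connected_set(papers, links):
    """Return the largest group of papers joined by citation links."""
    neighbours = defaultdict(set)
    for citing, cited in links:
        # Direction is ignored: a link either way connects two papers.
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)

    seen = set()
    best = set()
    for start in sorted(papers):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


core = largest_connected_set(ALL_PAPERS, CITATIONS)
print(sorted(core))
```

In this toy network the pair `paper_E` and `paper_F` and the unconnected `paper_G` are dropped, which mirrors how loosely related search hits fall out of the screened set.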

Fig. 2

Visualised citation network for studies identified in literature searches. (A) Citation network of the Scopus and PubMed databases in CitNetExplorer. Publications are organized by year (2006–2021) with the name and first initial of the first author indicating individual studies. The relationships between studies by virtue of co-citations in the reference lists are indicated by grey lines. Subgroup analyses identified several key groups, indicated by the colour code from VOSviewer. Key candidate genes are indicated in red italics and show studies that assayed polymorphisms in the Clock, Adcyap1, CREB1, NPAS, and DRD4 genes. (B) Citation network for studies identified in literature searches of the Dimensions and ScienceDirect databases in VOSviewer. First authors are labelled by surname and first name. Automated group analyses identified ten clusters of related studies of which the studies identified from Scopus formed part of five groups, indicated as groups 2, 5, 6, 7, 9, and 10. This network shows the larger field of migration studies including non-candidate gene studies such as transcriptomic studies (group 10). (image edited in BioRender.com).

Manual title-abstract screening and full text retrieval

Sources identified from the citation networks were imported (citation and abstract) into Mendeley citation manager (www.mendeley.com) for further screening. Several types of studies relating to migration genetics were included in preliminary screening, such as candidate gene studies, genomic studies, transcriptomic studies, and epigenetic studies. Studies with a focus on endocrine systems, physiology, or telomeres were excluded. Studies on migration phenology, without an evident genetic link, were also excluded. The inclusion criteria of candidate gene studies were confined to studies that primarily measure Clock or Adcyap1 gene polymorphisms (as well as other candidate genes studied in parallel e.g., NPAS, CREB1, and DRD4: indicated on Fig. 234) within bird populations to compare putative variation to the annual synchronicity in life events and differential migration. These included latitude/longitude/spatial analyses, timing of migration, migratory restlessness, timing of egg laying/breeding, clutch size, moult, urbanisation, and exploratory behaviour. The final set of studies that passed preliminary screening was sought during full text retrieval, and the full text was added to the imported reference if it was not already included. A total of 66 studies were included in the final review, of which 34 were candidate gene studies and 32 were other, migration-related, studies using genetic methods. Some basic scientometric assessments of the final set of studies, including the plotting of publications per year (Fig. 3) as well as the geographic distribution (Fig. 4) of studies, were conducted using ABCal version 1.0.241 (https://github.com/LSLeClercq/ABCal).

Fig. 3

Plots indicating the distribution of publications by year. (A) Histogram of publications by year, indicating the first publications starting in 2007 up to more recent publications in 2022, with the largest number of publications between 2013 and 2015 and in 2019. (B) Density gradient display of studies in VOSviewer based on year of publication, indicating that most studies were published between 2006 (blue) and 2022 (red), with a high number of publications emanating from 2013–2016 (green to orange). (image edited in BioRender.com).

Fig. 4

Geographic distribution of candidate gene studies included in the final review dataset (N = 34) based on sampling locations. Related migration studies (N = 32), such as transcriptomic or epigenetic studies, were excluded. The density gradient plots the number of studies per country ranging from one study (green) to more than eight studies (red); countries in white are data deficient. The overall plot indicates that most studies emanated from sampling locations in Europe and North America, with only a small number of studies including sampling from parts of Africa and South America.

Published datasets

A total of 34 studies that used a candidate gene approach were identified for data retrieval. Data was retrieved from the main text, the supplementary material of the article, online data repositories such as Dryad42,43,44,45,46,47,48,49 and Figshare50, or additional data received directly from authors. Data types varied from allele frequencies to individual level diploid allele data. Allele data for the Barn swallow51 was retrieved from the text while data for the Yellow-legged gull52 was extracted from images using WebPlotDigitizer version 4.653. Allele data was generally derived from a single source with the exception of the European pied flycatcher44,49 and Willow warbler54,55,56. The species, data sources, and data types are summarized in Table 1 along with the sampling location and sample sizes. Frequency data was available for most published studies, with the exception of the bluebird species18. Species for which allele data was unavailable are summarised in Table 2. This includes species for which only frequency data was reported, species for which a non-clock gene approach was used, and studies for which only data summaries without frequencies were reported.

Table 1 List of species for which published allele data was collected and/or included in the review and data article.
Table 2 List of species for which other published data was collected and/or included in the review and data article.

Unpublished datasets

This study included unpublished data for twelve species in total, summarised in Table 3. The six North American species were sampled at Long Point Old Cut, Ontario, Canada, and included the American redstart (N = 26), Common yellowthroat (N = 31), Hermit thrush (N = 30), Magnolia warbler (N = 33), Swainson’s thrush (N = 29), and White-throated sparrow (N = 32). The six European species included the Common chiffchaff (N = 55) and five species of shearwaters: Barolo shearwater (N = 15), Boyd’s shearwater (N = 25), Great shearwater (N = 25), Manx shearwater (N = 23), and Yelkouan shearwater (N = 15). The Common chiffchaff was sampled from several locations in Sweden (N = 30, subspecies abietinus) and Kazakhstan (N = 25, subspecies tristis). Blood samples were taken from the brachial vein and stored in SET buffer at –80 °C. Shearwaters were sampled from several locations in Europe including France and Portugal, while several species were sampled from islands such as Iceland, Cape Verde, and territories of the United Kingdom such as Gough Island. A 1 ml blood sample was taken from the tarsal or the brachial vein during geolocator retrieval. Samples were collected in 1.5 ml plastic tubes containing 70% ethanol and stored at –20 °C until further analysis.

Table 3 List of species for which unpublished data was collected and/or included in the review and data article.

Samples were genotyped using established methods54. Briefly, samples of North American species were preserved in a buffer at room temperature until extraction with the ArchivePure DNA purification kit (5 PRIME, Hilden, Germany). Then, polymorphism at Clock and Adcyap1 3′-UTR was determined as before54, with PCR products labelled with HEX (Clock), 6-FAM (Clock and Adcyap1) or TAMRA (Adcyap1) dyes. For the Common chiffchaff, genomic DNA was extracted using a standard ammonium acetate protocol. All 55 samples were successfully genotyped and analysed for length polymorphism in the poly-Q repeat of the Clock gene following previously published protocols31. For shearwater samples, total genomic DNA was extracted from blood samples using the Speedtools® Tissue DNA Extraction kit (Biotools, Madrid, Spain) following the manufacturer’s instructions. Genotyping was subsequently performed with methods adapted from the literature31. Briefly, PCR products were generated with shearwater-specific primers for the Clock gene labelled with 6-FAM or HEX, followed by fragment analysis as in54 to determine the size of the poly-Q repeat.

Data Records

The data collated during the systematic review and meta-analysis were made available via the Zenodo repository at the time of publication. Additional inclusion and exclusion criteria were applied and a final set of 40 species (indicated by asterisk in Tables 1, 3) were included in the comparative analyses using Mantel and phylogenetic generalised least squares methods to test for an association between migratory phenotypes and candidate gene genotypes34,57. These data are available on Zenodo57 and include a workbook with the allele data as well as a results workbook with various population genetics measures including allele frequencies, Homozygosity (Ho), Heterozygosity (He), Hardy-Weinberg equilibrium58,59, and Ewens-Watterson60 results. The complete dataset was reformatted for distribution with this data descriptor and is available from two sources: the Figshare61 repository, as submitted with this article, and a maintained repository with version histories on GitHub36.
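To make the kind of summary measures named above more concrete, the following Python sketch derives allele frequencies, the observed share of homozygous individuals, and an expected heterozygosity from diploid genotypes. The genotypes are invented, and the formulas are common textbook definitions assumed here for illustration; the exact definitions used in the results workbook are not restated in this descriptor and may differ.

```python
from collections import Counter

# Hypothetical diploid Clock genotypes (polyglutamine repeat counts) for
# one species; None marks a missing call. Not values from the dataset.
GENOTYPES = [(11, 11), (11, 12), (12, 12), (11, 12), (None, None), (10, 11)]


def summarise(genotypes):
    """Return allele frequencies, observed homozygosity, expected heterozygosity."""
    # Individuals with a missing allele are skipped entirely.
    complete = [g for g in genotypes if None not in g]
    alleles = [allele for g in complete for allele in g]
    counts = Counter(alleles)
    total = len(alleles)
    freqs = {allele: n / total for allele, n in counts.items()}
    # Share of individuals carrying two identical alleles.
    observed_hom = sum(a == b for a, b in complete) / len(complete)
    # Expected heterozygosity as 1 minus the sum of squared allele
    # frequencies (assumed definition, no small-sample correction).
    expected_het = 1.0 - sum(p * p for p in freqs.values())
    return freqs, observed_hom, expected_het


freqs, obs_hom, exp_het = summarise(GENOTYPES)
print(freqs, obs_hom, exp_het)
```

Skipping incomplete genotypes is one possible way to treat missing data; other conventions would change the sample size and therefore the resulting values.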

Data (version 1.0.2) are available as a spreadsheet workbook, labelled “Avian Clock Gene Dataset”, with multiple sheets. The first sheet of the workbook, labelled “Index”, contains the table of contents, which has several columns (Table 4) that list species by common name and indicate data availability for Clock and Adcyap1 as well as the total sample size (N). Furthermore, the taxonomic classifications including genus, species, family, superfamily, parvorder, and order are also given. The species codes are hyperlinked to the allele data for individual species, contained in separate sheets within the same workbook. Individual sheets for species contain several columns including the species name, sample ID, and diploid alleles for Clock and/or Adcyap1 genes. Alleles are expressed as the number of polyglutamine repeats (QN) for Clock while the Adcyap1 alleles represent the amplified fragment length in base pairs (bp). The sum and average of alleles are also provided, and missing data is labelled as NA. For the purpose of individual species analyses, the species sheets from the workbook are also provided as individual comma separated value (CSV) files. The same data is also available on GitHub, with the workbooks available in the root directory and the individual CSV files available in a subfolder with the title “CSV”. The repository also contains a “README” file which provides some basic background and details on the data. Both the workbook and the CSV files can be read by Microsoft® Office (https://www.office.com/) as well as StarOffice™ (https://www.staroffice.com/), OpenOffice™ (https://www.openoffice.org/), and LibreOffice™ (https://www.libreoffice.org/).

Table 4 Description of field names and data for workbook and CSV files.
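A minimal Python sketch of how a species CSV file might be read is given here, with the NA label converted to a missing value and the allele sum and average recomputed. The column names `Species`, `ID`, `Clock_1`, and `Clock_2` and all values are placeholders chosen for illustration; the actual field names are those listed in Table 4 and may differ.

```python
import csv
import io

# Hypothetical extract of a species sheet; column names and values are
# placeholders and may differ from the field names in Table 4.
SAMPLE_CSV = """Species,ID,Clock_1,Clock_2
Example species,S01,11,12
Example species,S02,NA,NA
Example species,S03,12,12
"""


def parse_allele(value):
    """Convert an allele field to int, mapping the NA label to None."""
    return None if value.strip() == "NA" else int(value)


def load_genotypes(handle):
    """Yield (sample ID, allele 1, allele 2, sum, mean) for each row."""
    for row in csv.DictReader(handle):
        a1 = parse_allele(row["Clock_1"])
        a2 = parse_allele(row["Clock_2"])
        if a1 is None or a2 is None:
            # Sum and mean are undefined when an allele is missing.
            yield row["ID"], a1, a2, None, None
        else:
            yield row["ID"], a1, a2, a1 + a2, (a1 + a2) / 2


records = list(load_genotypes(io.StringIO(SAMPLE_CSV)))
for record in records:
    print(record)
```

For a file on disk, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, newline="")`, where `path` points to one of the files in the “CSV” subfolder.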

Technical Validation

Allele data comprises the heterozygous or homozygous diploid alleles for one or both studied clock genes as well as the sum and average of allele sizes. The data for Clock was normalized according to the poly-glutamine repeat size (QN) by subtracting the conserved non-repeat size (LC) in base pairs from the total fragment size (LT) and dividing by the codon size of three base pairs, following Eq. 1.

$$Q_{N}=\frac{L_{T}-L_{C}}{3}\qquad(1)$$

Data for Adcyap1 was generated using the same published primers and was kept as the total fragment size.
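As a worked illustration of Eq. 1, the Python sketch below converts Clock fragment sizes to repeat counts. The conserved length of 80 bp and the fragment sizes are invented values, since the conserved non-repeat size depends on the primers used in each study. The check that the repeat region is a whole number of codons is an added safeguard, not a step described in the text.

```python
def clock_repeats(total_length_bp, conserved_length_bp):
    """Apply Eq. 1: Q_N = (L_T - L_C) / 3, returning a whole repeat count."""
    repeat_bp = total_length_bp - conserved_length_bp
    # Added safeguard (an assumption): the repeat region should span a
    # non-negative whole number of 3 bp codons.
    if repeat_bp < 0 or repeat_bp % 3 != 0:
        raise ValueError("repeat region must be a non-negative multiple of 3 bp")
    return repeat_bp // 3


# Hypothetical conserved non-repeat size and fragment sizes in base pairs.
CONSERVED_BP = 80
fragments = [113, 116, 119]

repeats = [clock_repeats(size, CONSERVED_BP) for size in fragments]
print(repeats)
```

Adcyap1 alleles would bypass this function entirely, because they are kept as total fragment sizes.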