Introduction

The genetic basis of crop yield is an agricultural research hotspot (Luo et al. 2021; Achary et al. 2020). Genes affecting crop yield must be identified to reveal the genetic basis and facilitate the increase in crop yield through genetic engineering and hybridization or selective breeding (Gu et al. 2020). Genome-wide association study (GWAS) and quantitative trait locus (QTL) analysis are the most efficient methods for detecting yield-related genes in staple crops, such as wheat, maize, rice, and cotton (Alemu et al. 2020; Su et al. 2020; Wang et al. 2020; Zhang et al. 2020). However, the yield-related genes of emerging crops are difficult to identify by using GWAS or QTL analysis due to the lack of genomic information. With its further development, the cost of the development of third-generation sequencing has dropped exponentially. Many genome sequences of emerging crops have been published and thus serve as a basis to reveal the genetic mechanism underlying their yield.

Weeping forsythia (Forsythia suspensa (Thunb.) Vahl) is a perennial plant of the genus Forsythia in the Oleaceae family. The main medicinal ingredients of this fruit are forsythin and forsythiaside A, which have broad-spectrum antibacterial and antiviral effects (Yan et al. 2016). Its fruit extracts are mostly incorporated in Chinese traditional medicines for influenza treatment (Wang et al. 2018). Lianhua Qingwen capsules containing this fruit extract can remarkably alleviate the symptoms of COVID-19 (Hu et al. 2020). The weeping forsythia fruits currently sold in the market have been harvested from wild populations; however, their yield and quality have declined due to recent “robbing green” phenomenon (Wang et al. 2018). The market demand for weeping forsythia has increased greatly due to its huge medicinal value, but the current supply from wild resources is insufficient (Li et al. 2020a). Large-scale artificial planting has been initiated in Henan, Shanxi, Shaanxi, Shandong, Hebei, and other provinces in China. However, most of the current cultivated populations are genetically mixed, and no high-yield varieties are applied for production. The shape and weight of field weeping forsythia fruits vary greatly and therefore affect its yield. The current yield is low at only 2250–3750 kg per hm2. Hence, additional planting areas are needed to meet the market demand.

Crop yield can be increased by enhancing variety, cultivation techniques, and environmental conditions (Zhu et al. 2015). Among which, variety improvement is the most important; identifying the genes related to crop yield greatly accelerates breeding speed. Fruit size is a quantitative trait controlled by multiple genes. In Cucumis sativus L. (Yang et al. 2013; Wei et al. 2016) and Citrullus lanatus L. (Ren et al. 2014; Dou et al. 2018), the fruit size is controlled by multiple genes. In Solanum lycopersicum L., 41 gene loci are related to fruit traits, and 17 of which are related to fruit size (Phan et al. 2019).

Linkage and association analyses are the main screening methods for quantitative trait genes (Sahu et al. 2015). Linkage analysis creates a mapping population through parental crossing and then maps the genes that control quantitative traits to the genome. This method has achieved great success in locating traits that conform to Mendelian inheritance (Ma et al. 2020). Association analysis does not require a mapping group and can use natural or cultivated groups as research materials. Natural groups undergo multiple rounds of genetic recombination, and linkage disequilibrium decay (LDD) exists within a short distance, thus allowing the precise positioning of genes (Li et al. 2020b). Association analysis is an important method to identify genes that control quantitative traits in crops (Daba et al. 2020). This technique is classified into genome-wide (GWAS) and candidate gene-based association analysis; the former is suitable for locating genes with unknown function. In Zea mays ssp. mays L., GWAS identified 25 known genes related to flowering time and photoperiod sensitivity and 37 new genes related to days to tassel, anthesis-silk interval, and photoperiod sensitivity (Li et al. 2020b). In Triticum aestivum L., 4 genes associated with plant dwarfing were determined using this method (Daba et al. 2020). In Glycine max (Linn.) Merr., 23 genome segments related to agronomic traits, such as yield, maturity days, seed oil content, seed protein, and 100-seed weight were recognized by GWAS (Bruce et al. 2020). These studies confirmed that GWAS effectively identifies genes that control quantitative traits in crops.

GWAS usually requires a large sample size for detection to improve the statistical power and recognize many candidate genes (Wang and Xu 2019). However, the use of at least several hundred samples requires enormous costs. Here, the yield-related candidate genes of weeping forsythia, an emerging medicinal crop, were identified by using a small number of cultivated samples from the same field and source. Although these fruits are from the same field and source, they still exhibit differences in shape and weight.

Weeping forsythia is a woody plant that takes 3 years from seeding to fruiting. Therefore, a long breeding cycle is required through conventional breeding. Effective methods must be urgently adopted to improve the breeding speed of the high-yielding varieties of this emerging medicinal crop. The identified genes related to the fruit size of weeping forsythia will serve as the foundation for molecular-assisted and genetic engineering breeding to greatly improve its breeding efficiency. The recently published genome sequence of weeping forsythia (Li et al. 2020a) was used as the basis.

In this study, weeping forsythia samples cultivated in the field were selected. Association analysis was performed based on the phenotypic data of fruit shape and weight and variant loci from genotyping by sequencing (GBS) to screen the candidate genes related to fruit yield. The findings will be useful for the subsequent molecular-assisted and genetic engineering breeding of weeping forsythia.

Materials and methods

Sampling and phenotypic measurement

Immature fruits of weeping forsythia, known as ‘Qingqiao,’ were harvested in July to extract their medicinal ingredients. Fresh and healthy fruits (Figs. 1 and S1) were collected from 60 seedlings of 5-year-old plants. All the individual plants are located in the same field in Hexi village (34.03° N, 111.03° E), Sanmenxia, China. Ten fruits were collected from each individual. The average length, width, thickness and weight of 10 fruits per plant were measured using a vernier caliper. Pearson correlation analysis was applied for the phenotypic traits using cor.test in R, where r > 0.7 is taken as the threshold of correlation. Fresh leaves were collected from the corresponding 60 seedlings and immediately stored in silica gel for subsequent DNA extraction.

Fig.1
figure 1

Partial fruit samples of sixty weeping forsythia seedlings

Library construction and genotyping

DNA extractions for each individual were performed on 30 mg of silica gel dried leaf material using CTAB methodology (Doyle and Doyle 1987). Genomic DNA was detected using the following methods to ensure the quality of library construction. (1) Agarose gel electrophoresis was performed to select the samples with the main band and prevent degradation and RNA contamination. (2) Nanodrop 2000 (Thermo Fisher Scientific, MA, USA) was used to select samples with an OD260/280 ratio between 1.8 and 2.2. (3) Qubit 3.0 (Thermo Fisher Scientific, MA, USA) was employed to detect samples with DNA concentration greater than 50 ng/µl and total amount greater than 2 µg. The qualifying DNA was randomly interrupted and then prepared following the steps of terminal repair, (A) tail addition, sequencing adaptor addition, purification, and polymerase chain reaction (PCR) amplification. After the library was constructed, Qubit 3.0 (Thermo Fisher Scientific, MA, USA) was applied for preliminary quantification and library dilution. Agilent 2100 (Agilent, CA, USA) was employed to detect the size of inserted fragments of the library. After the inserted fragments reached the expected size, qPCR was used to accurately quantify the effective concentration of the library to ensure its quality. The qualified libraries were sequenced on an Illumina NovaSeq system (Illumina, Inc., CA, USA) following the manufacturer's protocols.

Fastp 0.20.1 (Chen et al. 2018) was used for data quality control and filtering with the following standards: reads with polyG and polyX at the tail, N number greater than 5, base mass lower than 15, and ratio higher than 40%. After filtering, reads less than 15 in length were removed. The filtered reads were mapped onto the reference genome of weeping forsythia by using BWA 0.7.17 (Li and Durbin 2009). Variant loci were identified by using the GATK 4.1.8.1 (McKenna et al. 2010) and then annotated using snpEff 4.3 (Cingolani et al. 2012).

Population structure analysis

GATK was used to filter variant loci with the following filter parameters to obtain high quality SNP and Indel: QD < 2.0, FS > 60.0, MQ < 20.0, MQRankSum < − 12.5, and ReadPosRankSum < − 8.0. Vcftools 0.1.16 (Danecek et al. 2011) was applied to eliminate loci with MAF less than 3% and missing more than 0.2. The remaining loci were used for population structure analysis and GWAS.

Principal components analysis (PCA), admixture analysis, kinship relation analysis, and phylogenetic inferences were employed to infer the genetic relationships of all weeping forsythia samples. PCA was run using GCTA 1.93.2 (Yang et al. 2011). ADMIXTURE version 1.3 (Alexander et al. 2009) was used to investigate the maximum likelihood of the ancestry of each individual with co-ancestry clusters ranging from 1 to 8. Optimal clustering number was decided according to the cross validation (CV) error value. Phylogenetic relationships of the weeping forsythia samples were constructed using neighbor-joining method in PHYLIP 3.696 program (Felsenstein 2005). Kinship relation was analyzed using PLINK 1.90b6.12 (Purcell et al. 2007).

GWAS analysis

Given that the previous weeping forsythia genome (Li et al. 2020a) was not assembled at the chromosome level, many contigs were artificially combined into several chromosomes for analysis and data presentation. GWAS association analyses were performed on the merged files using EMMAX (Zhou and Stephens 2012) and FaST-LMM 0.4.6 (Lippert et al. 2011) with default parameters. The value of 1/variant loci (340,933) number was used as a threshold for screening variant loci associated with fruit traits. Genotyping was conducted on the variant loci and phenotypic data. Owing to the self-incompatibility of weeping forsythia (Qiao et al. 2020), the candidate genes related to fruit traits were searched on the 20 kb up/downstream of variant loci.

Results

Phenotypic traits of fruits

Four phenotypic traits, namely, fruit length, width, thickness, and weight, were measured (Table S1) for 10 weeping forsythia fruits. The weeping forsythia seedlings exhibited the following features: fruit length ranging 14.62–30.39 mm, fruit width ranging 8.30–14.86 mm, fruit thickness ranging 6.70–12.96 mm, and fruit weight ranging 0.38–1.71 g (Table S1, Fig. 2). The fruit length of weeping forsythia was correlated with the width (r = 0.510) and thickness (r = 0.471), but the correlation coefficients were not significant. Fruit size indexes such as length (r = 0.725), width (r = 0.847), and thickness (r = 0.821) were significantly correlated with fruit weight. According to the correlation coefficient, the fruit width and thickness significantly contributed to fruit weight.

Fig. 2
figure 2

Distribution of phenotypic data in fruit length, width, thickness, and fruit weight. The x axis represents phenotypic data, and the y axis represents the number of individuals

Identification and distribution pattern of variant loci

The GBS of all samples yielded 313,066,384 raw reads with an average of 0.87 × sequencing depth. A total of 313,037,766 clean reads remained after filtering. An average of 5,217,296 reads per sample were mapped on the genome of weeping forsythia, and the average coverage of each sample in the genome reached 7.05%. A total of 3,586,221 SNPs, 399,021 inserts, and 226,432 deletions were recorded after variant locus detection. According to the definition of snpEFF, 96.79% of these loci belong to modifier variation, 1.66% belong to moderate variation, 1.12% belong to low variation, and only 0.43% belong to high variation. After further filtration by GATK and vcftools, the remaining 340,933 variant loci were used for subsequent GWAS and population structure analyses.

Population genetic structure

Admixture (Alexander et al. 2009) was used to examine the population genetic structure of 60 weeping forsythia samples based on all variant loci. An optimal clustering at K = 1 groups was confirmed based on the CV error rate (Fig. S2). Admixture does not support the genetic structural stratification of these weeping forsythia samples. The NJ tree (Fig. 3a) and Kinship matrix (Fig. 3b) did not verify significant genetic differentiation in the 60 samples. However, a genetic close relationship was observed for F34/F36, F35/F38, and F29/F30. This result was also supported by the PCA analysis (Fig. 3c).

Fig. 3
figure 3

Population genetic structure of weeping forsythia. A, Phylogenetic tree of all individuals of weeping forsythia were constructed using the neighbor-joining method in the program PHYLIP. B, Each point represents an individual genotype. The amount of variation explained by PC1 and PC2 is 3.62% and 3.42%. C, Kinship matrix of all individuals of weeping forsythia was constructed using PLINK

Characterization of candidate genes of fruit shape and weight traits

GWAS analysis with 340,933 variant loci was performed for the fruit traits, namely, length, width, thickness, and weight. Several variant loci passed the threshold line (− log10(p) > 5.533) and showed significant association with fruit length (2), fruit width (8), fruit thickness (24), and fruit weight (13) (Figs. 4 and S3-6). Given that weeping forsythia is a self-incompatible plant, a genetic distance of 20 kb was used to screen the candidate genes strongly related to its fruit traits. The results showed that 2 candidate gene was related to fruit length, 17 were related to fruit width, 44 were related to fruit thickness, and 17 were related to fruit weight (Table S2). On the basis of gene annotation and previous studies on gene function, most of these genes had unknown function or were unrelated to fruit development. Only 4 genes might be related to the fruit development of weeping forsythia (Table 1). In particular, WRKY transcription factor 35 (WRKY35, EVM0030555) and salicylic acid-binding protein 2 (SBP2, EVM0012993) were related to fruit width and thickness, and auxin response factor 6 (ARF6, EVM0018222), and alpha-mannosidase (EVM0029771) were related to fruit thickness (Table 1).

Fig. 4
figure 4

GWAS-Manhattan and QQ plots for the fruit traits of fruit length, width, thickness, and single grain weight. Chr1 to Chr7 represent pseudochromosomes, the blue line represents the significance threshold of P < 0.00001, the red line represents the significance threshold of 1/variant loci number

Table 1 Four candidate genes related to fruit development of weeping forsythia fruits

Discussion

Increasing crop yield is an agricultural production research hotspot (Dreccer et al. 2019). Variety improvement is an important way to increase the yield of cereal and industrial crops (Parry and Hawkesford 2012; Wan 2018). Identifying the genes related to yield traits is the key to this strategy. The genes can be used as molecular markers for offspring selection in the selective and cross breeding or target genes for genetic engineering breeding (Gu et al. 2020). This technology is especially important for woody crops with long breeding cycles (Rehman et al. 2020). GWAS can identify the target genes for woody crop research (De la Torre et al. 2019) and usually requires a large sample size. With small samples, identifying the target gene is difficult because of the lack of statistical power and the possibility of false positive results (Wang and Xu 2019).

Weeping forsythia is an emerging medicinal crop that currently lacks excellent varieties for production and cultivation. Seedlings in production are generated from the propagated seeds of natural populations. The fruits have different traits, leading to low production yield. Although reproduction can be achieved through cutting or tissue culture of superior individuals, these methods are limited. Identifying its yield-related genes is the key for the variety improvement of weeping forsythia.

In this study, all weeping forsythia seedlings were obtained the same field and source, Yuncheng, Shanxi, China. The fruits were genetically similar as confirmed by the population genetic structure analysis (Fig. 3). No genetic stratification was found, and only a few individuals showed genetic similarities, such as F34/F36, F35/F38, and F29/F30. Given that all seedlings were propagated from seeds and the species has self-incompatibility characteristics (Qiao et al. 2020), the genetically similar fruits still exhibited significantly different traits (Fig. 2). This finding provides an opportunity to use small samples to identify candidate genes related to fruit yield. Given the small sample size, two methods were used for analysis, namely, EMMAX and FASTLMM. A relatively high threshold was also applied to increase the statistical power.

Multiple variant loci related to fruit length, width, thickness, and weight were identified by EMMAX and FASTLMM. Further search on the upstream and downstream genes of these variant loci revealed multiple candidate genes related to fruit traits. Many gene functions of weeping forsythia are unknown because only a few studies were conducted on the gene functions of Oleaceae (Table S2). According to the functional annotations of genes from other parallel species, most of the identified candidate genes may not be related to fruit development (Table S2). However, the relationship between these genes and the fruit development of weeping forsythia needs further verification. Only the following four genes are most likely to be related to fruit development.

SBP2 (EVM0012993) is a 29 kDa protein that catalyzes the conversion of methyl salicylic acid into salicylic acid (SA) (Tripathi et al. 2010). Transgenic studies proved that the transfer of SBP2 gene can remarkably increase the endogenous SA content of plants (Guan et al., 2019). SA can also improve plant adaptability to biotic and abiotic stresses by activating plant defense systems (Tiwari et al. 2017). A recent study on pomegranate found that SA contributes to fruit development and could improve fruit yield and quality (García-Pastor et al. 2020). The present results suggested that SBP2 gene is related to the fruit width and thickness of weeping forsythia and thus contribute to its fruit size.

ARF6 is a member of the ARF gene family and encodes a transcription factor that combines with auxin-responsive promoter elements to regulate the expression of auxin response genes. Auxin is an important plant hormone involved in flower fertilization, fruit setting, and fruit formation and development (De Jong et al. 2009). This hormone regulates cell division and expansion during fruit development to control the fruit size (Devoghalaere et al. 2012). In the present study, ARF6 was significantly related to the fruit thickness of weeping forsythia. Recent reports on ARF gene family members ARF5 and ARF9 in tomato suggested that this gene family can affect fruit size by regulating cell division and expansion (De Jong et al. 2015; Liu et al. 2018).

WRKY35 (EVM0030555) is a transcription factor belonging to WRKY family that regulates plant growth and development and plays important roles in fruit development, such as fruit ripening (Cheng et al. 2016), fruit color (Wang et al. 2017), and fruit disease resistance (Deng et al. 2020). Here, WRKY35 was found to be possibly related to the fruit size of weeping forsythia. Alpha-mannosidase (EVM0029771) was also associated with fruit size, but previous reports have confirmed its relation to fruit ripening and softening (Irfan et al. 2016). Whether these identified genes are related to fruit development or have unknown functions related to fruit yield must be verified by transgenic experiments.

In this study, the seedlings of weeping forsythia from the same field and source were used as subjects. Owing to this plant’s self-incompatibility, its fruits are genetically similar but have distinctly different traits. This characteristic allows the use of small sample size to identify candidate genes related to its fruit yield. However, the identified genes still need further functional verification. The findings serve as an important basis for the breeding of high-yield varieties of this emerging medicinal crop.