Introduction

Polycystic ovary syndrome (PCOS) is a common endocrine disorder that often occurs in women of childbearing age, often accompanied by insulin resistance (IR) and obesity [1]. According to the revised Rotterdam 2003 criteria [2], PCOS is diagnosed when two of the following three criteria are present: (1) oligo- or an-ovulation, (2) clinical and/or biochemical signs of hyperandrogenism, and (3) polycystic ovaries on ultrasound. As a highly prevalent endocrine disorder, PCOS affects 5–13% of reproductive-aged women and more than 10% of adolescent girls [3,4,5,6]. PCOS is associated with a variety of consequences, such as reproductive issues, metabolic abnormalities, and psychological disorders [7]. Furthermore, more than half of the PCOS patients suffer from infertility [8], and infertile women and adolescent girls with PCOS were reported to have reduced quality of life (QoL) [9,10,11].

PCOS is a multifaceted disease, and previous studies have found its association with genetic factors [12], environmental factors such as exposure to Bisphenol A (BPA) [13], and life style factors such as smoking [14]. However, the exact etiology of PCOS remains to be elucidated. More studies are needed to further explore the pathological mechanisms underlying PCOS to facilitate diagnosis and treatment of this common and jeopardizing disease.

Genetics play an important role in the pathogenesis of PCOS. Previous twin studies demonstrated that genetics explained more than 70% of PCOS pathogenesis [15]. Multiple genome-wide association studies (GWAS) conducted on Han Chinese, Korean, and European populations have identified a few genetic variants in association with PCOS, such as single-nucleotide polymorphisms (SNPs) in/near FSHR, THADA [16], YAP1, HMGA2 [17], KHDRBS3 [18], FSHB, GATA4/NEIL2 [19], ERBB4/HER4, and RAD50 [20]. However, biological interpretation of the identified genetic variants in the etiology of PCOS remains largely unclear. It is likely that the genetic variants identified in GWAS could exert their effects on diseases/disorders via gene expression because many of them are located in non-coding regions [21]. Therefore, exploring the relationship between genetic variation and gene expression can help better understand the regulatory pathways underlying the pathogenesis of PCOS.

Mendelian randomization (MR) is a method for exploring potentially causal association between an exposure and an outcome by using genetic variants as the instrumental variables (IVs) for exposure [22,23,24]. Compared with traditional statistical methods used in the association studies, MR can reduce confounding and reverse causation and is becoming increasingly popular in the exploration of etiological mechanisms [25, 26]. A novel analytical framework through a summary data-based MR (SMR) approach integrating cis- expression quantitative trait loci (cis-eQTL, i.e., genetic variants near a gene that explain the variance of the expression level of the gene [27]) or cis- DNA methylation QTL (cis-mQTL, i.e., CpG sites near a gene that explain the variance of the methylation level of the CpG site [28]) and GWAS data was recently proposed [29]. This method has been employed in identifying gene expressions or DNA methylation loci that are pleiotropically or potentially causally associated with various phenotypes, such as severity of COVID-19, major depression, and Alzheimer’s disease [30,31,32], implying that it is a promising tool to explore genes that are pleiotropically associated with complex traits.

In this paper, the SMR method integrating summarized GWAS data for PCOS and cis- eQTL data were applied to prioritize genes that are pleiotropically/potentially causally associated with PCOS.

Materials and Methods

This study utilized SMR method integrating GWAS summary results for PCOS and eQTL data in ovary and blood. All the data were publicly available. The analytic process of the present study is illustrated in Fig. 1.

Fig. 1
figure 1

Flow chart for the SMR analysis. A SMR analysis using eQTL data from ovary and B SMR analysis using eQTL data from blood. eQTL, expression quantitative trait loci; GWAS, genome-wide association studies; LD, linkage disequilibrium; PCOS, polycystic ovary syndrome; SMR, summary data-based Mendelian randomization; SNP, single nucleotide polymorphisms

Data Sources

eQTL Data

In the SMR analysis, cis-eQTL genetic variants were used as the instrumental variables (IVs) for gene expression. SMR analysis were performed using eQTL data in ovary because ovary is directly involved in PCOS. Analysis using eQTL data from blood were also conducted as blood might reflect hormonal or metabolic traits associated with PCOS. The eQTL data were obtained from the V7 release of the GTEx summarized data. Detailed information regarding sample acquisition and treatment were provided elsewhere [33]. The summarized data included 85 participants for ovary and 338 participants for blood [34]. The eQTL data can be downloaded at https://cnsgenomics.com/data/SMR/#eQTLsummarydata.

GWAS Data for PCOS

The GWAS summarized data for CCT were provided by a recent genome-wide association meta-analysis of PCOS [35]. The results were based on meta-analyses of seven cohorts of European descent, including a total of 10,074 PCOS cases and 103,164 controls. Details of the study participants and design of the study can be found in the paper published previously [35]. Diagnosis of PCOS was based on the NIH criteria [36], which was established in 1990 and requires hyperandrogenism (HA) and ovulatory dysfunction (OD), or the Rotterdam Criteria [2] which was established in 2003, includes the presence of polycystic ovarian morphology (PCOM) and requires the presence of at least two of three traits as specified above. The self-report-based data from 23andMe, Inc. (Mountain View, CA, USA) were excluded. The GWAS summarized data can be downloaded at https://www.repository.cam.ac.uk/handle/1810/283491.

SMR Analysis

In the SMR analyses, cis-eQTL was used as the instrumental variable, gene expression was the exposure, and PCOS was the outcome. The analysis was done using the method as implemented in the software SMR. SMR applies the principles of MR to jointly analyze GWAS and eQTL summary statistics to test for pleotropic association between gene expression and a trait due to a shared and potentially causal variant at a locus. Detailed information regarding the SMR method was reported in a previous publication [29]. The heterogeneity in dependent instruments (HEIDI) test [29] was conducted to evaluate the existence of linkage in the observed association. Rejection of the null hypothesis (i.e., PHEIDI<0.05) indicates that the observed association could be due to two distinct genetic variants in high linkage disequilibrium with each other. The default settings in SMR were adopted (e.g., PeQTL <5 × 10−8, minor allele frequency [MAF] > 0.01, removing SNPs in very strong linkage disequilibrium [LD, r2 > 0.9] with the top associated eQTL, and removing SNPs in low LD or not in LD [r2 <0.05] with the top associated eQTL), and false discovery rate (FDR) was used to adjust for multiple testing.

Data cleaning and statistical/bioinformatical analysis were performed using R version 4.0.4 (https://www.r-project.org/), PLINK 1.9 (https://www.cog-genomics.org/plink/1.9/), and SMR (https://cnsgenomics.com/software/smr/).

Results

Basic Information of the Summarized Data

The GWAS summarized data were based on GWAS meta-analysis of 113,238 subjects (10,074 PCOS cases and 103,164 controls) [35]. After checking of allele frequencies among the datasets and LD pruning, there were about 6.4 million eligible SNPs included in the final SMR analysis. The number of participants for GETx eQTL data in ovary is smaller (n=85), compared with that in whole blood (n=385), so is the number of eligible probes (1530 vs. 4490). The detailed information is shown in Table 1.

Table 1 Basic information of the GWAS and eQTL data.

Pleiotropic Association with PCOS

Information of the top ten probes using eQTL data for the ovary and whole blood was presented in Table 2. Although no genes showed significant pleiotropic association with PCOS after correction for multiple testing, some of genes exhibited suggestive significance. Specifically, RPS26 (ENSG00000197728.5) showed the strongest suggestive pleiotropic association with PCOS in both SMR analyses (β[SE]=0.10[0.03], PSMR=1.72×10−4 for ovary; β[SE]=0.11[0.03], PSMR=1.40×10−4 for whole blood; Fig. 2). PM20D1 (ENSG00000162877.8) showed the second strongest suggestive pleiotropic association with PCOS in the SMR analysis using eQTL data for the whole blood, and it was also among the top ten hit genes in the SMR analysis using eQTL data for the ovary (β[SE]=0.10[0.03], PSMR=2.94×10−3 for ovary; β[SE]=0.14[0.04], PSMR=8.38×10−4 for blood; Fig. 3). Two other genes, including CTC-457L16.2 (ENSG00000262319.1; β[SE]= −0.2[0.06], PSMR=1.55×10−3 for ovary; β[SE]= −0.31[0.09], PSMR=1.07×10−3 for blood; Fig. 4) and NEIL2 (ENSG00000154328.11; β[SE]= −0.2[0.06], PSMR=1.02×10−3 for ovary; β[SE]= −0.30[0.10], PSMR=2.25×10−3 for blood; Fig. 5), were among the top ten hit genes in both SMR analyses.

Table 2 The top ten probes identified in the SMR analysis.
Fig. 2
figure 2

Pleiotropic association of RPS26 with PCOS. A SMR analysis results using eQTL data for ovary. B SMR analysis results using eQTL data for whole blood. Top plot, grey dots represent the -log10(P values) for SNPs from the GWAS of PCOS, with solid rhombuses indicating that the probes pass HEIDI test. Middle plot, eQTL results. Bottom plot, location of genes tagged by the probes. GWAS, genome-wide association studies; SMR, summary data-based Mendelian randomization; HEIDI, heterogeneity in dependent instruments; eQTL, expression quantitative trait loci; PCOS, polycystic ovary syndrome

Fig. 3
figure 3

Pleiotropic association of PM20D1 with PCOS. A SMR analysis results using eQTL data for ovary. B SMR analysis results using eQTL data for whole blood. Top plot, grey dots represent the -log10(P values) for SNPs from the GWAS of PCOS, with solid rhombuses indicating that the probes pass HEIDI test. Middle plot, eQTL results. Bottom plot, location of genes tagged by the probes. GWAS, genome-wide association studies; SMR, summary data-based Mendelian randomization; HEIDI, heterogeneity in dependent instruments; eQTL, expression quantitative trait loci; PCOS, polycystic ovary syndrome

Fig. 4
figure 4

Pleiotropic association of CTC-457L16.2 with PCOS. A SMR analysis results using eQTL data for ovary. B SMR analysis results using eQTL data for whole blood. Top plot, grey dots represent the -log10(P values) for SNPs from the GWAS of PCOS, with solid rhombuses indicating that the probes pass HEIDI test. Middle plot, eQTL results. Bottom plot, location of genes tagged by the probes. GWAS, genome-wide association studies; SMR, summary data-based Mendelian randomization; HEIDI, heterogeneity in dependent instruments; eQTL, expression quantitative trait loci; PCOS, polycystic ovary syndrome

Fig. 5
figure 5

Pleiotropic association of NEIL2 with PCOS. A SMR analysis results using eQTL data for ovary. B SMR analysis results using eQTL data for whole blood. Top plot, grey dots represent the -log10(P values) for SNPs from the GWAS of PCOS, with solid rhombuses indicating that the probes pass HEIDI test. Middle plot, eQTL results. Bottom plot, location of genes tagged by the probes. GWAS, genome-wide association studies; SMR, summary data-based Mendelian randomization; HEIDI, heterogeneity in dependent instruments; eQTL, expression quantitative trait loci; PCOS, polycystic ovary syndrome

Discussion

This study integrated GWAS summarized data for PCOS and eQTL data in the SMR analysis to explore genes that were pleiotropically or potentially causal associated with PCOS. It was found that multiple genes were potentially involved in the pathogenesis of PCOS. To the best knowledge of the authors, this is the first study to explore genes in pleiotropic association PCOS through a Mendelian randomization approach.

In the present study, RPS26 (Ribosomal Protein S26) showed the strongest suggestive pleiotropic association with PCOS in the SMR analyses using both ovary and blood eQTL data (Table 2). RPS26, located on 12q13.2, is a gene encoding a ribosomal protein which is a component of the 40S subunit and belongs to the S26E family of ribosomal proteins [37]. Recent research found that RPS26 critically regulated T-cell survival in a p53-dependent manner [38]. Knockout of RPS26 in mouse oocytes led to retarded follicle development from pre-antral follicles to antral follicles and arrested chromatin configurations of the oocytes [39]. An earlier study using human ovary cDNA library found that RPS26 was downregulated in PCOS ovary, compared with the normal human ovary [40]. A recent study examined genes in PCOS-associated regions using a Bayesian colocalization approach (Coloc) and found seven genes, including RPS26, that harbored potential causal variants accounting for approximately 30% of known PCOS signals [41]. These findings, together with findings from the present study, suggested that RPS26 might play a critical role in the etiology of PCOS and highlighted the potential of this gene as a promising target for the prevention and treatment of PCOS.

PM20D1 (Peptidase M20 Domain Containing 1) showed the second strongest suggestive pleiotropic association with PCOS in the SMR analysis using blood eQTL data and was also among the top hit genes in the SMR analysis using ovary eQTL data (Table 2). PM20D1, located on 1q32.1, is a gene encoding a bidirectional enzyme capable of catalyzing both the condensation of fatty acids and amino acids to generate N-acyl amino acids and the reverse hydrolytic reaction, thereby regulating energy homeostasis [42]. Mice with increased circulating PM20D1 had increased N-acyl amino acids in blood and improved glucose homeostasis, and PM20D1 knockout in mice resulted in impaired glucose tolerance and insulin resistance (IR) [43]. In human adipocytes, PM20D1 is one of the most highly upregulated genes by the antidiabetic thiazolidinedione drug rosiglitazone, suggesting a potential role of this enzyme and/or N-fatty acyl amino acids in obesity and diabetes [44]. In addition, PM20D1 has been found to be associated with various diseases such as Parkinson’s disease [45] and Alzheimer’s disease [46]. To date, research is scarce on the association of PM20D1 with PCOS. Given that PCOS is a metabolic disorder that is closely related with obesity, IR, type 2 diabetes (T2D), and metabolic syndrome [47], it is highly likely that PM20D1 could be involved in the etiology of PCOS, and more studies are needed to explore the exact roles that PM20D1 plays in the pathogenesis of PCOS.

NEIL2 (Nei Like DNA Glycosylase 2) showed the third strongest suggestive pleiotropic association with PCOS in the SMR analysis using ovary eQTL data and was also among the top hit genes in the SMR analysis using whole blood eQTL data (Table 2). NEIL2, located on 8p23.1, is a gene encoding a member of the Fpg/Nei family of DNA glycosylases. The encoded enzyme is primarily involved in DNA repair by cleaving oxidatively damaged bases and introducing a DNA strand break via its abasic site lyase activity [48, 49]. The recent meta-analysis of GWAS on PCOS, on which the SMR analysis of the present study was based, found that the genetic variant rs804279 in GATA4/NEIL2 showed significant association with PCOS (OR=1.14, 95% CI: 1.10–1.18; P=3.76×10−12); however, significant heterogeneity was observed across the different studies [35]. This genetic variant also showed significant association with polycystic ovarian morphology and ovulatory dysfunction [35]. Another genetic variant rs8191514 in NEIL2 was predicted to generate a binding site for twenty transcription factors, and it was in linkage disequilibrium with the PCOS-identified genetic variant rs804279 (R2= 0.4, D′=0.97; P<0.0001) [50]. Moreover, 8p23.1, where the NEIL2 is located, is also the region of GATA4 (GATA Binding Protein 4), and knock-out of GATA4 led to abnormal responses to exogenous gonadotropins and impaired fertility in mice [51]. It also encompasses the promoter region of FDFT1 (Farnesyl-Diphosphate Farnesyltransferase 1), a gene encoding farnesyl-diphosphate farnesyl transferase that is involved in cholesterol biosynthesis pathway [52], thereby influencing testosterone biosynthesis. These findings, together with findings from the present study, highlight the importance of this region in association with PCOS.

This study has some limitations. The number of probes used in the SMR analyses was limited, and the sample size in the eQTL analysis was limited, especially for the eQTL data in ovary, which may lead to reduced power in the eQTL analysis. Consequently, some important genes implicated in PCOS may have been missed. Future SMR studies with larger samples for the eQTL analysis are warranted to identify additional genes underlying the pathogenesis of PCOS. The SMR analyses were performed only in participants of European ethnicity, and future studies are needed to explore whether the findings can be generalized to other ethnicities. Moreover, this study only explored gene expression probes in association with PCOS, and it is possible that genetic variants exert their effect on PCOS through other epigenetic mechanisms, such as DNA methylation. More studies integrating multi-omics data are needed to more systematically explore the complex mechanisms underpinning PCOS.

In summary, the present SMR study integrating GWAS of PCOS and eQTL data revealed that multiple genes were potentially involved in the pathogenesis of PCOS. Some of the genes were reported to be involved in the regulation of T-cell survival, energy homeostasis, and DNA repair. More studies are needed to examine the exact functions of these genes in the etiology of PCOS and to explore additional genes implicated in the pathogenesis of PCOS.