Introduction

Lung adenocarcinoma (LUAD) is the most common histologic subtype of lung cancer and accounts for approximately 40% of lung cancer incidence worldwide1,2,3. In studies drawn from East Asian (EA) ancestry, LUAD has been the predominant histologic subtype among females2 and has replaced squamous cell carcinoma as the most common subtype in males4,5. Well established risk factors, namely, tobacco smoking, certain environmental/occupational exposures and lifestyle factors, and family history, contribute to the risk of LUAD6,7,8. In addition, multiple genome-wide association studies (GWAS) have identified at least 24 susceptibility loci for LUAD that achieved genome-wide significance, many drawn from studies in EA9,10,11,12,13,14,15 and European (EUR)16,17,18,19,20,21,22,23 populations, as well as multi-ancestry meta-analyses24,25. Of these, 12 loci have been reported at genome-wide significance in GWAS of either never-smokers9,11,12,13 or smokers and nonsmokers combined10,14,15,24 in EA populations while another two loci were suggested in a multi-ancestry meta-analysis24. We estimated that the known susceptibility variants account for only 13% of the estimated familial risk in EA populations. Accordingly, larger studies are needed to investigate the underlying architecture of susceptibility to LUAD in never-smokers and individuals with a history of smoking and in different ancestral populations. The importance of multi-ancestry analyses is further highlighted by reports of susceptibility loci showing association for LUAD in EA but not in EUR populations13.

In the current study, we conducted a two-stage GWAS meta-analysis in EA populations using unpublished and previously published data from four studies: the Female Lung Cancer Consortium in Asia (FLCCA), Nanjing Lung Cancer Study (NJLCS)10,24, National Cancer Center Research Institute (NCC) and Aichi Cancer Center (ACC), with 11,753 cases and 30,562 controls in the discovery set and 9905 cases and 120,114 controls in the replication set. A multi-ancestry meta-analysis of EA and EUR studies16,22 (from the International Lung Cancer Consortium, ILCCO) was performed to identify variants shared by both populations. We also investigated the heterogeneity of effect sizes for susceptibility variants identified in EA and EUR populations16,22 and obtained genome-wide estimates of effect-size correlation. Finally, we evaluated the genetic architecture26 of LUAD, characterized by the number of susceptibility variants and their effect size distribution after normalizing allele frequencies, to investigate the accuracy of genetic risk prediction in the future GWAS in EA populations with increased sample sizes.

Results

Two-stage GWAS meta-analysis of LUAD in East Asian populations

For the discovery set, we performed a fixed-effect meta-analysis (11,753 cases and 30,562 controls) drawn from EA studies (Table 1, Supplementary Table 1). Details of quality control, imputation and post-imputation filtering are described in Methods. Variants with an imputation quality score ≥0.5 and minor allele frequency (MAF) ≥ 0.01 were included for meta-analysis. The estimated genetic correlation between LUAD in never-smokers and individuals with a history of smoking was rg = 0.81 (s.e. = 0.16) using linkage disequilibrium (LD) score regression (LDSC)27, which enabled the primary meta-analysis to include the two groups. LDSC analysis suggested little evidence of residual population stratification (LDSC intercept = 1.03). We identified 14 loci achieving genome-wide significance P < 5 × 10−8 (Supplementary Table 2); two were novel at 2p23.3 (rs682888, OR = 0.89, P = 4.94 × 10−10) and at 7q31.33 (rs4268071, OR = 1.39, P = 7.27 × 10−10). In meta-analysis performed separately for males and females, and for never-smokers and individuals with a history of smoking, no further loci achieved genome-wide significance.

Table 1 Demographic characteristics of the subjects in the discovery and the replication datasets for a GWAS of lung adenocarcinoma in East Asians

In the replication phase, we selected 38 lead variants with P < 10−5 in the discovery data that were not previously reported as genome-wide significant in either EA or EUR populations and genotyped them in an independent data set of 9905 LUAD cases and 120,114 controls from a Japanese population (Table 1, Supplementary Table 1). After combining the discovery and the replication data, we identified a total of 10 novel loci achieving genome-wide significance and a novel variant on the locus at 15q21.2 that was previously reported in EUR populations16 (Table 2, Manhattan plot in Fig. 1, and regional association plots in Supplementary Fig. 1).

Table 2 Novel genetic variants associated with lung adenocarcinoma in East Asians
Fig. 1: Manhattan plot for GWAS meta-analysis of lung adenocarcinoma in East Asians.
figure 1

The x-axis represents chromosomal location, and the y-axis represents -log10(p-value). All p values were two-sided and not adjusted for multiple testing. The red horizontal line denotes the p value threshold for declaring genome-wide significance at \(5\times {10}^{-8}\). For each box, red text represents a novel variant (12 novel variants, including the lead variants from 10 novel loci, rs12664490 by conditional analysis at 6p21.1, a locus previously reported in East Asians, and rs71467682 at 15q21.2, a locus preciously reported in Europeans); black text represents a previously reported association (16 variants in total, including three independently associated variants in 5p15.33 locus). For each locus, a green circle represents the top p value from the discovery samples, a red diamond represents the p value combining the discovery and the replication data, a black square represents the p value combining our discovery data and Chinese samples in Dai et al.24 (for three variants identified in a cross-ancestry analysis of East Asians and Europeans in Dai et al.24, see Supplementary Table 3). In summary, 28 variants at 25 loci achieved genome-wide significance, including 16 previously reported variants and 12 novel variants.

Conditional analysis using GWAS summary statistics suggested two additional susceptibility variants rs13167280 (OR = 1.29, P = 4.07 × 10−13) and rs62332591 (OR = 0.87, P = 3.21 × 10−8) in the locus at 5p15.33 (Table 3, Supplementary Fig. 2); both are in modest LD with previously reported secondary variants in EA populations28 (R2 = 0.27 between rs13167280 and rs10054203;28 R2 = 0.19 between rs62332591 and rs1005420328). Variant rs12664490 (OR = 0.81, P = 1.24 × 10−10) was conditionally significant in a locus previously reported in EA at 6p21.1 (Table 3, Supplementary Fig. 3), adding another novel variant (12 novel variants in total).

Table 3 Conditional and joint analyses identified independently associated risk SNPs for lung adenocarcinoma at two existing loci in East Asians

A previous multi-ancestry meta-analysis conducted by Dai et al.24. that included Chinese samples and EUR samples from the ILCCO study identified three SNPs for LUAD, one of which achieved genome-wide significance and the other two were suggestive in their analysis restricted to the Chinese subgroup24 (see Supplementary Table 3). In the meta-analysis of the Chinese samples in Dai et al.24. with our independent EA samples, all three variants exceeded the threshold of genome-wide significance without issues of heterogeneity (Supplementary Table 3).

Overall, our study identified 12 novel susceptibility variants bringing the total to 28 genetic variants at 25 loci that have been identified to date in EA populations (Supplementary Table 4, Fig. 1). Assuming a familial risk estimate of 1.84 for first-degree relatives29, the 25 independent susceptibility variants for LUAD (Supplementary Table 4) captured 16.2% of the familial relative risk in EA populations. Moreover, we found no evidence that the SNP associations differed between the samples from the Mainland of China and those from outside of the Mainland of China, or between Han Chinese and Japanese, the two largest ancestry populations in our study (Supplementary Table 5).

We further examined whether the novel variants identified in this study were associated with smoking behaviors (i.e., smoking status, cigarettes per day, initiation age and cessation) or chronic obstructive pulmonary disease in the Biobank Japan Project30 (BBJ). We found no evidence that these variants were implicated in these traits in this cohort (Supplementary Table 6). A previous GWAS in EUR populations found variants (e.g., rs55781567) at the 15q25.1 CHRNA5 locus associated with tobacco smoking and lung cancer risk only in individuals with a history of smoking (OR = 1.33, P = 1.83 × 10−78, MAF = 0.39)16,18,19,31. However, this variant did not achieve genome-wide significance in our EA data (OR = 1.37, P = 0.001 for individuals with a history of smoking; OR = 1.05, P = 0.44 for never-smokers), likely because of a low MAF = 0.03, and no other variant in LD with this SNP showed a substantial association.

Fine mapping and functional analyses of GWAS loci

To prioritize candidate variants for functional follow-up from each of the LUAD GWAS loci, we performed Bayesian fine mapping using FINEMAP32 (Methods). Fine mapping of the genome-wide significant loci from the discovery set nominated 95% credible set variants for 9 loci with a median of 63 variants per locus (Supplementary Data 1). For the 12 novel variants identified from the combined discovery and replication datasets as well as conditional analysis, we then performed variant annotation analysis. High-LD variants for these signals (R2 ≥ 0.8 with the lead SNP in the 1000 Genomes, phase 3, EA) included those located in predicted promoters or enhancers in lung tissues/cells (RegulomeDB33, Haploreg34 v4.1, and FORGE2;35 Supplementary Data 2), which can be tested in future experimental studies.

To further characterize the functionality of the prioritized susceptibility genes that could explain the new GWAS loci, eQTL colocalization and transcriptome-wide association study (TWAS) analyses were conducted. Initial stratified LD score regression36 using GTEx data (Supplementary Fig. 4; Supplementary Data 3) indicated that LUAD heritability drawn from EA populations are enriched in lung tissue-specific genes and chromatin features compared to other tissues (aggregated rank test P = 1.36 × 10−2 and 7.7 × 10−3, respectively; Supplementary Data 3). Accordingly, we performed eQTL analyses using the Taiwanese dataset of adjacent normal lung tissues from 115 never-smoking lung cancer patients (LCTCNS) (Methods; Supplementary data 4). We performed colocalization analyses of eQTL genes using eCAVIAR37 and HyPrColoc38. A notable finding was the colocalization of FADS1 at 11q12.2 (rs174559, posterior probability = 0.91) (Fig. 2; Supplementary Data 5), particularly since rs174559 was in LD with a recently identified functional variant (rs174557) regulating allelic FADS1 expression in liver cells39. FADS1 encodes fatty acid desaturase 1, which is a key enzyme in the metabolism of polyunsaturated fatty acids and plays a key role in inflammatory diseases40. Higher FADS1 levels in the lung tissues were associated with LUAD risk, which is consistent with its role in increasing the proliferation and migration of laryngeal squamous cell carcinoma through activation of the Akt/mTOR pathway41. Among the known loci, colocalization identified TP63 at 3q28 and ACVR1B at 12q13.13 (Supplementary Data 5).

Fig. 2: Colocalization of lung adenocarcinoma GWAS signal from the new locus on Chr11 with FADS1 eQTL signal.
figure 2

Colocalization analysis was performed using HyPrColoc with summary statistics from Taiwanese lung eQTL data (for FADS1 gene, A) and those of EA GWAS discovery set (B). LD R2 (1000 Genomes, EA) of each SNP with the GWAS lead SNP, rs174559 (red circle), is color-coded as shown in the top band. Colocalization posterior probability (PP) is shown next to the candidate SNP, rs174559. Note that the p value of rs174559 in GWAS was based on the discovery data and did not include the Japanese replication data. All eQTL p values were two-sided and not adjusted for multiple testing.

We then performed a TWAS using LCTCNS eQTL dataset. TWAS identified FADS1 as a susceptibility gene from the 11q12.2 locus (TWAS P = 3.01 × 10−6) validating the finding from the colocalization analysis. We further identified ELF5 (TWAS P = 1.89 × 10−8) as a novel gene from a locus (at 11p13) not originally passing the genome-wide significance threshold based on a single-variant test in our EA discovery GWAS (Supplementary data 6, Methods). For these two loci, we also performed TWAS conditional analysis to assess whether genetically predicted expression of these genes explain most of the GWAS signal. When GWAS signal was conditioned on predicted expression of ELF5, most of the signal disappeared, adding support for ELF5 as the main susceptibility gene in this locus (Supplementary Fig. 5A). ELF5 encodes E74-like factor 5, a key transcription factor of alveologenesis of mammary glands42. Lower levels of ELF5 were associated with LUAD risk in the TWAS. Similarly, when GWAS signal was conditioned on predicted expression of FADS1, the strongest part of the signal disappeared (Supplementary Fig. 5B). We further performed TWAS analysis using GTEx lung eQTL dataset (v8, n = 515, ~85% Europeans) and identified five genes from four loci (Supplementary Data 6). While identification of ELF5 was common between two datasets, TWAS using GTEx data identified four unique genes from three known loci (DCBLD1, MPZL3, JAML, and LINC00674). Notably, FADS1 was identified only by ancestry-matched LCTCNS eQTL dataset even with a ~ 4 times smaller sample size.

An investigation of the local environment of susceptibility loci revealed further plausible candidate genes that could be pursued in laboratory follow-up. For instance, rs137884934 on 3q22.3 maps to PIK3CB encoding an isoform of p110 catalytic subunit of Class IA PI3K43. Previous studies have shown that PI3K/Akt/mTOR signaling pathway plays an important role in the development and progression of non-small cell lung cancer44. Moreover, rs764014 on 15q21.3 is located adjacent to NEDD4, which is a negative regulator of tumor suppressor PTEN45, which encodes a lipid phosphatase which counteracts the growth promoting effect of PI3K pathway46.

Multi-ancestry meta-analysis in East Asian and European populations

To identify variants shared by EA and EUR populations, we performed a fixed effect, multi-ancestry GWAS meta-analysis including data from samples in EA (11,753 cases and 30,562 controls) and samples from EUR populations (11,273 cases and 55,483 controls). We identified four additional loci (Supplementary Table 7) with similar effect sizes in the two populations: rs1130866 (2p11.2, OR = 1.08, P = 1.56 × 10−8), rs2320614 (4q32.2, OR = 1.08, P = 6.51 × 10−9), rs34638657 (16q23.3, OR = 1.09, P = 2.19 × 10−9) and rs638868 (18q12.1, OR = 1.08, P = 3.6 × 10−8). Regional association plots are shown in Supplementary Fig. 6. A multi-ancestry meta-analysis stratified by smoking status did not reveal loci specific to never-smokers or individuals with a history of smoking (sample size information in Supplementary Table 8).

Among the four loci, rs1130866 at 2p11.2 is a missense variant (Ile131Thr) of SFTPB, encoding surfactant protein B. Pulmonary surfactant lines the alveoli of lung to reduce the surface tension and is essential for lung function, and increasing circulating level of pro-SFTPB suggested increased lung cancer risk based on prediagnostic samples47. Notably, two other novel variants, rs34638657 at 16q23.3 (MPHOSPH6)48,49 and rs2320614 at 4q32.2 (NAF1)50, are on or near genes implicated in telomere biology. Together with other known or new loci (rs2736100 TERT, rs4268071 POT1, rs75031349 RTEL151,52, rs7902587 OBFC153, rs35446936 TERC) (Supplementary data 7), our findings further support the role of telomere biology in LUAD.

Mendelian randomization analysis of telomere length

We performed a Mendelian randomization (MR) analysis to investigate a potential causal relationship between telomere length and the risk of LUAD. The MR analysis was based on 46 independent variants identified in a recent multi-ancestry GWAS of telomere length in the TOPMed study54, cumulatively accounting for 3.74% of telomere length variance (Methods). Since genetic effects on telomere length showed no evidence of heterogeneity across populations in the TOPMed study, we used the genetic effects estimated based on all populations in the TOPMed study. Our MR analysis was based on MR-PRESSO55, a robust approach that estimates causal effects after removing variants detected with evidence of pleiotropic effects. Genetically predicted longer telomere length was significantly associated with increased risk of LUAD with similar ORs (per one standard deviation change in genetically increased telomere length) between the two populations: OR = 2.61 (95% CI = 2.08, 3.28, P = 8.14 × 10−10) in EA populations, OR = 2.67 (95% CI = 2.07, 3.43, P = 7.14 × 10−9) in EUR populations, consistent with previous MR reports56,57,58 as well as a study of white blood cell DNA telomere length and lung cancer risk in multiple prospective cohorts59. MR analyses stratified by smoking status showed similar results between never-smokers and individuals with a history of smoking (Supplementary Table 9). We performed sensitivity analyses using genetic effects estimated based on Asian and European populations in the TOPMed study separately and found similar results (Supplementary Table 9).

Comparing the genetics of LUAD in EA and EUR populations

We systematically compared the effect size in EA vs. EUR populations of 38 susceptibility variants for LUAD. These included 12 variants identified in the current study, 26 variants previously reported in EA10,11,13,14,15,31 and/or EUR16,19,20 populations, and results of multi-ancestry meta-analyses combining data from EA and EUR24 populations (Supplementary Data 8). As expected, ten SNP associations that were independently identified in both populations and through multi-ancestry analysis were very similar (Fig. 3A, B, C). In contrast, out of the 20 SNP associations initially identified in EA populations, two had MAF < 0.01, 11 showed no evidence of association within EUR populations at P < 0.05 (Fig. 3D and Fig. 3E, Supplementary Data 8), and 11 associations were significantly different between the two populations with FDR < 0.05. Similar population differences were observed among never-smokers and individuals with a history of smoking (Supplementary Fig. 7). For variants with MAF > 0.01 in both populations, the lack of association in EUR populations did not seem to be driven by low MAF or lower statistical power, as MAFs in both populations for most variants were similar and GWAS in both populations had adequate power to detect at least some evidence of association (Supplementary Data 9). Further, evaluation of gene region plots that spanned 500 kb for these loci within EUR populations showed no or very weak evidence of association for other variants in the region as well as the lead variants from the EA populations (Supplementary Fig. 8A–J), with one exception (Supplementary Fig. 8K). For 8 SNPs initially identified in EUR populations, there was evidence of association for 5 variants in EA populations (Fig. 3F, Supplementary Fig. 9) although all variants were attenuated in the EA compared to the EUR population and one variant had MAF <1% in EA; moreover, two variants were significantly weaker (Supplementary Data 8, Supplementary Fig. 9). Similar patterns were observed among never-smokers and individuals with smoking history (Supplementary Fig. 7).

Fig. 3: Comparing odds ratios (ORs) of lung adenocarcinoma susceptibility variants between East Asian (EA) and European (EUR) populations.
figure 3

Here, the effect allele was defined as the minor allele in EA. Each error bar represents the 95% confidence interval of the OR (the center). A Susceptibility variants previously discovered (at genome-wide significance) in both EA and EUR populations. B Variants previously identified by multiple-ancestry meta-analysis of Chinese and EUR populations; C Variants were identified by multiple-ancestry meta-analysis combining EA samples in our study and EUR samples in ILCCO. D Variants identified only in EA populations. E Novel variants identified in the current study; F Variants identified only in EUR populations. Variants are labeled with *, **, *** and **** corresponding to 0.01 ≤ phet < 0.05, 0.001 ≤ phet < 0.01, 0.0001 ≤ phet < 0.001 and phet < 0.0001, respectively; here, phet (t-statistic, two-sided) is the p value for testing the heterogeneity of effect sizes between EA and EUR populations. Sample sizes for EUR populations in all panels: 11,273 cases and 55,483 controls. Sample sizes for EA populations: 11,753 cases and 30,562 controls for (A, B, C, D, and F); 21,658 cases and 150,676 controls for (E).

We used LDSC27 to evaluate the heritability and genetic correlation between individuals with a history of smoking and never-smokers within each population and POPCORN60 across populations. The genetic correlation was weaker between never-smokers in EA and EUR populations compared to individuals with a history of smoking (Supplementary Fig. 10) although power was limited given the relatively small sample sizes within each group (Supplementary Table 8). Larger sample sizes are needed to estimate these characteristics more precisely.

Polygenic risk score and gene-smoking interaction analysis

We investigated whether the polygenic risk score (PRS), which was based on the cumulative effect of 25 independent susceptibility loci for LUAD in EA (Supplementary Table 4), interacted with smoking status to influence the risk of LUAD, given previous evidence of gene-environment interaction61,62. Since only summary statistics were available for some datasets (instead of individual genotype data), we developed a statistical method for testing the multiplicative smoking-PRS interaction using the summary statistics for the susceptibility variants (Methods). Compared to the middle quintile that represents the average risk in the general population, the top quintile had OR of 2.07 (95% CI = 1.99, 2.15) for never-smokers and 1.80 (95% CI = 1.70, 1.89) for individuals with a history of smoking (Pinteraction = 0.0058, Fig. 4, Supplementary Fig. 11), providing statistical evidence that the association between PRS and LUAD risk was higher for never-smokers. Moreover, we tested for the presence of multiplicative interactions between smoking status and each individual susceptibility variant in the PRS and found five variants with stronger associations in never-smokers than in individuals with a history of smoking (P < 0.05) (Supplementary Table 2).

Fig. 4: A polygenic risk score (PRS) is more strongly associated with risk of lung adenocarcinoma in never-smokers than in individuals with a history of smoking (P = 0.0058).
figure 4

The PRS was defined based on 25 independent variants that achieved genome-wide significance in EA with weights derived from the meta-analysis of the current study (Supplementary Table 4). The odds ratios (ORs) and the standard errors of the 12 novel variants were based on 21,658 cases and 150,676 controls. The ORs and the standard errors of the other 13 variants were based on 11,753 cases and 30,562 controls. The figure shows the ORs and their 95% confidence intervals comparing each quintile group to the middle quintile for individuals with a history of smoking (blue) and never-smokers (red).

Genetic architecture, performance of PRS and sample size requirements in EA populations

To further investigate the underlying genetic architecture of susceptibility (Methods) to LUAD63 in EA populations, we performed a GENESIS26 analysis based on the GWAS summary statistics for our larger never-smoker dataset. We estimated that ~2275 (s.e. = 1167) susceptibility variants are independently associated with LUAD, suggesting that LUAD is a highly polygenic disease and most of the susceptibility variants have very small effect sizes. Based on the estimated parameters, we investigated how the performance of a PRS, measured as the area under the receiver operating characteristic curve (AUC), depended on the sample size of the training GWAS (Fig. 5). The AUC is predicted to be 60.7% (95% CI = 56.6%, 64.8%) at the current sample size and will increase to 66.9% (95% CI = 62.5%, 71.3%) when the sample size increases to 70,000 cases with one control per case and 68.4% (95% CI = 64.0%, 72.8%) with 1000,000 controls. Of note, even a small increase of AUC value for a PRS can help identify many more subjects at risk64.

Fig. 5: The expected area under the receiver operating characteristic curve (AUC) of a polygenic risk score (PRS) built based on a GWAS of specified sample sizes for lung adenocarcinoma in never-smoking East Asians.
figure 5

For “1 million controls”, the x-coordinate represents the number of cases, assuming the study has 1 million controls. For “Equal number of cases and controls”, the x-coordinate represents the numbers of cases, assuming the same number of cases and controls.

Discussion

We conducted the largest GWAS of LUAD in an EA population to date and identified 12 novel susceptibility variants achieving genome-wide significance. In addition, two variants identified from a previous multi-ancestry meta-analysis achieved genome-wide significance as well in EA alone after we combined the reported summary data with our independent data. In total, including the previously described genetic variants, 28 variants at 25 loci have reached genome-wide significance for LUAD in EA populations, representing major progress in elucidating the genetic basis of LUAD. Finally, a multi-ancestry meta-analysis identified four additional loci in the combined EA and EUR populations, with consistent effects in both.

Our eQTL colocalization and TWAS analyses using an ancestry-matched lung eQTL dataset (EA population) identified novel LUAD susceptibility genes including FADS1 and ELF5. Importantly, FADS1 is regulated by sterol-response element-binding proteins (SREBPs)65, which govern lipid metabolism in alveolar type II (ATII) cells66. ELF5 is also expressed in tissues with glandular/secretory epithelial cells including salivary gland and lung67,68 and 3.2% of lung alveolar type II cells express ELF5 in GTEx single-cell expression data. Identification of FADS1 and ELF5 in our study suggests a role for alveolar lineage-specific genes and pathways in LUAD susceptibility. Notably, the missense variant (Ile131Thr), rs1130866, in SFTPB identified through the multi-ancestry analysis was a protein quantitative trait locus (pQTL) for SFTPB in blood69, where the LUAD risk-associated A allele (Ile131) is correlated with increased SFTPB levels. Importantly, the genomic region encompassing rs1130866 presents weak LD and high SNP density, consistent with the presence of a recombination hot spot70, and therefore fine-mapping inspecting low-frequency variants in the region is warranted. Our TWAS analyses using both ancestry-matched and ancestry-discordant lung eQTL datasets identified both common and unique genes from each dataset, highlighting potential benefits of an eQTL dataset of larger sample size and the importance of an ancestry-matched eQTL dataset, even at a smaller sample size, in detecting susceptibility genes.

We evaluated the presence of a gene-environment interaction with tobacco smoking in our EA data. We found that the association between a PRS (constructed by the lead variants at the 25 loci with genome-wide significance in EA) and LUAD in never-smokers was statistically significantly stronger than in individuals with a history of smoking (Fig. 4). This finding, together with our recent paper showing a stronger association of PRS for LUAD risk in non-coal users than in coal users71, provides evidence that genetic susceptibility may vary by exposure patterns in EA populations.

We systematically compared top GWAS findings that had been initially reported in one or the other or both populations. After accounting for differences in MAFs and statistical power as well as the local LD pattern of each locus (500 kb each side of the lead variant), we found that a substantial number of the associations initially reported in EA populations showed no signal in EUR populations. It might reflect causal variants for these loci not being tagged well in the EUR populations. This might also suggest important differences between EA and EUR in the genetic architecture of LUAD samples, which could be caused by differential environmental exposures. Finally, this observation is also consistent with distinct tumor molecular characteristics (e.g., EGFR mutation prevalence was higher in Asians than EUR populations) observed in LUAD suggesting different etiologies influenced by genetic and/or environmental factors13,72,73.

Our genetic architecture analysis suggested that LUAD is a highly polygenic disease. Expanding GWAS of LUAD will continue to identify many risk variants albeit with smaller effect sizes. Moreover, our analysis predicts that the AUC of PRS for EA never-smokers could be improved to 66.9% for a GWAS training dataset with 70,000 cases and 70,000 controls that could be further increased with a greater number of controls. Thus, an expanded GWAS in the future can lead to the substantial improvement in knowledge about the underlying genetic architecture of LUAD; increased understanding of how known or suspected lung cancer environmental risk factors interact with genetic susceptibility; and assessment of the potential clinical utility of risk models integrating both genetic and non-genetic risk factors74,75.

There are several limitations in the current study. First, the discovery phase included subjects of diverse EA populations (Mainland China 38.2%, Japan 45.9%) and the replication phase only included subjects from Japan. However, our data did not show evidence of heterogeneity in effect sizes for susceptibility variants between Han Chinese and Japanese populations or across geographic locations (Supplementary Table 5), suggesting a minimal impact for using a single EA population for replication. Second, we were underpowered to conduct formal heritability correlation analyses to compare the genetic architecture in EA and EUR populations stratified by smoking status; larger studies will be needed to conclusively characterize differences. Furthermore, completely elucidating the genetic basis of ancestry differences requires detailed information about age of onset, family history and exposures. Finally, rs4268071 (Table 2) achieved genome-wide significance in the discovery data but replication data were not available. While the significance was primarily driven by Japanese samples (MAF = 0.04 in Japanese and <1% in other populations), there was no evidence of heterogeneity in effect estimates across EA populations. Replication is warranted to further establish its etiological role.

In conclusion, we identified 12 novel variants in a GWAS of LUAD in EA populations as well as 4 novel variants in a multi-ancestry meta-analysis of EA and EUR populations. Colocalization and TWAS analyses using an ancestry-matched lung tissue eQTL dataset identified candidate susceptibility genes with suggested roles in alveolar lineage. At the same time, a large majority of variants identified in the EA GWAS showed no evidence of association in EUR populations. Larger samples sizes with data on environmental risk factors will be needed to further characterize the etiologic differences between these populations. Finally, our genetic architecture analysis suggests that the performance and the clinical utility of the PRS will be substantially improved by larger GWAS in the future.

Methods

Ethics statement

All participants provided informed consent according to protocols that were evaluated and approved by the internal review boards of the contributing centers. Protocols used to generate new, unpublished data presented in this paper were approved by the National Cancer Center Institutional Review Board, Japan and the Aichi Cancer Center Ethics Committee, Japan.

Overview of study

We conducted a two-phase GWAS meta-analysis of LUAD in EA populations, including Female Lung Cancer Consortium in Asia (FLCCA), Nanjing Lung Cancer Study (NJLCS)10,24, National Cancer Center of Japan (NCC) Research Institute and Aichi Cancer Center (ACC). For the FLCCA study, details of the study design, participating studies, case ascertainment, genotyping, and quality controls have been described in detail9. Briefly, this international consortium is composed of Asian women who never smoked and resided in Mainland China, Hong Kong, Singapore, Taiwan, South Korea and Japan at the time of recruitment. All were genotyped using the Illumina 660 W, 370 K and 610Q microarrays.

The NCC study included lung cancer patients from NCC and BioBank Japan (BBJ) and non-cancer controls from the Japan Public Health Center-based Prospective Study and the Japan Multi-Institutional Collaborative Cohort Study, genotyped by Illumina HumanOmniExpress and HumanOmni1-Quad genotyping platforms. The ACC study included lung cancer patients from the Aichi Cancer Center, Kyoto University, Okayama University and Hyogo College of Medicine and non-cancer controls from the Nagahama Study and the Aichi Cancer center. Samples were genotyped by Illumina 610k and Illumina660k platforms15,76. The NJLCS study at the Nanjing Medical University was based on meta-analysis of three studies: the Nanjing GWAS with subjects from Nanjing and Shanghai, the Beijing study with subjects from Beijing and Wuhan (genotyped by Affymetrix Genome-Wide Human SNP Array 6.0) and the Oncoarray GWAS10,77,78.

The replication study included cases from multiple sources (BBJ, NCC, Kanagawa Cancer Center, Akita University Hospital, Tokyo Medical and Dental University, Hospital and Gunma University Hospital, and Fukushima Medical University School of Medicine) and non-cancer controls from BioBank Japan. Cases were genotyped using the Invader assay and the control samples in BioBank Japan were genotyped using the Illumina HumanOmniExpress genotyping platform.

For the multi-ancestry meta-analyses of LUAD and cross-population comparison of top GWAS findings with both never-smokers and individuals with a history of smoking, we used 11,273 cases and 55,483 controls of European ancestry in the Integrative Analysis of Lung Cancer Etiology and Risk team of the International Lung Cancer Consortium (INTEGRAL-ILCCO)16 (Supplementary Table 8). For the multi-ancestry analysis and cross-population comparisons of smokers, we used European samples genotyped with the OncoArray platform in the ILCCO study (Supplementary Table 8). For the multi-ancestry and cross-population comparisons analysis of never-smokers, we used the GWAS of European never-smoking subjects from Hung et al.21.

Quality control, imputation and association analysis in EA populations

For each study, SNPs with minor allele frequency (MAF) < 0.01, Hardy-Weinberg Equilibrium (HWE) p value < 10−6 in controls were removed; subjects with missing rate >3%, sex discrepancy, or displaying non-East Asian ancestry based on principal component analysis scores were removed. Moreover, for any pairs of subjects estimated to be related with identity by descent pihat >0.10 using PLINK (V2.0), we removed one subject. Imputation was performed using IMPUTE2 and the 1000 Genomes Project East Asian samples (Phase 3) as reference. After imputation, SNPs with imputation quality score ≥ 0.5 were used for association analysis in each study. Logistic regression under an additive model was performed using SNPTest (V2) or PLINK2 based on imputed genotypic dosage data adjusting for smoking (if both smokers and never smokers were present) and PCA scores to control for population stratification. Meta-analysis was performed using inverse-variance weighted fixed effects methods. All p values were two-sided. We consider the following variants as novel for the GWAS in EA: (1) the lead variant with p < 5 × 10−8 in a locus that has not been previously reported in either EA or EUR populations, or (2) a secondary variant with p < 5 × 10−8 conditioning on the lead variant in a previously reported locus in either EA or EUR populations with the requirement that the LD R2 ≤ 0.2 between the secondary and the lead variants in both populations.

LDSC27 was used to estimate the heritability attributed to genome-wide common variants and to assess the potential inflation due to insufficient correction of population stratification. LDSC was also used to estimate the genetic correlation of LUAD between never-smokers and individuals with a history of smoking in each population. We used POPCORN60 to estimate the genetic correlation between EA and EUR populations because LD patterns are expected to be different. To account for the difference of allele frequencies in the two populations, we also used POPCORN to estimate the cross-population genetic-impact correlation that was defined as the correlation of population specific phenotypic variance explained by each SNP.

Conditional analysis and fine mapping

To identify independently associated SNPs at an established susceptibility locus, we performed conditional analysis using software Genome-wide Complex Trait Analysis (GCTA)79 based on the GWAS meta-analysis summary results of EA populations. LD for the conditional analysis was calculated using a reference population of 4544 controls from the FLCCA study to achieve a desirable accuracy. Here, genotypes for FLCCA were imputed using IMPUTE2 and the 1000 Genomes Project (Phase 3) reference samples with EA ancestry. SNPs with imputation quality <0.5 were excluded from the reference set for conditional analysis. Conditional analysis was restricted to 14 loci with lead SNPs achieving genome-wide significance in the discovery-phase meta-analysis. We did not perform conditional analyses for other new SNPs that did not achieve genome-wide significance in the discovery-phase meta-analysis because secondary SNPs would not survive multiple testing correction. Conditional analysis was restricted to SNPs less than 500 kb from the lead SNP of each locus. To identify multiple potentially independent SNPs in one locus, we performed stepwise conditional analysis using GCTA. All SNPs identified with P < 5 × 10−8 and the lead SNP of the locus were put into one model to derive the joint estimate of ORs, appropriately adjusting for LD among all SNPs. Only SNPs with p value < 5 × 10−8 in both conditional and joint analyses were considered to be independently associated SNPs.

For 11 out of the 14 loci with genome-wide significance in the discovery phase, we performed a Bayesian fine-mapping analysis using FINEMAP32 to nominate 95% credible set variants using the same set of imputed genotypes of 4544 FLCCA control subjects as an LD reference. We did not perform fine-mapping analysis for two loci in MHC regions, because of the complex and extensive LD patterns in this region. We also excluded the locus at 7q31 because the lead SNP, rs4268071, had MAF < 1% in our LD reference population. MAF of this variant is 4% in the Japanese populations (45.8% of cases and 74.5% of controls in the discovery set) but <1% in other EA populations included in our study. For FINEMAP analysis, we tested the variants within ±500 kb of the lead SNP and set the number of maximum causal variants as the number of independent signals (P ≤ 10−5) observed in the conditional analysis for each locus.

Proportion of familial risk explained

We considered a set of identified variants for LUAD. For SNP t, we defined \({p}_{t}\) as the frequency of the risk allele and \(O{R}_{t}\) as the estimated per-allele odds ratio. Under a multiplicative model, the fraction of the familial risk explained by the set of SNPs was calculated as \({\sum }_{t}\log ({\lambda }_{t})/\log ({\lambda }_{0})\), where \({\lambda }_{0}\) is the observed familial risk to the first degree of LUAD cases and \({\lambda }_{t}\) is the familial risk due to the \({t}^{{th}}\) SNP:

$${\lambda }_{t}=\frac{{p}_{t}O{R}_{t}^{2}+\left(1-{p}_{t}\right)}{{\left({p}_{t}O{R}_{t}+1-{p}_{t}\right)}^{2}}.$$
(1)

Heritability partitioning in functional classes and tissue-specific analyses

Stratified LD score regression (sLDSC)80 was conducted to identify functional annotations enriched for LUAD heritability using summary statistics from the discovery phase of meta-analysis in EA populations. In addition to the functional annotations provided by the sLDSC package, we also analyzed the gene sets defined by smoking studies: differentially expressed genes in peripheral blood mononuclear cells upon nicotine treatment (“PBMC nicotine” gene set) from Moyerbrailean et al.81, those in non-tumorous lungs between current- and never-smokers (“Lung smoking” gene set) from Bosse et al.82, and those in normal bronchial airway epithelial cells between current- and never-smokers (“Airway smoking” gene set) from Beane et al.83. An annotation was considered to be significantly enriched for LUAD heritability if FDR < 0.05.

We then performed sLDSC to prioritize relevant tissue types (lung, blood/immune, and brain/CNS) using tissue-specific expressed genes from GTEx v6p (53 tissue types) and other public expression datasets (152 tissue types), as well as tissue-specific chromatin annotations from EnTEX (111 annotations in 26 tissue types) and Roadmap dataset (378 annotations in 85 tissue types) as described by Finucane and colleagues36. We used GTEx v6p expression data based on a comparison with v8 data, where a median of 83% of tissue-specific differentially expressed genes were shared between two versions. In general, we did not find significant enrichment for individual annotations after adjusting for the multiple testing. To increase the power of prioritizing relevant tissues (lung, blood/immune, and brain/CNS), we performed an aggregated analysis to test if p values from one tissue (e.g., lung) tended to be smaller than those from the other two tissue groups (blood/immune, and brain/CNS) using the Wilcoxon rank test.

eQTL colocalization analysis and TWAS

EA lung eQTL dataset is based on a cohort of 115 never-smoking LUAD patients from Taiwan, referred to as LCTCNS (Lung cancer tissue cohort of never-smokers). Expression array data was obtained for non-tumor lung tissues of these patients using the Illumina WG-DASL HumanRef-8 v3 or HumanHT-12 v4 BeadChip (Illumina Inc.) (Gene Expression Omnibus accession number GSE46539)84. Genotype data from buffy coat DNA was obtained using the Illumina Human660W-Quad BeadChip. A systematic quality control for the genotype data was performed as previously described12 (SNPs were excluded if call rate <90%, MAF < 5%, or P < 0.0001 based on the Hardy-Weinberg equilibrium test. Samples were excluded if call rate <90%, sex discrepancies based on the X chromosome heterozygosity, contaminated samples with high heterozygosity scores, or first or second- degree relatives), and imputation was carried out using Minimac4 (V4.0.3) with the 1000 Genomes reference set (all populations). For eQTL analysis, expression data was processed for background correction as previously described84. Briefly, we kept the probes that are present in both the BeadChip platforms and further removed those with low expression levels (detection p > 0.05). Based on the data at the remaining 24,216 probes, we applied model-based background correction. Log2-transformed expression levels of 24,216 probes were then used to obtain 20 latent factors based on probabilistic estimation of expression residuals (PEER) while specifying batch, sex, age, medical operation status, RNA integrity number, and RNA input quantity as known confounders. The expression residuals from PEER were then inverse rank transformed to the standard normal distribution (the inverse rank transformed residuals) and were used as the dependent variable in the expression levels for eQTL analysis. eQTL analysis was conducted for 29 GWAS lead SNPs (all EA loci including discovery, replication, and conditional signals plus new loci from the multi-ancestry GWAS). In LCTCNS, all these SNPs have a MAF of >0.01. For each GWAS lead SNP, its association with each probe located within ±500 kb of the SNP was tested using an additive linear model where the dependent variable was the expression level as described above and the independent variable was the effect allele count. Based on the resulting p values of these eQTL analyses for all 29 SNPs, the corresponding Benjamini–Hochberg FDR was calculated. Colocalization analysis was performed using eCAVIAR37 and HyPrColoc38 via ezQTL platform for eight GWAS lead SNP-eQTL gene pairs displaying FDR < 0.05 in LCTCNS (Supplementary Data 5). For each of these eight SNP-probe pairs, we further examined the association between the probe and SNPs within ±100 kb of the lead SNP using Matrix eQTL to obtain the summary statistics as an input to ezQTL for colocalization analysis using HyPrColoc and eCAVIAR. For loci on MHC regions, ±10 kb window was used for computational efficiency of colocalization analyses. LD matrix was obtained from 1000 Genomes EA populations. For HyPrColoc, posterior probability of >0.7 was used as a cutoff for colocalization. For eCAVIAR analysis, colocalization posterior probability (CLPP) score > 0.01 was used as a cutoff for colocalization.

For TWAS, we adopted FUSION85 using LCTCNS or GTEx v8 lung eQTL data and summary statistics of EA discovery GWAS meta-analysis. We computed weights using the elastic-net regression model for 24,216 expression probes (LCTCNS) or 24,687 genes (GTEx v8 lung) and cis-SNPs within 500 kb of the gene for each probe. LD matrix was obtained from 1000 Genomes EA populations. We performed association analysis for 1875 expression probes (LCTCNS) or 5534 genes (GTEx v8 lung) with cross-validation cutoff of R2 > 0.05 based on the elastic-net model. We defined a significant transcriptome-wide association as TWAS P < 2.6 × 10−5 (0.05/1875; LCTCNS) or P < 9 × 10−6 (0.05/5534; GTEx v8 lung) based on Bonferroni correction. For two loci passing this cutoff from LCTCNS analysis (ELF5 and FADS1), we further performed conditional analysis as implemented in FUSION by conditioning the GWAS signal on the predicted expression of the probe with the best TWAS P value.

Mendelian randomization

We performed MR analysis to investigate the potential causal relationship between telomere length and the risk of LUAD. MR analysis was based on 46 common SNPs identified in a recent multi-ancestry meta-analysis of telomere length in the TOPMed54 study. The original paper identified 48 variants associated with telomere length that collectively explained 4.35% of telomere length variance; two of them at the TERT locus were excluded using the LD filter R2 < 0.05 that together explained 0.61% of the telomere length variance; the remaining 46 variants included in our MR analysis explained 3.74% of telomere length variance. Because there was no significant heterogeneity of effect sizes on telomere length across populations (Table S4 in Taub et al.54), the primary MR analyses were based on the estimated effect sizes combining all samples in the TOPMed study in a joint regression model for telomere length. Analyses were based on MR-PRESSO86, a powerful and robust approach designed to deal with widespread horizontal pleiotropy. This approach uses a formal testing framework to (1) detect the presence of horizontal pleiotropy, (2) detect variant outliers, (3) evaluate distortion, and (4) re-estimate causal effect sizes after removing potentially problematic variants. According to simulations, this approach is best suited when horizontal pleiotropy occurs in <50% of instruments. This approach identified 5–7 outlier variants in our data. The estimated \(\beta\) from MR analysis was converted as OR, interpreted as risk increase per standard deviation (640 base pairs87) increase of the geneticly predicted telomere length.

Testing the interaction between polygenic risk score and smoking status

We investigated whether the PRS, which was calculated based on 25 independent SNPs associated with LUAD in EA populations (Supplementary Table 4, excluding three variants identified by conditional analysis), interacted with smoking status for LUAD risk. Because we have only GWAS summary statistics instead of individual-level data for smokers and never-smokers, we developed a statistical method for testing the interaction using summary statistics separately from smokers and never-smokers. Suppose that we have \({n}^{1+}\) smoking cases, \({n}^{0+}\) never-smoking cases, \({n}^{1-}\) smoking controls and \({n}^{0-}\) never-smoking controls. Let \({x}_{{it}}^{s+}\) and \({x}_{{jt}}^{s-}\) be the genotype of SNP \(t\) for the \({i}^{{th}}\) case and the \({j}^{{th}}\) control, where \(s=1\) indicates smokers and 0 indicates never-smokers. Given smoking status \(s\), we define \({{PRS}}_{i}^{s+}={\sum }_{t=1}^{T}{{\beta }_{t}x}_{{it}}^{s+}\) and \({{PRS}}_{j}^{s-}={\sum }_{t=1}^{T}{{\beta }_{t}x}_{{jt}}^{s-}\) as the PRS for cases and controls, respectively. For smokers (\(s=1\)), the association between PRS and disease risk can be quantified as:

$${\Delta }_{1}=\frac{1}{{n}^{1+}}\mathop{\sum }_{i=1}^{{n}^{1+}}{{PRS}}_{i}^{1+}-\frac{1}{{n}^{1-}}\mathop{\sum }_{j=1}^{{n}^{1-}}{{PRS}}_{j}^{1-},$$
(2)

the difference of average PRS between cases and controls. Similarly, we define \({\Delta }_{0}\) to be the difference of average PRS between cases and controls for never-smokers. Testing the PRS*smoking interaction can be done using \(Z=\frac{{\Delta }_{1}-{\Delta }_{0}}{\sqrt{{{{{{\rm{var}}}}}}({\Delta }_{1}^{2})+{var}({\Delta }_{0}^{2})}}.\) Under the null hypothesis of no interaction for all variants, \(Z \sim N({{{{\mathrm{0,1}}}}})\) asymptotically. Assuming SNPs are independent, we derive \(Z={\sum }_{t=1}^{T}\left({w}_{t}^{1}{z}_{t}^{1}-{w}_{t}^{0}{z}_{t}^{0}\right),\) where \({z}_{t}^{s}\) is the z-score for testing association for SNP \(t\) in subjects with smoking status \(s\). The weight is given as

$${w}_{t}^{s}=\frac{{\beta }_{t}\sqrt{\frac{{\left({\sigma }_{t}^{s+}\right)}^{2}}{{n}_{+}^{s}}+\frac{{\left({\sigma }_{t}^{s-}\right)}^{2}}{{n}_{-}^{s}}}}{\sqrt{{\sum }_{t=1}^{T}{\beta }_{t}^{2}\left(\frac{{\left({\sigma }_{t}^{1+}\right)}^{2}}{{n}_{+}^{1}}+\frac{{\left({\sigma }_{t}^{1-}\right)}^{2}}{{n}_{-}^{1}}+\frac{{\left({\sigma }_{t}^{0+}\right)}^{2}}{{n}_{+}^{0}}+\frac{{\left({\sigma }_{t}^{0-}\right)}^{2}}{{n}_{-}^{0}}\right)}}.$$
(3)

Here, \({\left({\sigma }_{t}^{s+}\right)}^{2}\) and \({\left({\sigma }_{t}^{s-}\right)}^{2}\) are the genotypic variances for SNP \(t\) in cases and controls, respectively.

We note that both discovery and replication data are included for testing PRS smoking interaction novel variants included in our PRS to maximize the power of statistical testing. In particular, only the discovery data were available and included for previously identified variants; both discovery and replication data were included for new variants to increase the statistical power. To do this, \({w}_{t}^{s}\) was modified to have SNP-specific sample sizes. All analyses were done using R (x64 4.1.0).

GENESIS analysis for projecting yield of future expanded studies

The genetic architecture of a disease is defined as the number of susceptibility SNPs and the distribution of their effect sizes26. When these parameters are estimated, one can estimate the number of variants achieving genome-wide significance and the accuracy of a polygenic risk model trained using a GWAS with a given sample size. In the current study, we estimated the genetic architecture using GENESIS (GENetic EStimation and Inference in Structured samples)26 based on the GWAS summary statistics with LD scores calculated based on the genotypes of the subjects of EA ancestry in the 1000 Genomes Project. Since GENESIS requires a large sample size to derive reliable estimates, we performed analysis only for never-smokers in EA. The three-component model \({\beta }_{m} \sim \pi {p}_{1}N\left(0,\,{\sigma }_{1}^{2}\right)+\pi {p}_{2}N\left(0,\,{\sigma }_{2}^{2}\right)+\left(1-\pi \right){\delta }_{0}\) best fit the never-smoker data in EA, where \({\beta }_{m}\) represents effects sizes, \(\pi\) denotes the fraction of truly associated variants in the genome, \({\delta }_{0}\) denotes the point mass at zero, \({\sigma }_{i}^{2}\) denotes the variance of effect sizes for the \({i}^{{th}}\) component, \(\pi {p}_{i}\) (\(i={{{{\mathrm{1,2}}}}}\)) represents the fraction of variants with effect size following \(N\left(0,\,{\sigma }_{i}^{2}\right)\). Based on this estimated genetic architecture, we calculated the expected number of variants reaching genome-wide significance for a given GWAS and calculated the expected area under the receiver operating characteristic curve (AUC) for an additive polygenic risk prediction model built based on a discovery GWAS for a given sample size. The uncertainty of the AUC was induced by the uncertainty in the estimated parameters in GENESIS (\(\Gamma=(\pi,\,{p}_{1},\,{p}_{2},\,{\sigma }_{1}^{2},\,{\sigma }_{2}^{2})\)) because of the limited sample size in our summary data. We used a resampling approach to estimate the standard error of AUC. Briefly, we randomly simulated 1000 sets of parameters \({\varGamma }^{k}\) given the estimated \(\hat{\Gamma }\) and the estimated covariance matrix, and calculated AUCk for each simulated parameter \({\varGamma }^{k}\) for a given sample size. The standard error was calculated based on the 1000 sets of AUC values.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.