Abstract

Background. Diffuse large B cell lymphoma (DLBCL) is a life-threatening malignant tumor characterized by heterogeneous clinical, phenotypic, and molecular manifestations. Given the association between immunity and tumors, identifying a suitable immune biomarker could improve DLBCL diagnosis. Methods. We systematically searched for DLBCL gene expression microarray datasets from the GEO database. Immune-related genes (IRGs) were obtained from the ImmPort database, and 318 transcription factor (TF) targets in cancer were retrieved from the Cistrome Cancer database. An immune-related classifier for DLBCL prognosis was constructed using Cox regression and LASSO analysis. To assess differences in overall survival between the low- and high-risk groups, we analyzed the tumor microenvironment (TME) and immune infiltration in DLBCL using the ESTIMATE and CIBERSORT algorithms. WGCNA was applied to study the molecular mechanisms explaining the clinical significance of our immune-related classifier and TFs. Results. Eighteen IRGs were selected to construct the classifier. The multi-IRG classifier showed powerful predictive ability. Patients with a high-risk score had poor survival. Based on the AUC for three- and five-year survival, the classifier exhibited better predictive power than clinical data. Discrepancies in overall survival between the low- and high-risk score groups might be explained by differences in immune infiltration, TME, and transcriptional regulation. Conclusions. Our study describes a novel prognostic IRG classifier with strong predictive power in DLBCL. Our findings provide valuable guidance for further analysis of DLBCL pathogenesis and clinical treatment.

1. Introduction

Diffuse large B cell lymphoma (DLBCL) accounts for 30–58% of all diagnosed non-Hodgkin lymphomas. The incidence of DLBCL in Europe is approximately 3.8/100,000 per year [13]. DLBCL is an intricate and aggressive tumor with heterogeneous phenotypic, clinical, and molecular manifestations [46]. Following existing chemotherapeutic approaches, the survival rate of patients with DLBCL has increased to 50–60%; however, nearly 40% of patients do not benefit fully due to the heterogeneous nature of this malignancy [7]. Hence, there is a strong need for the identification of new biomarkers to be applied in direct clinical therapy and to improve DLBCL prognosis [8, 9]. Based on advances in gene sequencing technology and numerous gene expression profiles, the prognosis in DLBCL patients has been closely associated with the tumor microenvironment (TME) [1012]. The TME is composed of immune cells, inflammatory mediators, mesenchymal cells, and extracellular matrix molecules [1217]. A fraction of infiltrating immune cells influences the growth and progression of tumors, defining a patient prognosis [18, 19]. At the same time, gene expression affects a variety of immune cells infiltrated in the TME [2022]. In recent years, a novel immunotherapy approach based on kinase inhibitors that target B cell receptor signaling as well as PD-L1 and PD-1 blockade has proven successful against DLBCL [11, 2326]. However, only a few DLBCL patients benefited from these new therapies, whereas others exhibited little or no response. Hence, analyzing the association between immune-related genes (IRGs) and overall survival (OS) may reveal the prognostic value of IRGs and novel biomarkers. The ESTIMATE algorithm uses gene expression signatures to quantify the infiltration of stromal and immune cells in tumor samples [26]. The CIBERSORT deconvolution algorithm uses gene expression profiles to detect immune cell phenotypes in bulk tumor samples with complex cell types [27]. Based on the link between DLBCL and immune cells, our study is aimed at constructing an IRG classifier capable of predicting the outcome of DLBCL patients. ESTIMATE and CIBERSORT were employed to assess the role of the TME in DLBCL.

2. Materials and Methods

2.1. DLBCL Datasets and Preprocessing

We systematically searched for DLBCL gene expression microarray datasets from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) database of publicly available clinical annotations. Illumina gene expression profiles were obtained using Illumina HumanRef-8 WG-DASL v3.0 for one cohort of samples (GSE32918), and Affymetrix gene expression profiles based on Affymetrix Human Genome U133 Plus 2.0 (HG-U133 Plus_2.0) were obtained for two cohorts (GSE10846 and GSE31312). The following steps were applied for dataset screening. (i) The raw CEL files from Affymetrix datasets were subjected to the robust multiarray average algorithm in Affy software [28] to perform background correction and quantile normalization. Moreover, oligonucleotides per transcript were summed up with the median polish algorithm [29]. The Illumina matrix files were subjected to quantile normalization using Lumi software. (ii) The HG-U133 Plus_2.0 probes were annotated using the hgu133plus2.db package. The Illumina HumanRef-8 WG-DASL v3.0 probe annotation sequences were obtained from the GPL8432 Platform (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL8432). (iii) For multiple probes corresponding to the same gene, we used the genes with the largest average value. (iv) Complete gene expression profiles and follow-up information on patients were provided. As a result, 1022 DLBCL samples were selected, including 470, 140, and 412 samples obtained from the studies by Visco et al. [30], Barrans et al. [31], and Lenz et al. [12], respectively.

2.2. IRG Retrieval

We obtained a list of IRGs from the ImmPort database (https://www.immport.org/home). Different IRGs including chemokines, cytokines, and genes relevant to the immune response were reserved. Overlapping genes between GSE31312, GSE32918, and IRGs were selected for further study. The 470 samples from GSE31312 were selected as the training set for model development, whereas GSE32918 and GSE10846 were chosen as validation sets.

2.3. Univariate Cox Regression and Least Absolute Shrinkage and Selection Operator (LASSO) Analysis

To identify survival-associated IRGs from the 1170 overlapping ones, we performed univariate Cox regression analysis on continuous variables in cohorts GSE31312 and GSE32918. Only IRGs with a significant value of were selected as survival-related IRGs. Based on 349 candidates from the GSE31312 cohort and 71 from the GSE32918 cohort, 25 overlapping genes were identified as survival-associated IRGs. The GSE31312 expression data of the 25 IRGs were integrated into LASSO regression to identify prognostic signatures. Specifically, the LASSO regression model was built into a multi-IRG-based classifier (containing 18 IRGs) to predict the OS of patients in the training set. LASSO regression was carried out using the “glmnet” package in R. To estimate the prognostic significance of the IRG classifier, we used the “survivalROC” package in R, thereby obtaining the receiver operator characteristic (ROC) curves of the three cohorts. The area under the curve (AUC) at different times was calculated and compared to validate the performance of the classifier. Patients were classified into the low- and high-risk groups by the median risk score of each series. Kaplan-Meier (KM) survival curves were used to analyze the survival of DLBCL patients.

2.4. Development and Validation of a Nomogram for Prognosis Prediction in DLBCL Patients

Next, we developed a clinically applicable method to predict the prognosis of individuals with DLBCL. Based on the results of multivariate Cox regression analysis for OS, we integrated the risk score and other clinicopathological covariates to build a nomogram in the GSE31312 cohort. Predictive factors included the risk score, age, sex, subtype, stage, ECOG performance status, and the number of extranodal sites. The nomogram was verified in the GSE10846 cohort using calibration and ROC curves.

2.5. Comparison of Infiltration of Stromal and Immune Cells between Low- and High-Risk Groups

To explore potential mechanisms underlying variations in OS between the low- and high-risk groups, we used ESTIMATE to obtain a microenvironment score for each sample of the three cohorts [26]. The difference in microenvironment between the two groups was analyzed using the Wilcox test [32]. Finally, we separately plotted KM curves for the low- and high-immune score groups in the three cohorts according to their cutoff values. The cutoff values were determined using the “survminer” package in R based on the relationship between patient OS and immune score in each independent cohort. The R package “MaxStat,” which iteratively tests all probable cutoff points to determine the value that achieves the maximum rank statistic, was applied to dichotomize the immune score and allocate patients to the low- and high-immune score groups in each cohort and thus diminish the computational batch effect.

2.6. Comparison of Leukocyte Subtypes between Low- and High-Risk Groups

To calculate the fraction of immune cells in the two groups of DLBCL patients, we employed the LM22 gene signature, which distinguished 22 human immune cell phenotypes, including B cells, natural killer cells, T cells, myeloid subsets, and macrophages [27]. The results were analyzed by CIBERSORT with a perm value set to 1000; patients with were excluded from further investigation. The Wilcox test was used to analyze the difference in leukocyte subtype between the two groups. Furthermore, the correlation among the 22 leukocyte subtypes in DLBCL patients was calculated. Only correlation coefficients whose absolute value was greater than 0.15 were considered significant.

2.7. Weighted Correlation Network Analysis (WGCNA) and Transcription Factor (TF) Regulatory Network

To reveal the potential molecular mechanisms associated with the clinical significance of our IRG classifier, we explored the regulatory mechanisms of the 18 IRGs included in the classifier. First, the GSE31312 expression profiles of 318 TFs were subjected to WGCNA. We defined the first principal component of every gene module by its eigengenes, which we regarded as representative TFs. Gene significance (GS) represented the correlation between module memberships and clinical characteristics and was defined as the median value () of every TF. Risk score-correlated modules were defined by , and a higher GS value was selected for further study. Furthermore, to calculate the correlations between TFs of the highly related modules and our classifier IRGs, only genes with absolute values of and were retained and built into a network using Cytoscape. Finally, functional enrichment analyses based on gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathways were performed to examine the potential molecular mechanism of these genes.

2.8. Statistical Analysis

All analyses were carried out in R version 3.6.1 and corresponding packages.

3. Results

3.1. Construction of the Prognostic Multi-IRG Classifier

Univariate Cox regression analysis of the 1170 IRGs overlapping between GSE31312, GSE32918, and IRG datasets (Figure 1(a)) identified 349 IRGs from GSE31312 and 71 IRGs from GSE32918 as survival-associated IRGs based on (Figure 1(b)). Twenty-five survival-associated IRGs were found to overlap between these two cohorts and were integrated into LASSO regression to identify prognostic biomarkers (Figures 1(c) and 1(d)). Finally, 18 IRGs were selected to build the multi-IRG classifier (Table 1). Based on the LASSO coefficients and expression levels of these 18 IRGs, we determined the median risk score for every patient in the training cohort and allocated them to either the low- or high-risk groups (Figures 1(e) and 1(f)). Based on KM survival analysis, we found that patients with high-risk scores had markedly shorter OS than those in the low-risk group (, Figure 2(a)). To examine the prognostic significance of the classifier, we introduced ROC curves. The AUC in the training cohort was 0.741 and 0.758 for three- and five-year survival, respectively (Figure 2(b)).

3.2. Validation of the Multi-IRG Classifier for Survival Prediction in External Cohorts

To validate the classifier, KM survival analysis and ROC curves were generated for the GSE32918 and GSE10846 cohorts. In the GSE32918 cohort, high-risk patients tended to have shorter survival (, Figure 2(c)), and the AUC at three and five years highlighted the predictive prognostic power of the multi-IRG classifier (Figure 2(d)). A similar result was obtained with the KM survival (, Figure 2(e)) and ROC (Figure 2(f)) curves of the GSE10846 cohort. Furthermore, the AUC at three and five years revealed that the IRG-based classifier had better predictive power than other clinical characteristics (Figures 2(g) and 2(h)).

3.3. Univariate and Multivariate Cox Regression Analyses of Prognostic Factors and OS in DLBCL Patients

In the GSE31312 cohort and two testing sets, we comprehensively considered clinical characteristics, such as subtype, age, gender, and stage, to perform univariate and multivariate Cox regression analyses. This allowed us to determine the significance of the multi-IRG classifier on OS. The multi-IRG classifier was found to be an independent prognostic tool for DLBCL (Figures 3(a) and 3(b), GSE31312; Figures 3(c) and 3(d), GSE32918; and Figures 3(e) and 3(f), GSE10846). A two-tailed Fisher test revealed a remarkably different chemotherapeutic response between the low- and high-risk groups. Specifically, the high-risk group tended to have a poor response and worse prognosis than the low-risk group (, Figure 3(g)).

3.4. Development and Validation of a Nomogram

To predict the individual probability of OS and help clinicians provide better care to DLBCL patients, a nomogram that integrated the multi-IRG classifier and other clinical features was constructed (Figure 4(a)). A calibration curve was generated to evaluate the accuracy of the nomogram. The combined nomogram performed well in predicting the three- and five-year survival rates of patients from the GSE31312 (Figures 4(b) and 4(c)) and GSE10846 (Figures 4(d) and 4(e)) cohorts, with a prediction probability close to the observed one (Figures 4(b) and 4(e)). In the training cohort, the AUC was 0.799 and 0.823 for three- and five-year survival, respectively. The nomogram performed far better than stage, subtypes, ECOG, and other clinicopathological features when determining a patient’s prognosis (Figures 4(f)4(i)). Even when limited information about the GSE32918 cohort, such as age and subtype, was combined with our multi-IRG classifier, the prognostic accuracy of DLBCL was substantially improved (Figures 4(j) and 4(k)). Thus, the multi-IRG classifier provided additional prognostic value to existing clinicopathological predictors of DLBCL.

3.5. Different Immune Infiltration Scores and Leukocyte Subtypes Define Low- and High-Risk Groups

We introduced the ESTIMATE algorithm to determine the immune and stromal scores of the three cohorts (Figure 5). In the training cohort, the high-risk score group tended to have a lower immune infiltration score than the low-risk group (Wilcox test , Figure 5(a)). The results were consistent with the GSE32918 (Wilcox test , Figure 5(c)) and GSE10846 (Wilcox test , Figure 5(e)) cohorts. Based on the cutoff values of the immune score, we separately performed KM survival analysis on the low- and high-immune score groups (Figure 5(b), GSE31312; Figure 5(d), GSE32918; and Figure 5(f), GSE10846). The results confirmed a reduced survival in the low-immune score group compared to the high-immune score group, thus indirectly verifying the poor prognosis of the high-risk score group. Next, the CIBERSORT algorithm was applied to evaluate the proportions of 22 leukocyte subtypes in the three DLBCL cohorts (Figure 6). The low- and high-risk score groups consisted of distinct immune cell types. Memory B cells, naïve B cells, CD4 memory-activated T cells, CD8 T cells, follicular helper T cells, and M2 macrophages accounted for a considerable proportion of the DLBCL immune cell infiltration.

3.6. Associations among TME Components in DLBCL Patients

To explore the potential associations among TME components in DLBCL patients, we performed correlation tests on 22 infiltrating immune cells in DLBCL (Figure 7). In the GSE31312 and GSE10846 cohorts, M1 macrophages, memory B cells, resting mast cells, CD4 memory-activated T cells, resting NK cells, CD8 T cells, and gamma delta T cells were at the core of the correlation network (Figures 7(a) and 7(c)). The correlation heatmap suggested that CD4 memory-activated T cells correlated positively with gamma delta T cells and negatively with memory B cells (Figures 7(b) and 7(d)). Additionally, there was a strong positive correlation between resting mast cells and activated NK cells. However, resting mast cells correlated negatively with memory B cells (Figures 7(a)7(d)). Combining the ESTIMATE and CIBERSORT algorithms, we found that the high-immune cell infiltration group in DLBCL presented higher immune and stromal scores, but lower tumor purity than the low-immune cell infiltration group (Figure 7(e)).

3.7. TF Regulatory Network

TFs were integrated into WGCNA to explore the potential regulatory mechanisms related to the clinical significance revealed by the multi-IRG classifier. Based on the scale-free () and average linkage hierarchical clustering, we determined the soft-thresholding power and identified six TF modules (Figures 8(a)8(c)). The yellow module, containing 15 TFs, exhibited the strongest association with the risk score of DLBCL patients. The green and turquoise modules, containing 14 and 141 TFs, respectively, were also closely connected with the patient risk score (Figures 8(d)8(h)). TFs of the three modules were correlated with the 18 IRGs used to construct the classifier. Based on filtering criteria (absolute values of and ), we built a regulatory network in Cytoscape, which clearly demonstrated the regulatory relationships among these IRGs (Figure 9(a)). PTPRC, PSMD14, FABP5, GDF2, STC2, S100A11, and BTC represented the seven hub genes. Functional enrichment analysis identified the following GO biological processes: regulation of hemopoiesis, connective tissue development, and regulation of growth (Figure 9(b)). The KEGG pathways were significantly enriched in transcriptional misregulation in cancer, Epstein-Barr virus infection, human papillomavirus infection, Th17 cell differentiation, thyroid hormone signaling pathway, inflammatory bowel disease, and Kaposi sarcoma-associated herpesvirus infection (Figure 9(c)).

4. Discussion

As the most common subtype of non-Hodgkin lymphoma, DLBCL is an aggressive and heterogeneous tumor. Substantial progress has been made in the immunotherapy of DLBCL [3335]. Several studies have demonstrated that the TME influences growth and progress of tumor cells and is associated with patient prognosis [10, 11, 26, 36]. The identification of immune infiltrating components could improve patient prognosis and result in biomarkers for predicting the outcome of DLBCL patients receiving immunotherapy or other treatments [10, 3740].

Here, we developed and validated a novel tool for the prognostic stratification of patients with DLBCL into the low- and high-risk groups. The proposed risk score provided additional prognostic value to existing clinicopathological predictors of DLBCL. This is the first study to demonstrate the clinical utility of the multi-IRG signature as a prognostic tool in patients with DLBCL. Moreover, we propose a prognostic nomogram that allows for individualized estimation of three- and five-year OS probability among patients. The multi-IRG classifier and the associated nomogram may improve surveillance and guide decision-making regarding the administration of adjuvant chemotherapy and treatment duration.

The present study has identified various hub IRGs associated with OS in DLBCL patients. PTPRC (also known as CD45), a member of the protein tyrosine phosphatase (PTP) family, is involved in oncogenic transformation, cell growth, and differentiation [41]. This protein is an essential regulator of T and B cell antigen receptor signaling and is associated with a variety of cancers, including multiple myeloma, acute myeloid leukemia, and Barrett’s cancer [4244]. However, its role in DLBCL remains poorly explored [45]. This study suggests that PTPRC exerts a protective effect in patients with DLBCL, possibly through transcriptional regulation. PSMD14 encodes a component of the 26S proteasome, which is involved in apoptosis, cell cycle, and DNA damage repair [46, 47]. S100A11 is a member of the S100 family of proteins and plays an important role in intracellular calcium signaling. Altered expression and rearrangements of S100A11 have been implicated in tumors [48, 49]. Chan found that S100 expressed by antigen-presenting cells in patients with DLBCL was associated with a high survival rate [40]. FABP5 belongs to the class of fatty acid-binding proteins (FABPs) and binds to retinoic acid. This binding functions as an apoptotic as well as a differentiation signal in transformed cells [50, 51]. FABP levels are elevated in carcinomas, neoplastic skin cells, and gliomas, which are highly resistant to apoptosis [5254]. Apoptosis-resistant (BCL-2-expressing) cells are known to alter the ability of retinoic acid to induce apoptosis [55]. Therefore, exploring the association between FABP5 and BCL-2 might reveal the mechanistic role of BCL-2 in DLBCL and point to potential therapeutic biomarkers. GDF2 (also known as BMP-9), a member of the transforming growth factor β superfamily, promotes the proliferation and migration of cancer cells [56]. Cyclin-dependent kinase 4 (CDK4), a crucial player in cell cycle progression, is associated with DLBCL. The CDK4 inhibitor abemaciclib strongly suppresses cell proliferation and induces apoptosis in DLBCL [57]. The present results point to a strong positive correlation between CDK4 and the TFs KZF1 and NCAPG. Identifying the mechanism underlying this relationship could accelerate the development of a suitable treatment. Functional enrichment analysis indicated that these genes might be involved in various pathways related to cancer, including Epstein-Barr virus infection, human papillomavirus infection, and Th17 cell differentiation. Th17 cells may affect patient prognosis and exert antitumor immune effects during the occurrence and progression of DLBCL. These cells and related cytokines interact with other immune cells in the TME to provide direct or indirect antitumor immunity [5861].

Assessing the immune microenvironment using the ESTIMATE algorithm, we discovered that the degree of immune infiltration influenced OS in DLBCL patients in both the training and validation cohorts. This indicated that immune infiltration was closely associated with the outcome of DLBCL. Immune cells of the TME have been proposed for the prognostic assessment of certain cancers, such as melanoma, gastric cancer, liver cancer, and DLBCL [10, 6265].

Based on the multi-IRG classifier, DLBCL patients were allocated to the low- and high-risk score groups. KM analysis indicated that the prognosis was better in the low-risk group than in the high-risk group, in both the training and validation cohorts. The Wilcox test showed that both stromal and immune scores differed significantly between the low- and high-risk groups. Specifically, the scores were higher in the low-risk group, which was related to greater immune infiltration and a better outcome. Therefore, we believe that the immune infiltration and OS of DLBCL patients might be influenced by the comprehensive expression of these 18 IRGs.

According to the CIBERSORT algorithm, memory B cells, naïve B cells, CD4 memory-activated T cells, CD8 T cells, follicular helper T cells, and M2 macrophages were the main infiltrating immune cells in DLBCL patients of the three cohorts. The Wilcox test indicated that synthetic expression of the 18 IRGs could lead to distinct immune cell infiltration types in DLBCL patients. The proportion of CD4 memory-activated T cells and follicular helper T cells was relatively high in the low-risk group. In contrast, memory B cells and naïve B cells were relatively more abundant in the high-risk group. This discrepancy suggests that the shorter OS of the high-risk group might result from an imbalance between these four immune cell types. CD4 memory activation in T cells correlated negatively with memory B cells. We speculate that infiltrated memory B cells in the high-risk group may somehow inhibit the infiltration of CD4 memory-activated T cells, thus contributing to the different prognosis between the low- and high-risk score groups.

Finally, we integrated multiple IRGs to construct a novel prognostic classifier via LASSO regression analysis, an unprecedented approach in DLBCL. The proposed risk score could complement existing clinicopathological predictors of DLBCL. The classifier was confirmed to have good prediction performance in the validation cohorts. Notably, few studies have applied both ESTIMATE and CIBERSORT algorithms to explore immune infiltration in DLBCL. However, the lack of further experiments investigating the proportion and specific cell types of immune infiltration is a limitation of the present study.

In conclusion, the multi-IRG classifier can effectively allocate patients with DLBCL to groups with different risks. Accordingly, IRGs may complement traditional clinicopathological risk factors to generate a comprehensive prognostic tool. The proposed nomogram incorporating the risk score and existing clinical prognostic predictors might facilitate personalized follow-up and management of patients with DLBCL.

Data Availability

The publicly available datasets were analyzed in this study. Data sets used in this study could be downloaded from NCBI Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) under the accession numbers GSE31312, GSE10846, and GSE32918. Illumina gene expression profiles were obtained using Illumina HumanRef-8 WG-DASL v3.0 for one cohort of samples (GSE32918), and Affymetrix gene expression profiles based on Affymetrix Human Genome U133 Plus 2.0 (HG-U133 Plus_2.0) were obtained for two cohorts (GSE10846 and GSE31312). The following steps were applied for dataset screening. (i) The raw CEL files from Affymetrix datasets were subjected to the robust multiarray average algorithm in Affy software [28] to perform background correction and quantile normalization. Moreover, oligonucleotides per transcript were summed up with the median polish algorithm [29]. The Illumina matrix files were subjected to quantile normalization using Lumi software. (ii) The HG-U133 Plus_2.0 probes were annotated using the hgu133plus2.db package. The Illumina HumanRef-8 WG-DASL v3.0 probe annotation sequences were obtained from the GPL8432 Platform (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL8432). (iii) For multiple probes corresponding to the same gene, we used the genes with the largest average value. (iv) Complete gene expression profiles and follow-up information on patients were provided.

Conflicts of Interest

The authors declare that no competing interest exists.

Authors’ Contributions

Xiao-Jie Liang, Rui-ying Fu, and He-nan Wang contributed equally to this work and share first authorship.

Acknowledgments

This work was supported by grants from the National Natural Science Foundation of China (grant No. 81873450) and the Open Research Fund from Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beijing Tongren Hospital, Beihang University, and Capital Medical University (grant No. BHTR-KFJJ-202009) to Liang Wang.