Introduction

Gastrointestinal stromal tumors (GISTs), which often occur in the stomach or small intestine, are the most common primary mesenchymal tumors of the gastrointestinal tract [1,2,3]. GISTs originate from the interstitial cells of Cajal in the gastrointestinal tract wall [4, 5] and have malignant potential as a result of activating mutations in the KIT proto-oncogene or the platelet-derived growth factor receptor alpha gene [6, 7]. The National Institutes of Health (NIH) 2008 criteria stratify GISTs into four risk categories: “very low risk,” “low risk,” “intermediate risk” and “high risk” [8]. As the risk category and malignant potential increase, the prognosis of patients with GISTs worsens.

Until approximately the early 2000s, GISTs generally had a poor prognosis because effective treatments were lacking. The discovery of imatinib mesylate, a small-molecule inhibitor of receptor tyrosine kinases, has dramatically changed the outcome of patients with high-risk GISTs [9]. Accurate risk categorization of GISTs therefore has important clinical significance for choosing appropriate treatment and assessing prognosis. In recent years, the pathological risk degree of GISTs has become an issue of great concern [10].

Computed tomography (CT), a common pretreatment examination, shows the tissues adjacent to the primary tumor in detail and helps to determine metastasis or recurrence of GISTs [11]. Imaging and clinicopathological findings of GISTs in the stomach, small intestine and other sites have been described extensively [12,13,14,15,16]. Some studies have described the predictive value of subjective CT signs for the risk stratification of patients with GISTs [11, 17]. However, high subjectivity and the lack of validation hamper the clinical application of these methods. There is therefore an increasing need for accurate and objective risk stratification of GISTs using quantitative techniques.

Radiomics, which transforms medical images into mineable high-dimensional data, has recently shown great potential in aiding clinical decision-making [18]. Many studies have demonstrated that radiomics analysis can extract more useful information from medical images and has great potential for improving diagnosis and prognosis [19,20,21,22]. We therefore hypothesized that CT-based radiomics could improve the risk stratification of GISTs and serve as a complementary tool for achieving an accurate diagnosis. The purpose of our study was to develop and validate CT-based radiomics models for GIST risk stratification.

Materials and methods

Characteristics of patients

This retrospective study was approved by the institutional review board of Cangzhou Central Hospital, which waived the requirement for informed consent. Three hundred and sixty-six patients clinically suspected of having primary GISTs from January 2013 to February 2018 were retrospectively identified in our institutional database. The inclusion criteria were as follows: (1) surgical resection of the tumor and complete clinicopathological data; (2) no treatment prior to surgery; (3) preoperative contrast-enhanced CT examination with good image quality. The exclusion criteria were as follows: (1) unresectable tumor or endoscopic resection; (2) treatment before surgery; (3) no preoperative contrast-enhanced CT or poor-quality CT images. A total of 140 patients were eventually included in the study (Fig. 1). Clinical data were reviewed and included age, sex, lesion location and risk stratification. The study population characteristics are summarized in Table 1.

Fig. 1 Patient enrollment process

Table 1 Characteristics of the 140 patients with GISTs

CT image acquisition

All 140 patients underwent abdominal and/or pelvic contrast-enhanced CT examination covering the whole tumor using a 320-detector row scanner (Toshiba Medical Systems, Otawara, Japan) in our hospital. The CT imaging parameters were as follows: tube voltage 120 kV; tube current 100–450 mA; rotation time 0.5 s; detector collimation 64 × 0.625 mm; matrix 512 × 512; pitch 0.6–1.2:1; slice thickness 5 mm. For dynamic contrast-enhanced CT imaging, an iodinated contrast agent (1 mL/kg) was intravenously injected at a rate of 3.5 mL/s with an automatic power injector. The arterial phase, portal venous phase and equilibrium phase were obtained after 25–30 s, 60 s and 180 s, respectively. The CT images of all three phases were reconstructed with a slice thickness of 2 mm.

Reference standard

This study was divided into two parts, based on the NIH 2008 Consensus Classification system [8], which considers tumor size, mitotic count and tumor site (Table 2). The reference standard was the pathological examination of specimens obtained after lesion resection.

Table 2 NIH 2008 criteria for defining risk stratification of GIST recurrence after surgery

Data partitioning based on the NIH Consensus Classification

The entire patient population was distributed as follows: eight patients in the very-low-risk category; 47 patients in the low-risk category; 33 patients in the intermediate-risk category; 52 patients in the high-risk category. The very-low-risk and low-risk categories were considered “nonadvanced GISTs,” while the intermediate-risk and high-risk categories were considered “advanced GISTs.” The 140 patients were randomly divided into two cohorts in a ratio of approximately 7:3, based on a computer-generated seed. There were ultimately 100 patients in the training cohort and 40 patients in the validation cohort.
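The grouping and the random split described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the function names, the patient identifiers and the seed value are hypothetical, and only the category counts and the cohort sizes (100 and 40) are taken from the text.

```python
import random

# Category counts reported in the text (they sum to 140 patients).
COUNTS = {"very low": 8, "low": 47, "intermediate": 33, "high": 52}


def binary_label(risk):
    """Map a four-level NIH risk category to the binary label used in the study."""
    return "nonadvanced" if risk in ("very low", "low") else "advanced"


def split_cohort(patients, n_train, seed):
    """Shuffle reproducibly with a fixed seed, then cut into training/validation."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical patient list of (patient_id, risk category) pairs.
patients = []
for risk, n in COUNTS.items():
    for _ in range(n):
        patients.append((len(patients), risk))

# seed=42 is an arbitrary placeholder; the seed is not reported in the paper.
train, validation = split_cohort(patients, n_train=100, seed=42)

n_nonadvanced = sum(binary_label(risk) == "nonadvanced" for _, risk in patients)
print(len(train), len(validation), n_nonadvanced)  # 100 40 55
```

With the reported counts, 55 patients fall into the nonadvanced group and 85 into the advanced group. The sketch does not stratify the split by risk category; whether the original split was stratified is not stated.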

Tumor segmentation

Lesions were delineated on the CT arterial phase images using the ITK-SNAP software (available at www.itk-snap.org), because the arterial phase is better for distinguishing a tumor mass from adjoining normal tissue. The whole-tumor volume was determined by manually drawing a region of interest along the border of the tumor on each consecutive slice covering the whole lesion while excluding intra-luminal air and calcification. Two experienced radiologists, both with at least 10 years of abdominal imaging experience and blinded to the clinical outcomes, conducted the process in consensus. A three-dimensional (3D) volume of interest (VOI) was thereby obtained (Fig. 2).

Fig. 2 Example of tumor delineation and segmentation. a Delineation of an intestinal stromal tumor; b single-slice region of the same tumor; c whole volume extracted from the same tumor

Radiomics feature extraction and feature selection

Quantitative radiomics feature extraction was conducted as previously described [23]. A total of 396 radiomics features were extracted, which included (1) 42 first-order statistics features, (2) nine shape and size features and (3) 345 texture features, comprising 144 Gray Level Co-occurrence Matrix (GLCM), 180 Gray Level Run Length Matrix (GLRLM), 11 Gray Level Size Zone Matrix (GLSZM) and ten Haralick features. The features were extracted automatically using the AK radiomics analysis software (A.K.; GE Healthcare, Shanghai, China), as in previous reports [24, 25].

After the features were centered and scaled, a two-step feature selection procedure was applied to remove highly redundant and correlated features. First, highly correlated features were eliminated by Pearson correlation analysis using an r threshold of 0.75. A random forest (RF) classifier was then used to rank feature importance. The RF is an ensemble classifier that consists of several decision trees. Each node in a decision tree is a condition on a single feature that splits the dataset into two sets so that similar response values end up in the same set. The measure used to select the locally optimal condition is called impurity, and Gini impurity or information gain/entropy is typically used as the classification criterion. The degree to which each feature reduces the weighted impurity in a tree can thus be calculated during the training stage. For the RF, the impurity decrease from each feature is averaged across trees, and the features are ranked based on this metric. Chalkidou et al. [26] proposed that at least 10–15 events per predictor variable are required to generate reasonably stable estimates of the effects of the predictors in multivariate analysis. Thus, in our study, the top five most important features, using Gini impurity as the criterion, were selected as the final radiomics features.
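The two selection steps can be illustrated with a small, self-contained sketch. It is not the implementation used in the study: the function names, the greedy order of the correlation filter, the use of the absolute value of r, the fixed cut point and the toy data are all assumptions made for illustration. In addition, the impurity decrease is computed for a single split only, whereas a random forest searches for the best cut at every node and averages the decreases over all nodes and trees.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.75):
    """Greedy filter: keep a feature only if |r| with every already kept
    feature is at or below the threshold. `features` maps name -> values."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(values, labels, cut):
    """Weighted Gini decrease obtained by splitting on `value <= cut`."""
    left = [lab for v, lab in zip(values, labels) if v <= cut]
    right = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy data invented for illustration: f2 is nearly a linear copy of f1.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [2.1, 4.0, 6.2, 8.1, 9.9, 12.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
labels = [0, 0, 0, 1, 1, 1]

kept = drop_correlated(features)  # f2 is removed as redundant with f1
ranking = sorted(
    kept,
    key=lambda k: impurity_decrease(features[k], labels, cut=3.5),
    reverse=True,
)
print(kept, ranking)
```

In this toy example, f1 separates the two classes perfectly at the chosen cut and therefore yields the largest possible decrease for two balanced classes (0.5), so it ranks ahead of f3.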

Radiomics model building and validation

The five selected radiomics features and the corresponding labels (i.e., a binary label for predicting nonadvanced/advanced GISTs and encoded labels for predicting the four risk categories) were fed into two RF classifiers to build the prediction models. The output of the first model was the predicted binary label for nonadvanced or advanced GISTs. The output of the second model was the encoded label value corresponding to one of the four risk categories. The number of trees, a hyper-parameter, was optimized with tenfold cross-validation for both models. To obtain a less biased evaluation of model performance, a nested, three-times-repeated tenfold cross-validation was conducted that included feature selection, model building and evaluation. The final results, which included accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), were averaged across the three repetitions and used to quantify the performance of the two models. A receiver operating characteristic (ROC) curve analysis was also conducted, and the area under the ROC curve (AUC), which represents the classification power of a classifier, was calculated. A higher AUC indicates a better classification ability [27].
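For reference, the performance indexes named above can be computed from paired true and predicted labels as in the following sketch. The function names, the 0.5 decision threshold and the example data are invented for illustration; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), which is equivalent to the area under the empirical ROC curve.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, PPV and NPV from paired labels.
    Assumes every denominator is non-zero (no guard for empty classes)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs in which the positive
    case has the higher score (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = advanced GIST, 0 = nonadvanced GIST.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold

metrics = binary_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["sensitivity"], auc(y_true, scores))
# 0.75 0.75 0.875
```

In a repeated cross-validation, these indexes would be computed on each held-out fold and then averaged; for the four-class model, one common choice is to compute them one class against the rest.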

The predictive performance of the radiomics models was then tested in the validation cohort. The same indexes as in the training cohort were calculated for independent model evaluation.

Statistical method

Statistical analysis was performed using SPSS 22.0 software (SPSS Inc., Chicago, IL, USA). Quantitative data were described as mean ± standard deviation (SD), and qualitative data were described as frequency (percentage). The t test was used to analyze the relationship between the continuous variable (age) and risk degree, while the χ² test or Fisher's exact test, as appropriate, was used to analyze the relationship between the categorical variables (site, sex) and risk degree. p < 0.05 was considered statistically significant.
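As a worked illustration of the χ² test, the sketch below compares the sex distribution of the training and validation cohorts, using the counts reported in the Results for the binary classification (53 men and 47 women versus 23 men and 17 women). It is an assumption-laden sketch rather than a reproduction of the SPSS analysis: the function name is hypothetical and no continuity correction is applied, so the statistic may differ slightly from the SPSS output. For one degree of freedom, the p value follows from the identity p = erfc(√(χ²/2)).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.
    Returns (statistic, p_value) for 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    return stat, math.erfc(math.sqrt(stat / 2))


# Rows: training, validation; columns: men, women.
stat, p = chi_square_2x2([[53, 47], [23, 17]])
print(round(stat, 3), p > 0.05)
```

The resulting p value is well above 0.05, in line with the reported absence of a statistically significant difference in sex between the two cohorts.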

Results

Clinical characteristics

To determine whether the radiomics model could distinguish between nonadvanced and advanced GISTs, we used a training dataset comprising 53 men and 47 women (mean age 59.30 years; range 33–79 years) and a validation dataset comprising 23 men and 17 women (mean age 62.50 years; range 39–86 years). The training and validation datasets showed no statistically significant differences in age or sex. Table 3 presents the patients' clinical characteristics for the binary classification.

Table 3 The clinical characteristics for the binary classification

To determine whether the radiomics model could distinguish between the four risk degrees of GISTs, we used a training dataset comprising 43 men and 57 women (mean age 60.10 years; range 33–86 years) and a validation dataset comprising 23 men and 17 women (mean age 60.00 years; range 34–86 years). Differences between the training and the validation datasets were not statistically significant for age and sex in the very-low-risk, low-risk, intermediate-risk and high-risk groups. Table 4 presents the clinical characteristics of the four-class classification.

Table 4 The clinical characteristics for the four-class classification

Selected radiomics features

For the radiomics model distinguishing between nonadvanced and advanced GISTs, 48 of the 396 features remained after de-correlation (Fig. 3). Figure 3a shows the feature importance plot for predicting nonadvanced and advanced GISTs.

Fig. 3 Feature importance plot showing the mean decrease in Gini impurity. Features that most reduce Gini impurity are those that result in the least misclassification. a Discrimination of nonadvanced from advanced GISTs; b discrimination of the four risk degrees of GISTs

The five most important features according to the RF ranking were RunLengthNonuniformity_AllDirection_offset1_SD, Surface Volume Ratio, Zone Percentage, VoxelValueSum and GreyLevelNonuniformity_AllDirection_offset4_SD.

For the radiomics model distinguishing the four risk degrees of GISTs, 43 of the 396 features remained after de-correlation (Fig. 3). Figure 3b shows the feature importance plot for predicting the four risk categories. The five most important features according to the RF ranking were RunLengthNonuniformity_Angle0_offset4, RunLengthNonuniformity_AllDirection_offset1_SD, LongRunLowGreyLevelEmphasis_Angle0_offset1, Surface Volume Ratio and GreyLevelNonuniformity_AllDirection_offset4_SD.

The importance of the remaining features calculated by RF ranking is presented in Table 5.

Table 5 The five important radiomics features calculated by RF for the model of predicting nonadvanced/advanced and model of predicting four kinds of risk

Radiomics model performance

To evaluate whether the radiomics model could distinguish nonadvanced from advanced GISTs, as well as between the four risk degrees of GISTs, we applied three-times-repeated tenfold cross-validation and obtained the accuracy, specificity, sensitivity, PPV and NPV for the training and validation datasets (Fig. 4 and Table 6). The radiomics signature discriminated advanced from nonadvanced GISTs with an AUC of 0.935 (95% CI 0.870–1.000) and an accuracy of 90.2% in the validation cohort. For the four-category risk stratification of GISTs, the radiomics signature achieved an AUC of 0.809 (95% CI 0.777–0.841) and an accuracy of 67.5% in the validation cohort. We also performed the DeLong test to compare the AUCs of the training and validation cohorts; the p value was 0.672 (> 0.05), indicating no significant difference between the two cohorts and suggesting that the model was not overfitted.

Fig. 4 ROC curves. a Classification of advanced and nonadvanced GISTs in the training cohort; b classification of advanced and nonadvanced GISTs in the validation cohort; c classification of the four risk degrees of GISTs in the training cohort; d classification of the four risk degrees of GISTs in the validation cohort. In c and d, numbers 1–4 represent very low risk, low risk, intermediate risk and high risk, respectively

Table 6 The predictive performance of radiomics signature for discrimination of nonadvanced from advanced GISTs and for discrimination of the four different risk degrees of GISTs

Discussion

In the current study, we developed and validated pretreatment CT-based radiomics models with five features each and demonstrated that CT-based radiomics analysis is a potentially useful method for distinguishing between nonadvanced and advanced GISTs and for classifying the four risk categories of GISTs. GISTs have a complex biologic behavior that makes it difficult to predict their malignant potential [28]. The introduction of imatinib as an effective targeted therapy and the development of risk stratification for GISTs have changed the paradigm of GIST treatment [9, 29]. Predicting the risk of GIST recurrence for individual patients is therefore increasingly important. Previous studies used qualitative and subjective analyses for the risk assessment of GISTs [30]. However, variability in expertise in the subjective assessment of mitotic count and CT findings, as well as the possibility of non-representative biopsy samples due to tumor heterogeneity, is difficult to overcome [31]. A possible remedy for this limitation is to delineate tumors as VOIs and extract radiomics features from them for a more comprehensive analysis [24, 32, 33]. Compared with local biopsy samples and two-dimensional features, the 3D volume CT-based radiomics signature contains more detailed information, which helps to reduce the influence of tumor heterogeneity [18, 34].

The stratification of patients based on the risk of GIST recurrence is a key issue in managing primary GISTs [35]. The National Comprehensive Cancer Network guidelines recommend adjuvant imatinib treatment of ≥ 3 years for patients with a significant risk of recurrence (i.e., high-risk and intermediate-risk levels) [36]. However, patients who have favorable outcomes after undergoing resection of very-low-risk or low-risk tumors are likely to be cured by surgery alone and should not receive adjuvant imatinib [37, 38]. Thus, in the first part of our study, we classified patients in the very-low-risk and low-risk categories as having nonadvanced GISTs, and patients in the intermediate-risk and high-risk categories as having advanced GISTs. Kang et al. [30] studied 22 cases of primary GISTs using diffusion-weighted imaging and found that the apparent diffusion coefficient values distinguished patients with very-low-risk or low-risk GISTs from those with intermediate-risk or high-risk GISTs with acceptable accuracy (81.8%). This is lower than the accuracy of the radiomics features in our study in both the training and the validation datasets (81.8% vs 89.9% and 90.2%, respectively); moreover, the patient cohort of that study was very small. Because CT is usually used to examine GISTs preoperatively [29], a CT-based radiomics signature for distinguishing nonadvanced from advanced GISTs could be more widely applied in clinical practice. Wang [11] found that the diagnostic accuracy of a support vector machine model established with ten imaging features as indexes was 70.0%, and that the model was especially reliable when diagnosing GISTs of high or low risk. Our results for the training and validation cohorts showed accuracies of 89.9% and 90.2%, respectively, indicating better performance than the subjective model.

Chen et al. [39] reported that their radiomics model had an AUC of 0.867 [95% confidence interval (CI) 0.803–0.932] in the primary cohort and 0.847 (95% CI 0.765–0.930) in the external cohort for differentiating low- and high-malignant-potential GISTs. Our binary classification had an AUC of 0.935 (95% CI 0.870–1.000) in the validation cohort, which is consistent with this previous research and supports the potential of radiomics analysis in terms of diagnostic accuracy. In addition, to obtain the best four-class model, we repeated the feature selection to find the optimal features for that task. Compared with the two-class model, the four-class model includes more run-length matrix features, which capture the distribution of runs of identical gray values within the tumor and thus reflect more information on internal heterogeneity.

Joensuu et al. [40] reported that very-low-risk, low-risk and intermediate-risk GISTs generally had favorable and similar outcomes, whereas patients in the high-risk category would be considered for adjuvant treatment. From the perspective of clinical treatment and prognosis, clinicians are more concerned about patients with high-risk GISTs, because these patients require postoperative targeted drugs to prevent recurrence. Patient selection for adjuvant treatment based on risk assessment is therefore important. In this context, providing an accurate risk assessment using the CT-based radiomics signature could be a new development for treating high-risk GISTs. Zhou et al. [17] indicated that the AUC of a logistic regression model based on subjective CT features for three risk degrees of primary GISTs was 0.806 (95% CI 0.727–0.885). In our study, the radiomics signature discriminated the four risk categories of GISTs in the training and validation cohorts with AUCs of 0.869 (95% CI 0.821–0.917) and 0.809 (95% CI 0.777–0.841), respectively, showing better performance than the subjective model.

As a computer-aided and quantitative biomarker, the radiomics signature could aid less experienced radiologists and minimize inter-observer variability in imaging diagnosis. CT-based radiomics analysis is important and necessary as a noninvasive complementary approach to provide a complete characterization of the tumor, especially in cases with incomplete tumor samples. Radiomics analysis could also be conducted on existing images without additional cost, thus having the potential for a wider range of clinical applications.

Nevertheless, the current study had some limitations. (1) We did not conduct a direct comparison between subjective CT features and pathological results, because radiomics features of the whole tumor volume may already represent the macro- and micro-pathology, whereas subjective CT features are subject to inter-observer variability in imaging diagnosis. However, even if microscopic tumor features cannot be directly assessed on CT, an indirect correlation can be assumed [41], which warrants further study. (2) It was not determined whether the venous phase could be used to distinguish the four risk degrees of GISTs, and further studies are necessary to address this point. (3) Although the arterial phase provided good visibility of the tumor, radiomics features from this phase may be more strongly influenced by the type, amount and injection flow rate of the contrast agent, which may limit the reproducibility of the results. (4) There were only eight tumors in the very-low-risk category, which might have biased the comparison of risk stratification between the pathological diagnosis and the radiomics analysis. (5) Gene expression profiles are promising new methods for assessing the risk of GIST recurrence and/or metastasis [42], and further investigation is needed to better understand the potential mechanism.

Conclusion

We conclude that radiomics analysis can capture features of the four risk categories of GISTs. The CT-based radiomics signature presented here distinguishes between nonadvanced and advanced GISTs, as well as between the four risk categories of GISTs, with good diagnostic accuracy. These results highlight the potential of quantitative radiomics analysis as a complementary tool for achieving an accurate diagnosis.