Abstract

In this paper, we mainly use random forest and the broad learning system (BLS) to predict rectal cancer. A total of 246 participants with computed tomography (CT) image records were enrolled. In the training set, the combined model (imaging plus clinical indicators) gave the best prediction result, with an area under the curve (AUC) of 0.999 (95% confidence interval (CI): 0.996–1.000) and an accuracy of 0.990 (95%CI: 0.976–1.000). In the test set, model 3 (the combined model) also gave the best prediction result, with an AUC of 0.962 (95%CI: 0.915–1.000) and an accuracy of 0.920 (95%CI: 0.845–0.995). The results of the random forest model were compared with those of the BLS model, and no statistical difference was found between the two. The prediction model that incorporates image features predicts well, and the image feature is the most important of all features. Consequently, rectal cancer can be predicted successfully by combining the clinical indicators with a comprehensive indicator of CT image characteristics in four different periods (plain scan, vein, artery, and excretion).

1. Introduction

Rectal cancer is one of the most common malignant tumors in the world [1–3]. Accurate clinical staging is the key to treatment decisions, especially for rectal cancer [2, 3]. Over the past century, medical imaging technology has overcome many difficulties and made many advances in its continuous development [3–5]. In recent years, high-resolution magnetic resonance imaging (MRI) and computed tomography (CT), along with endoscopic ultrasonography, have enabled clinicians to stage rectal cancer more accurately before surgery and to choose the corresponding treatment according to tumor location, infiltration depth, lymph node involvement, and distant metastasis [1–10].

The treatment of rectal cancer has now entered a multidisciplinary, comprehensive pattern [11, 12]. Within this pattern, neoadjuvant chemoradiotherapy followed by selective radical resection is one of the generally recognized methods for treating advanced rectal cancer [13–15]. To date, a series of novel adjuvant chemoradiotherapy regimens have been recommended for patients with TNM stage rectal cancer [16]. Medical imaging remains one of the most commonly used methods to obtain noninvasive information on human tissues and organs and to assist disease diagnosis, which calls for accurate prediction of rectal cancer from the medical images and clinical features of patients in different periods [17, 18].

The purpose of this study was to predict whether patients had rectal cancer by analyzing CT images over four periods together with clinical features. The manuscript is organized as follows. The proposed models are introduced in Section 2. In Section 3, the experimental results of these models are given, and the results are then compared and discussed with the aim of choosing the most suitable prediction model.

2. Proposed Models

2.1. Data Collection and Treatment

The data were collected from Tianjin Fourth Central Hospital from 2016 to 2021 and involve 246 individuals with CT information. All statistical tests were two-sided [19]. In the significance tests, α = 0.05 was used as the threshold for statistical significance [20]. The measurement data were tested for normality with the Kolmogorov–Smirnov test [21]. Continuous variables with a normal distribution were expressed as mean ± standard deviation (mean ± SD), and comparisons between groups were made with the t-test [22–24]. Measurement data with a nonnormal distribution were expressed as the median and interquartile range [25], and the rank sum test was used for comparisons between groups [26–28]. Categorical variables were described by the number and percentage of cases in each category [29]. All missing values were filled by random imputation [30]. The filling result is considered reliable since there was no difference before and after filling, as shown in Table 1.
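The test-selection logic described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming SciPy is available; the helper name `compare_groups` and all values are hypothetical and are not taken from the study, which used SAS for its group comparisons.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated in the text


def compare_groups(a, b, alpha=ALPHA):
    """Compare one continuous variable between two groups (hypothetical helper).

    Normality is checked with a Kolmogorov-Smirnov test on standardized
    values; normal data go to a two-sided t-test, other data to a two-sided
    rank sum (Mann-Whitney U) test.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    def looks_normal(x):
        # Standardize with the sample mean and SD, then compare to N(0, 1).
        z = (x - x.mean()) / x.std(ddof=1)
        return stats.kstest(z, "norm").pvalue > alpha

    if looks_normal(a) and looks_normal(b):
        test_name = "t-test"
        p_value = stats.ttest_ind(a, b).pvalue
    else:
        test_name = "rank sum test"
        p_value = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    return test_name, p_value, p_value < alpha


# Synthetic example: two groups whose means differ by one standard deviation.
rng = np.random.default_rng(2021)
group_a = rng.normal(50, 10, size=100)
group_b = rng.normal(60, 10, size=100)
print(compare_groups(group_a, group_b))
```

The function returns the name of the test that was applied, its P value, and whether the difference is significant at α = 0.05.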

As seen in Figure 1, each patient has four CT images, which represent four different periods: the plain scan period (no contrast medium is injected), the arterial period (contrast agent is injected and visualized in the arteries to observe whether the arterial blood supply of the lesion is abundant), the venous period (contrast agent enters the venous vessels, and the venous blood supply of the lesion is observed), and the excretion period (during which the contrast medium is excreted). Features were extracted from the CT image of each period with pyradiomics, and the features from the four periods were then combined for each sample. The rectal part of each CT image was extracted as the region of interest (ROI) [31].

With the random seed set to 2021, the data were randomly split 8 : 2: 80% of the data were used as the training set to build the model, and the remaining 20% were used as the test set to verify the model. In the training set, Lasso regression (α = 0.01) was used to screen the clinical data and imaging features [9, 10]. Four clinical characteristics were finally selected: gender, diabetes history, family cancer history, and fecal occult blood. A total of 15 image features were selected.
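The split and screening step can be illustrated with a short sketch. It assumes scikit-learn and uses synthetic data in place of the study's features; the stratified split and the standardization step are assumptions that the text does not state.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 2021  # random seed reported in the text

# Synthetic stand-in data (hypothetical): 246 samples, 40 candidate features.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(246, 40))
# Only the first three features carry signal in this toy example.
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=246) > 0).astype(int)

# 8:2 split; stratifying on the label is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

# Standardize using the training set only, then fit Lasso with alpha = 0.01.
scaler = StandardScaler().fit(X_train)
lasso = Lasso(alpha=0.01).fit(scaler.transform(X_train), y_train)

# Features with non-zero coefficients are the ones Lasso keeps.
selected = np.flatnonzero(lasso.coef_)
print(f"train={len(X_train)}, test={len(X_test)}, kept={len(selected)} features")
```

Fitting the scaler and Lasso on the training set only keeps information from the test set out of the feature screening.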

2.2. Construction of Prediction Model

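We mainly use random forest and the broad learning system (BLS) to predict rectal cancer. BLS maps the input data to mapped features, passes the mapped features through an activation function to form enhancement nodes, and feeds both sets of features together to the output layer. In this paper, BLS is used to learn from the input variables of the model to obtain the output variables. In the standard BLS formulation, with input data $X$, randomly generated weights $W_{e_i}$, $W_{h_j}$ and biases $\beta_{e_i}$, $\beta_{h_j}$, mapping function $\phi$, and activation function $\xi$, the quantities are as follows.

Output value of the mapping nodes:

$$Z_i = \phi(X W_{e_i} + \beta_{e_i}), \quad i = 1, \ldots, n.$$

Output value of the enhancement nodes:

$$H_j = \xi(Z^n W_{h_j} + \beta_{h_j}), \quad j = 1, \ldots, m, \quad Z^n = [Z_1, \ldots, Z_n].$$

Output value of the output nodes:

$$Y = [Z_1, \ldots, Z_n \mid H_1, \ldots, H_m]\, W = A W.$$

So, using the pseudoinverse matrix $A^{+}$, the output weights are

$$W = A^{+} Y, \quad A^{+} = \lim_{\lambda \to 0} (\lambda I + A^{\mathsf{T}} A)^{-1} A^{\mathsf{T}}.$$

There are n groups of Z with k nodes in each group and m groups of H with P nodes in each group.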
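The BLS structure described above (mapping nodes, enhancement nodes, and output weights obtained through a pseudoinverse) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the configuration used in the study: the linear mapping function, the tanh activation, the weight scaling, the ridge parameter `lam`, and the toy data are all hypothetical choices.

```python
import numpy as np


def bls_fit_predict(X_train, Y_train, X_test, n=5, k=4, m=2, P=10,
                    lam=1e-2, seed=0):
    """Minimal broad learning system sketch (hypothetical helper).

    n groups of k mapping nodes and m groups of P enhancement nodes are
    generated with random weights; only the output weights W are learned.
    """
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]

    # Random weights and biases for all mapping and enhancement groups.
    W_e = rng.normal(size=(d, n * k)) / np.sqrt(d)
    b_e = rng.normal(size=n * k)
    W_h = rng.normal(size=(n * k, m * P)) / np.sqrt(n * k)
    b_h = rng.normal(size=m * P)

    def hidden(X):
        Z = X @ W_e + b_e              # mapping nodes (linear map assumed)
        H = np.tanh(Z @ W_h + b_h)     # enhancement nodes (tanh assumed)
        return np.hstack([Z, H])       # A = [Z | H]

    A = hidden(X_train)
    # Ridge-regularized pseudoinverse: W = (lam * I + A^T A)^{-1} A^T Y
    W = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y_train)
    return hidden(X_test) @ W


# Toy binary problem: 300 training samples, 100 test samples.
rng = np.random.default_rng(2021)
X = rng.normal(size=(400, 5))
Y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
scores = bls_fit_predict(X[:300], Y[:300], X[300:])
accuracy = np.mean((scores > 0.5) == (Y[300:] > 0.5))
print(f"test accuracy on toy data: {accuracy:.2f}")
```

Because the hidden weights are fixed at random values, training reduces to a single linear solve, which is what makes BLS fast to fit.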

Then, new features are screened using the 1-norm penalized loss function of Lasso regression and are incorporated into the random forest. A random forest is a combination of decision trees. Each decision tree is trained on a new data set generated randomly from the original data set, and the decision of the random forest is the decision made by the majority of its trees. A single classification model often has low precision and is prone to overfitting, so many scholars combine multiple single models to improve prediction accuracy; these methods are called classifier combination methods. Random forest is an algorithm proposed to solve the overfitting problem of the single decision tree model.

The random forest uses the Bootstrap resampling method to draw multiple samples from the original sample, builds a decision tree for each Bootstrap sample, and then combines the decision trees and obtains the final prediction result through voting. The core idea of Bootstrap resampling is to draw n observations with replacement from the original sample of size N, where each observation has an equal probability of being selected, that is, 1/N. The original sample is treated as the population, and the subsamples drawn from it are treated as samples from that population. The resulting subsample is called the bootstrap sample.

(1) Each decision tree $h_i$ is generated by training sample X with sample size K and random vector $\theta_i$

(2) The random vector sequence $\{\theta_i\}$ is independently and identically distributed

(3) The random forest is the set of all decision trees $\{h_i\}$
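Bootstrap resampling as described above can be illustrated with a short NumPy sketch. The toy sample and the seed are hypothetical, and the out-of-bag set is shown only as a side note that the text does not discuss.

```python
import numpy as np

rng = np.random.default_rng(2021)

N = 10                     # size of the original sample
original = np.arange(N)    # toy observations 0..N-1

# Draw N indices with replacement; each observation has probability 1/N per draw.
idx = rng.integers(0, N, size=N)
bootstrap_sample = original[idx]

# Observations that were never drawn form the out-of-bag set.
out_of_bag = np.setdiff1d(original, bootstrap_sample)
print(bootstrap_sample, out_of_bag)
```

Each tree in the forest would be trained on its own bootstrap sample drawn in this way.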

Each decision tree model has one vote to select the classification result of input variable X:

$$H(x) = \arg\max_{Y} \sum_{i=1}^{T} I\big(h_i(x) = Y\big),$$

where $H(x)$ represents the random forest classification result, $h_i(x)$ represents the classification result of a single decision tree, $Y$ represents the classification target, $I(\cdot)$ represents an indicator function, and $T$ is the number of decision trees. The random forest classification model uses a simple voting strategy to complete the final classification.
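The simple voting strategy can be written in a few lines. The helper name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Five hypothetical trees classify one sample (1 = rectal cancer, 0 = not).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # 1
```

Three of the five trees vote for class 1, so the forest outputs class 1.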

Four models are constructed, as shown in Figure 2. In model 1, only the four clinical features are used to build a random forest prediction model. Model 2 is a random forest prediction model based on the 15 image features. In model 3, the prediction probability of model 2 is used as a new index and combined with the four clinical characteristics to build a random forest prediction model. Model 4 is a prediction model trained with BLS on the data of model 3.
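The relationship between models 1, 2, and 3 can be sketched with scikit-learn. The data are synthetic and the hyperparameters are library defaults, which is an assumption. The text does not state whether the probability from model 2 was computed in-sample or out-of-fold; this sketch uses in-sample probabilities for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED = 2021
rng = np.random.default_rng(SEED)

# Synthetic stand-ins (hypothetical): 4 clinical and 15 image features.
n_samples = 196
y = rng.integers(0, 2, size=n_samples)
clinical = rng.normal(size=(n_samples, 4)) + 0.5 * y[:, None]
image = rng.normal(size=(n_samples, 15)) + 0.8 * y[:, None]

# Model 1: clinical features only.
model1 = RandomForestClassifier(random_state=SEED).fit(clinical, y)

# Model 2: image features only; its predicted probability becomes an image score.
model2 = RandomForestClassifier(random_state=SEED).fit(image, y)
image_score = model2.predict_proba(image)[:, 1]

# Model 3: the image score combined with the four clinical features.
combined = np.column_stack([image_score, clinical])
model3 = RandomForestClassifier(random_state=SEED).fit(combined, y)
print(combined.shape)
```

Model 3 therefore sees five inputs: one image score and four clinical features.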

The area under the curve (AUC) and accuracy were used to evaluate the predictive performance of the models. The DeLong test was used to compare the evaluation indexes of the models. A difference with P < 0.05 was considered statistically significant. In this study, R (version 4.0.3) was used for clinical data preprocessing, SAS (version 9.4) was used for comparisons among data groups, and Python (version 3.7.4) was used for data screening and model building.
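The two evaluation indexes can be computed as in the sketch below, which assumes scikit-learn. The text does not say how the 95% confidence intervals were obtained, so the percentile bootstrap used here is an assumption. The labels and probabilities are hypothetical, and the DeLong test is not covered.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def bootstrap_ci(metric, y_true, y_pred, n_boot=1000, seed=2021):
    """Percentile 95% CI of a metric via bootstrap resampling (assumed method)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when only one class is drawn
        values.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(values, [2.5, 97.5])


# Hypothetical test-set labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.9, 0.35, 0.65])
y_label = (y_prob >= 0.5).astype(int)  # 0.5 cut-off is an assumption

auc = roc_auc_score(y_true, y_prob)
acc = accuracy_score(y_true, y_label)
auc_lo, auc_hi = bootstrap_ci(roc_auc_score, y_true, y_prob)
print(f"AUC={auc:.3f} (95%CI: {auc_lo:.3f}-{auc_hi:.3f}), accuracy={acc:.3f}")
```

AUC is computed from the predicted probabilities, whereas accuracy requires a cut-off to turn probabilities into class labels.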

3. Results and Discussion

3.1. Characteristics between the Training and Testing Sets

After the data were split into the training set and the test set, a balance test was carried out. The P values of all variables were >0.05, indicating that the two groups of data are balanced and comparable. See Table 2 for details.

3.2. Characteristics between Rectal Cancer and Non-Rectal Cancer Groups

Table 3 shows the characteristics of the participants in the rectal cancer and non-rectal cancer groups. There are significant differences between the two groups in gender, past diabetes history, family cancer history, drinking history, fecal occult blood test, carcinoembryonic antigen, and carbohydrate antigen [11, 12].

3.3. Comparison for the Predictive Performance of Models (1, 2, and 3)

The DeLong test showed that, in the training set, the combined model, namely, model 3, has the best predictive performance, with an AUC of 0.999 (95%CI: 0.996–1.000) and an accuracy of 0.990 (95%CI: 0.976–1.000). See Table 4 for details.

The DeLong test showed that, in the test set, the combined model, that is, model 3, also has the best predictive performance, with an AUC of 0.962 (95%CI: 0.915–1.000) and an accuracy of 0.920 (95%CI: 0.845–0.995). See Table 5 for details. The ROC curves of these models (1, 2, and 3) are shown in Figure 3.

3.4. Comparison for the Predictive Performance between Model 3 and BLS Model

The results of the random forest model (model 3) were compared with those of the BLS model. No statistical difference was found between the two results (see Tables 6 and 7). The ROC curves of model 3 and the BLS model are shown in Figure 4.

3.5. Discussion on the Importance of Variables

Because the prediction model built with random forest is more interpretable, we finally chose the random forest model. The feature importance of the random forest model shows that our comprehensive image index score is the most important [13, 14], followed by fecal occult blood, family cancer history, gender, and diabetes. See Figure 5 for details.
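A feature importance ranking of this kind can be read from a fitted random forest. The sketch below assumes scikit-learn and its impurity-based importances; the text does not state which importance measure was used. The feature names are illustrative labels, and the synthetic data are built so that the first feature dominates, so the output does not reproduce the study's result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2021)
names = ["image_score", "fecal_occult_blood", "family_cancer_history",
         "gender", "diabetes_history"]

# Synthetic data (hypothetical): only the first column is informative.
X = rng.normal(size=(300, len(names)))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=2021).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances are computed on the training data, so a permutation-based check on held-out data is a common complement.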

According to relevant studies, the early symptoms of rectal cancer are not obvious, and most patients have already reached the stage of local progression when they visit the doctor. The ideal treatment effect cannot be achieved by surgery alone, and the postoperative 5-year survival rate is only about 50% [19, 20]. Imaging results and their analysis are therefore very important for the diagnosis of rectal cancer. Studies from other countries [21] have shown that rectal CT scanning with gas insufflation and water filling can detect early lesions, raising the diagnostic accuracy to 86%–90% and the staging accuracy to as much as 84%. Staging is judged mainly by whether the fat space around the insufflated bowel is clear and whether the gap between the bowel and surrounding organs has disappeared, which also introduces a certain error. In this study, four models were used for comparative analysis, together with intergroup comparison of the relevant factors. Model 3 was found to perform best, that is, the random forest model that combines the output of model 2 with the four clinical features. Its accuracy was 0.990 in the training set and 0.920 in the test set. This paper also explored model prediction with BLS-learned features. The prediction effect in the training set was much better than that of model 3, but the effect in the test set was still not as good as that of model 3. In summary, clinical indicators were combined with a comprehensive index of CT image features at four different periods (plain, venous, arterial, and excretory) to predict rectal cancer.

4. Conclusion

In this paper, a balance test was performed after the data set was divided; the detailed values are given in Table 2. Intergroup comparison showed statistically significant differences in several aspects, such as gender, previous history of diabetes, and family cancer history. Four models were developed to predict the risk of rectal cancer. Our findings show that the prediction model that includes clinical characteristics and CT images (model 3) has good predictive performance for rectal cancer. This can help clinicians identify rectal cancer cases and improve the prognosis through early treatment.

Data Availability

The data utilized to support the findings are available from the corresponding authors upon request.

Disclosure

Yingyin Feng and Qi Ding are the co-first authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Key Project of Health and Family Planning Commission of Tianjin (no. 14KG113).