Introduction

Breast cancer is the leading cause of cancer-related death in women, accounting for approximately 15% of such deaths [1]. The standard treatment for breast cancer includes surgery, radiotherapy, chemotherapy, and hormone therapy. Most patients with early breast cancer undergo surgical tumor removal [2], and the 5-year relative survival rate for patients diagnosed with localized breast cancer is approximately 98% [3]. However, local recurrence rates of 2–17% have been reported in patients with breast cancer who underwent surgical treatment [4], and the 5-year relative survival rate declines to approximately 80% with regional involvement [3]. Thus, identifying patients at high risk of disease recurrence is clinically important, and accurate methods for predicting tumor aggressiveness are needed for pretreatment risk stratification and prognostication in patients with breast cancer.

Positron emission tomography (PET)/X-ray computed tomography (CT) is used in oncology, including breast cancer, with the tracer 2-deoxy-2-[18F]fluoro-d-glucose ([18F]-FDG), a glucose analog that reflects glucose metabolic activity [5]. Several studies have examined [18F]-FDG-PET-based radiomic characteristics for predicting treatment response or recurrence in patients with breast cancer [6, 7]; however, robust pretreatment radiomic characteristics for outcome prediction have not yet been identified in these patients.

Machine learning (ML) has been reported to identify complex interactions among variables by learning from data and updating its algorithms to achieve accurate prediction [8, 9].

Recently, several studies have reported that ML-based classification methods using [18F]-FDG-PET-based images can differentiate benign from malignant breast lesions and predict axillary lymph node metastases and treatment responses [10,11,12,13,14,15]. However, to our knowledge, no study has examined whether the ML approach using [18F]-FDG-PET/CT-based imaging can predict postoperative disease recurrence in patients with breast cancer. In clinical practice, tumor characteristics such as stage, histology, and hormone receptor status also provide important information when predicting disease recurrence in patients with breast cancer.

Therefore, this study aimed to develop ML models using pretreatment clinical and [18F]-FDG-PET-based radiomic characteristics to predict disease recurrences in patients with breast cancers who underwent surgery.

Materials and Methods

Patients

The institutional review board of our institution approved this retrospective study and waived the requirement for written informed consent. A total of 232 consecutive patients with suspected or confirmed first primary breast cancer underwent pretreatment [18F]-FDG-PET/CT from April 2012 to March 2021. Clinical records were reviewed to identify patients for analysis.

The inclusion criteria for this study were as follows: (1) pathologically proven breast cancer; (2) planned breast surgery; (3) no preoperative history of radiotherapy, chemoradiotherapy, or chemotherapy; and (4) a primary tumor exhibiting visible uptake on PET/CT reports. The exclusion criteria were as follows: (1) a coexisting other malignancy; (2) no surgery after pretreatment [18F]-FDG-PET/CT; (3) an interval of >3 months between [18F]-FDG-PET/CT and surgery; (4) a primary tumor too small (volume of interest [VOI] <64 voxels) for accurate texture analysis; and (5) incomplete follow-up data.

Imaging Protocols

PET/CT was performed using two whole-body PET/CT scanners, namely, Discovery 600-M PET/CT (GE Healthcare, Milwaukee, WI, USA) from January 2011 to 2018 and Discovery MI (GE Healthcare) from February 2018 to March 2021. Patients were instructed to fast for at least 5 h before the examination (mean plasma glucose level, 103 mg/dL; range, 76–174 mg/dL). [18F]-FDG (FDG Scan; Nihon Medi-Physics, Tokyo, Japan) was injected intravenously (mean, 222 ± 31 MBq; range, 158–288 MBq). The emission scan was performed 1 h after the injection, following CT data acquisition (slice thickness, 3.75 mm; pitch, 1.375 mm; 120 keV; and auto mA, 40–100 mA depending on the body mass). The acquisition time was 2.5 min per bed position (7–11 bed positions). Attenuation-corrected data were acquired. For the Discovery 600-M scanner, images were reconstructed using a three-dimensional ordered subset expectation–maximization algorithm (image matrix size, 192 × 192; 16 subsets, two iterations; voxel size, 3.125 × 3.125 × 3.27 mm³; VUE Point Plus). For the Discovery MI scanner, images were reconstructed using a Bayesian penalized likelihood algorithm with point spread function modeling (image matrix size, 192 × 192; voxel size, 2.60 × 2.60 × 2.78 mm³; penalization factor, 700; Q.Clear).

Image and Radiomic Feature Analyses

Two radiologists (with 17 and 19 years of [18F]-FDG-PET/CT experience) who were informed of the study purpose but were blinded to the clinical and pathological information reached a consensus regarding whether the primary lesion had abnormal [18F]-FDG uptake (greater than background activity in the surrounding tissues). A third radiologist (with 3 years of [18F]-FDG-PET/CT experience) performed quantitative analyses of visible primary lesions. The third radiologist generated the VOI by manually placing the region of interest on a suitable reference-fused axial image and defined the craniocaudal and mediolateral extent encompassing the entire visible primary lesion, excluding any adjacent physiological [18F]-FDG-avid structures. A 40% threshold of the maximum standardized uptake value (SUVmax) was used to define VOI boundaries [16]. The LIFEx package (version 6.00) [17] was used to extract 40 radiomic characteristics from PET images (Supplemental Table 1). These 40 radiomic characteristics fell into 5 categories: shape and first-order characteristics, gray level co-occurrence matrix, neighborhood gray level difference matrix, gray level run length matrix, and gray level zone length matrix. These characteristics were selected according to a previous report that examined the usefulness of an ML approach using [18F]-FDG PET-based radiomics to predict the histological subtypes of lung cancers [18]. LIFEx calculates textural characteristics only for VOIs of at least 64 voxels. SUVs within the VOI were resampled into discrete bins using absolute resampling to minimize the correlation between textural characteristics and reduce the noise impact and the matrix sizes [19]. A total of 64 bins were used for the PET component with minimum and maximum bounds of the resampling interval set to SUVs of 0 and 20, respectively, and the voxel size was resampled to 3.0 × 3.0 × 3.0 mm³. Therefore, a bin size of approximately 0.3 SUV (20/64) was used to analyze the PET component. Voxels with an SUV of >20 were grouped in the highest bin [19].
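The thresholding and binning steps described above can be illustrated with a minimal sketch. It assumes floor-based, fixed-width bins and a plain list of voxel SUVs; the function names are hypothetical, and the exact rounding convention used by LIFEx may differ.

```python
def threshold_voi(suvs, fraction=0.40):
    """Keep voxels whose SUV is at least `fraction` of the SUVmax in the region."""
    cutoff = fraction * max(suvs)
    return [s for s in suvs if s >= cutoff]


def suv_to_bin(suv, n_bins=64, suv_min=0.0, suv_max=20.0):
    """Map an SUV to a 0-based bin index using fixed-width (absolute) bins.

    Assumption: floor-based binning; values above the upper bound are
    placed in the highest bin, as stated in the text.
    """
    width = (suv_max - suv_min) / n_bins  # 20 / 64 = 0.3125 SUV
    if suv >= suv_max:
        return n_bins - 1
    if suv <= suv_min:
        return 0
    return int((suv - suv_min) // width)


# Hypothetical voxel values, for illustration only.
voxels = [1.0, 5.0, 10.0, 25.0]
kept = threshold_voi(voxels)           # cutoff is 40% of 25.0 = 10.0
bins = [suv_to_bin(s) for s in kept]
```

With 64 bins spanning SUVs of 0 to 20, each bin is 0.3125 SUV wide, which corresponds to the bin size of about 0.3 SUV given in the text.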

Because two different PET scanners were used, post-reconstruction harmonization was performed for all PET parameters using the ComBat harmonization method for R software (https://github.com/Jfortin1/ComBatHarmonization) [20], which has been shown to be effective for PET data [21].

Patient Treatment and Follow-Up

All patients underwent surgery with or without axillary lymph node dissection. Surgical procedures consisted of radical mastectomy and breast-conserving surgery. Patients were staged based on the American Joint Committee on Cancer (AJCC) TNM Staging Manual, 8th Edition [22].

Medical records provided information on patients’ prognoses. The last follow-up was conducted in December 2021. Disease recurrence, including local/regional recurrence, lymph node metastasis, distant metastasis at any site, or death from any cause, was established by clinical and imaging follow-up. When a patient with multiple lesions showed disease recurrence, only the lesion with the highest stage was selected for analysis. Follow-up consisted of required examinations once every 1–3 months for the first year, every 2–6 months for the next 2–5 years, and annually thereafter.

Histological Analyses

All pathological reports were reviewed, and all breast cancers were classified based on the World Health Organization histological classification [23]. Histological grading of invasive ductal carcinoma (IDC) was based on the tubular structure, nuclear pleomorphism, and mitotic count according to the Nottingham modification of the Scarff–Bloom–Richardson scoring system [24], with grades ranging from 1 to 3. The hormone receptor status (estrogen receptor [ER] and progesterone receptor [PR]) was analyzed immunohistochemically, and ER or PR positivity was defined as nuclear staining in >1% of cancer cells regardless of staining intensity in routine procedures [25]. The expression of human epidermal growth factor receptor-2 (HER-2) and the Ki-67 index in IDC were assessed immunohistochemically by routine procedures, and HER-2 overexpression (positivity) was defined based on the 2013 American Society of Clinical Oncology/College of American Pathologists guidelines [26]. The Ki-67 index was recorded as a percentage. Pathological axillary lymph node involvement was also recorded.

ML Approach

Twelve clinical characteristics (T stage, N stage, AJCC stage, histology, histological grade, three receptor statuses (ER, PR, and HER-2), Ki-67 index, and three tumor markers (CEA, CA15-3, and BCA225)), which have been reported as prognostic factors for breast cancer [24,25,26,27], and 40 radiomic characteristics were used to predict disease recurrence using ML approaches. Three different types of ML models were then constructed using clinical characteristics (clinical ML models), radiomic characteristics (radiomic ML models), and both clinical and radiomic characteristics (combined ML models).

Data were stratified by event and randomly assigned to training (80%) and testing (20%) cohorts. Seven ML algorithms that have been reported as popular classification algorithms were used for binary classification [28, 29]: decision tree, random forest (RF), neural network, k-nearest neighbors (kNN), naive Bayes, logistic regression, and support vector machine (SVM).
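A stratified 80/20 split of this kind can be sketched as follows. This is a minimal illustration, not the procedure run in Orange: the function name, the fixed seed, the rounding rule, and the example labels (100 hypothetical lesions with 10 events) are all assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.20, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Assumption: round to the nearest whole sample per class.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = recurrence, 0 = no recurrence.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

Stratifying by event keeps the small recurrence class represented in both cohorts; the actual cohort sizes obtained depend on how fractional counts are rounded.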

To address class imbalance, the synthetic minority over-sampling technique was applied to the training cohorts [30]. Because the sample in this study was relatively small (n=112), the set of characteristics had to be reduced to limit overfitting. On the training cohorts, a ranking-based method was applied to select a reduced set of characteristics based on the decrease in Gini impurity. As a rule of thumb, the number of characteristics used for a classification problem should be less than 10% of the sample size [31]. Because the sample size of this study was n=112, we selected the top ten ranked characteristics for constructing each ML model. Another safeguard against overfitting is a resampling technique called k-fold cross-validation [32, 33]. Ten folds have been reported to be a common choice for k-fold cross-validation, especially if the dataset is not too large or too sparse [33]. In this study, 10-fold cross-validation was used to minimize the negative influence of overfitting.
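The core idea of synthetic minority over-sampling can be shown in a few lines: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbors. The sketch below is a simplified illustration with hypothetical names and data, not the implementation used in the study; the number of neighbors `k` and the number of synthetic samples are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority samples.

    `minority` is a list of feature vectors (lists of floats) and must
    contain at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class, for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=6, k=2)
```

Over-sampling is applied to the training cohort only, so the testing cohort keeps its original class balance. For the feature count, 10% of 112 is 11.2, so ten characteristics stays under the rule-of-thumb limit.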

Receiver operating characteristic (ROC) curve analysis was performed to compare the predictive performance of the models, and the area under the ROC curve (AUC) was calculated. The computed performance measures were AUC, accuracy, F1 score, precision (positive predictive value), and recall (sensitivity), averaged over classes. The F1 score (F score or F measure) is the harmonic mean of precision and recall [18]. Each ML algorithm calculated the probability score (range, 0–1) of disease recurrence for each tumor. The predictive performance of each ML model was independently estimated in the testing set by quantifying the AUC, accuracy, F1 score, precision, and recall.

On the testing cohorts, the diagnostic indices including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were also calculated. The importance of each characteristic in the ML modeling process was calculated using the decrease in AUC [34]. A larger decrease in AUC for a characteristic indicates greater importance of that variable [34].
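One common way to obtain importance as a decrease in AUC is permutation importance: shuffle one characteristic at a time and measure how much the AUC drops. The sketch below assumes that approach; the function names, the toy scoring function, and the data are hypothetical, and the study's software may compute importance differently.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean decrease in AUC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auc([predict(r) for r in rows], labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - auc([predict(s) for s in shuffled], labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that uses only the first feature as its score."""
    return row[0]


rows = [[0.1, 7.0], [0.3, 2.0], [0.2, 9.0], [0.8, 4.0], [0.7, 1.0], [0.9, 6.0]]
labels = [0, 0, 0, 1, 1, 1]
importances = permutation_importance(toy_predict, rows, labels)
```

In this toy case the second feature is never used by the model, so shuffling it leaves the AUC unchanged and its importance is zero.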

ML was performed using Orange version 3.24.1 software (Bioinformatics Laboratory, University of Ljubljana, Ljubljana, Slovenia), an open-source data-mining and visualization package [35].

Statistical Analysis

The Mann–Whitney U test was used to assess differences in quantitative variables between two groups, and the chi-square test was used to compare categorical data. ROC analysis was performed to examine the diagnostic performance of each parameter to predict disease recurrence. The DeLong method was used to analyze the statistical significance of differences between AUCs [36]. The diagnostic indices including sensitivity, specificity, PPV, NPV, and accuracy were compared using McNemar’s test or the chi-square test.
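For paired comparisons of two classifiers on the same lesions, McNemar's test uses only the discordant pairs. The sketch below shows the exact (binomial) two-sided form as an illustration; the text does not state which variant was used, and the function name and example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases correct with model A but wrong with model B
    c: cases wrong with model A but correct with model B
    Under the null hypothesis, the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(0, 5)
```

With very few discordant pairs, as in a 23-lesion testing cohort, the exact test rarely reaches p<0.05, which is worth keeping in mind when differences between models are reported as non-significant.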

Data were presented as median and interquartile range (IQR). All p-values were two-sided, and p-values <0.05 indicated a statistically significant difference. MedCalc (MedCalc Software, Mariakerke, Belgium) software was used to perform statistical analysis.

Results

Patient Characteristics

Of the 232 patients, 48 who had lesions that were not [18F]-FDG-avid, 22 who did not undergo surgery, and 3 who had undergone surgery 4 months after pretreatment [18F]-FDG-PET/CT were excluded. Among the remaining 159 patients, 47 were further excluded because of small tumor sizes (n=30; VOI was <64 voxels), coexisting malignant tumors (n=4; two colon cancers, one lung cancer, and one malignant lymphoma), phyllodes tumor (n=1), fibroadenoma (n=1), or lack of follow-up information (n=11).

Finally, 112 patients (109 women and 3 men; mean age, 61 ± 13 years; age range, 30–87 years) with 118 lesions were eligible for analyses. Six patients had two lesions. The 118 lesions comprised 96 IDCs, 14 ductal carcinomas in situ (DCIS), and 8 others (2 tubular carcinomas, 2 microinvasive (ductal) carcinomas, 1 invasive micropapillary carcinoma, 1 matrix-producing carcinoma, 1 invasive lobular carcinoma, and 1 carcinoma with apocrine differentiation). Fourteen, 26, 51, and 27 lesions belonged to AJCC stages 0, I, II, and III, respectively. Following [18F]-FDG-PET/CT, patients underwent surgery within 3–67 (mean, 32) days.

The demographic and clinical characteristics of study participants in the training and testing cohorts are summarized in Table 1. Of the 95 lesions in the training cohort, 10 had disease recurrence: 2 local recurrences, 2 lymph node metastases, and 6 distant metastases (lung (n=2), bone (n=2), brain (n=1), and liver (n=1)). Of 23 lesions in the testing cohort, 3 exhibited distant metastases: lung (n=2) and bone (n=1). All six patients with two lesions showed no disease recurrence.

Table 1 Characteristics of 118 breast tumors

Significant differences were observed in histological grade, T stage, and AJCC stage between the non-recurrence and recurrence groups in the training (each, p<0.05) and testing (each, p<0.05) cohorts. Significant differences were observed in N stage and Ki-67 index between the non-recurrence and recurrence groups within the training cohort (Ki-67 index, p=0.024; N stage, p=0.016) but not within the testing cohort (Ki-67 index, p=0.15; N stage, p=0.22). No significant differences were observed in histology, the three receptor statuses, or the three tumor markers between the non-recurrence and recurrence groups in either the training or testing cohort.

Although a significant difference was observed in the N stage between the training and testing cohorts (p=0.028), no significant differences were observed in histology, the three receptor statuses, histological grade, Ki-67 index, the three tumor markers, T stage, or AJCC stage between the two cohorts (each, p>0.05).

ML Models to Predict Recurrence in Breast Cancer

Clinical and radiomic characteristics were ranked by the decrease in Gini impurity, and the top ten characteristics of the clinical, radiomic, and combined ML models are shown in Table 2. Each ML model was constructed using these top ten characteristics to limit overfitting.

Table 2 The top ten ranking of clinical and positron emission tomography-based radiomic characteristics for predicting disease recurrence in patients with breast cancer

Table 3 presents the diagnostic performance of each ML algorithm based on the clinical, radiomic, and combined ML models in the training and testing cohorts for predicting disease recurrence in breast cancer. In the training cohorts, all 7 ML algorithms except the logistic regression algorithm in the radiomic ML model (AUC, 0.760) achieved AUC values of >0.80 for predicting disease recurrence in the clinical (range, 0.892–0.999), radiomic (range, 0.809–0.984), and combined (range, 0.897–0.999) ML models.

Table 3 Diagnostic performance of each machine learning model using the top ten ranking characteristics in the training and testing cohorts for predicting disease recurrences in breast cancer

In the testing cohorts, for the clinical ML model, three of 7 ML algorithms (RF, naive Bayes, and SVM) achieved an AUC of >0.80 (range, 0.833–0.867), but the performance of the remaining 4 ML algorithms was poorer in the testing cohorts (AUCs of 0.592–0.750) than in the training cohorts. The RF algorithm achieved the highest AUC and accuracy among the 7 ML algorithms (AUC, 0.867; accuracy, 87.0% [20/23]).

For the radiomic ML model, three of 7 ML algorithms (RF, kNN, and naive Bayes) achieved AUC values of >0.80 in the testing cohort (range, 0.850–0.975), and the performance of the remaining 4 ML algorithms was poorer in the testing (AUCs of 0.400–0.800) than in the training cohorts. The naive Bayes algorithm achieved the highest AUC of 0.975 with a sensitivity of 100% (3/3), followed by the RF algorithm (AUC, 0.875); however, the naive Bayes algorithm showed the lowest specificity (70.0% [14/20]) and accuracy (73.9% [17/23]) among the 7 ML algorithms. In contrast, the RF algorithm and the decision tree algorithm (AUC, 0.800) achieved the highest specificity (90.0% [18/20]) and accuracy (87.0% [20/23]) among the 7 ML algorithms.

For the combined ML models, five of 7 ML algorithms (RF, neural network, kNN, naive Bayes, and logistic regression) achieved AUC values of >0.80 in the testing cohorts (range, 0.900–0.992), and the RF algorithm achieved the highest values of all diagnostic indices among the 7 ML algorithms (average over classes: AUC, 0.992; F1 score, 0.959; precision, 0.967; recall, 0.957; sensitivity, 100% [3/3]; specificity, 95.0% [19/20]; PPV, 75.0% [3/4]; NPV, 100% [19/19]; accuracy, 95.7% [22/23]).
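The indices reported for the RF algorithm of the combined ML model can be re-derived from the underlying counts (3 true positives, 0 false negatives, 1 false positive, and 19 true negatives among 23 testing lesions). The sketch below assumes that "average over classes" means a mean weighted by the number of lesions in each class; that weighting is an assumption, and the function name is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 score for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts implied by the reported fractions (3/3, 19/20, 3/4, 19/19).
tp, fn, fp, tn = 3, 0, 1, 19

sensitivity = tp / (tp + fn)                   # 3/3
specificity = tn / (tn + fp)                   # 19/20
ppv = tp / (tp + fp)                           # 3/4
npv = tn / (tn + fn)                           # 19/19
accuracy = (tp + tn) / (tp + fn + fp + tn)     # 22/23

# Per-class metrics, with each class taken as "positive" in turn.
recurrence = class_metrics(tp, fp, fn)
non_recurrence = class_metrics(tn, fn, fp)

# Assumption: class averages are weighted by class size (3 and 20 lesions).
n_rec, n_non = tp + fn, tn + fp
precision_avg, recall_avg, f1_avg = (
    (n_rec * r + n_non * n) / (n_rec + n_non)
    for r, n in zip(recurrence, non_recurrence)
)
```

Under this weighting the averaged precision, recall, and F1 score round to 0.967, 0.957, and 0.959, which agrees with the reported values.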

Robustness of ML Models in the Testing Cohort

Table 4 shows the testing-cohort diagnostic performance of the ML algorithms with the highest AUC or accuracy in the clinical, radiomic, and combined ML models for predicting disease recurrence in breast cancer.

Table 4 The testing cohort diagnostic performance of the highest AUC or accuracy in the clinical, radiomic and combined machine learning models for predicting disease recurrences in breast cancer

The RF algorithm achieved the highest AUC in both the clinical (0.867) and combined (0.992) ML models. In the radiomic ML model, the naive Bayes algorithm showed the highest AUC (0.975), but its accuracy was 73.9% (17/23), whereas the RF algorithm (AUC, 0.875) and the decision tree algorithm (AUC, 0.800) showed the highest accuracy of 87.0% (20/23).

Diagnostic indices including sensitivity, specificity, PPV, NPV, and accuracy of these ML algorithms were not significantly different among the 3 ML models (each, p>0.05); however, the RF algorithm of the combined ML model displayed the best performance among all ML algorithms. Moreover, the classification performance of this RF algorithm in the testing cohort was similar to that in the training cohort (AUC: training cohort, 0.999; testing cohort, 0.992). The important characteristics in this RF algorithm, calculated using the decrease in AUC, are shown in Supplemental Fig. 1. GLZLM_ZLNU was the most important characteristic, with the highest mean value (0.0063), followed by the AJCC stage with a mean value of 0.0052; these two characteristics contributed most to the modeling process.

Representative [18F]-FDG-PET/CT images for patients without and with disease recurrence are presented in Figs. 1 and 2, respectively.

Fig. 1

A 60-year-old woman with right breast cancer (invasive ductal carcinoma, T4N0M0, stage III, histological grade 2, ER: positive, PR: negative, HER-2: positive, Ki-67: 37.8%, CEA: 298.4 ng/ml, CA15-3: 90.8 U/ml, and BCA225: 3400 U/ml) showed no recurrence postoperatively. Transaxial pretreatment 2-deoxy-2-[18F]fluoro-d-glucose ([18F]-FDG) positron emission tomography/computed tomography image revealed abnormal [18F]-FDG uptake in the primary lesion. The yellow line represents the border of the volume of interest. The calculated probability scores for disease recurrence (recurrence, ≥0.5) were as follows: 0.80 for the RF algorithm (best-performing algorithm) in the clinical ML model; 0.99, 0.40, and 0.0 for the naive Bayes algorithm (highest AUC), the RF algorithm (highest accuracy), and the decision tree algorithm (highest accuracy), respectively, in the radiomic ML models; and 0.40 for the RF algorithm (best-performing algorithm) in the combined ML model. Although the scores of the clinical ML model and the radiomic naive Bayes algorithm (highest AUC) suggested recurrence, both highest-accuracy radiomic algorithms and the combined ML model predicted non-recurrence for this case. She has survived without recurrence for 104 months postoperatively.

Fig. 2

A 50-year-old woman with left breast cancer (invasive ductal carcinoma, T2N3M0, stage III, histological grade 3, ER: positive, PR: positive, HER-2: negative, Ki-67: 19%, CEA: 2.0 ng/ml, CA15-3: 14.9 U/ml, and BCA225: 35 U/ml) displayed recurrence after surgery. Transaxial pretreatment 2-deoxy-2-[18F]fluoro-d-glucose ([18F]-FDG) positron emission tomography/computed tomography image revealed focal [18F]-FDG uptake in the primary lesion. The yellow line represents the border of the volume of interest. The calculated probability scores for disease recurrence (recurrence, ≥0.5) were as follows: 0.39 for the RF algorithm (best-performing algorithm) in the clinical ML model; 0.99, 0.50, and 0.96 for the naive Bayes algorithm (highest AUC), the RF algorithm (highest accuracy), and the decision tree algorithm (highest accuracy), respectively, in the radiomic ML models; and 0.67 for the RF algorithm (best-performing algorithm) in the combined ML model. Although the score of the clinical ML model suggested non-recurrence, both the radiomic and combined ML models predicted recurrence for this case. She developed progressive disease and died 24 months postoperatively.

Discussion

The current study evaluated the usefulness of pretreatment clinical and [18F]-FDG-PET-based radiomic characteristics using the ML approach for predicting disease recurrence in patients with breast cancer who underwent surgery. The RF algorithm of the combined model displayed the best performance, with the highest AUC and accuracy among all ML algorithms. GLZLM_ZLNU and AJCC stage contributed most to the modeling process of this RF algorithm. Therefore, ML analyses using both clinical and [18F]-FDG-PET-based radiomic characteristics may be useful for predicting disease recurrence in patients with breast cancer who underwent surgery.

Several studies have reported classification methods based on the ML approach using [18F]-FDG-PET-based images in the field of breast cancer [10,11,12,13,14,15]. Romeo et al. used the ML approach to examine the classification value of radiomic characteristics in [18F]-FDG-PET/MRI to differentiate benign and malignant breast lesions. In that study, the adoption of SVM as a classifier yielded the highest classification performance (AUC, 0.983) [10]. Song reported the successful development of an ML model with the XGBoost algorithm using [18F]-FDG-PET/CT radiomics for the prediction of axillary lymph node metastasis; its sensitivity, specificity, and accuracy in the test cohorts were 90.9%, 71.4%, and 80%, respectively [11]. Morawits et al. examined the usefulness of ML models with the RF algorithm using morphologic and metabolic characteristics in [18F]-FDG-PET/MRI for the prediction of axillary lymph node metastasis, and the RF classifier achieved a sensitivity of 96.2% [12]. Li et al. examined the usefulness of an ML approach using [18F]-FDG-PET-based radiomics to predict the treatment response in patients with breast cancer, and they reported that the RF model achieved a prediction accuracy of 0.857 (AUC, 0.844) on the training split set and 0.767 (AUC, 0.722) on the independent testing set [13]. Alongi et al. reviewed the clinical indications and limitations of PET imaging for comprehensive artificial intelligence (AI) in relation to breast cancer, and they reported that although the current role of AI in this field remains undefined, several applications for PET/CT imaging are under development, with some interesting preliminary results, particularly in the staging phase, that might be clinically translated after further validation studies [37]. However, to our knowledge, no study has previously investigated the efficacy of [18F]-FDG-PET-based radiomics via an ML approach for predicting disease recurrence in patients with breast cancer who underwent surgery. Moreover, most studies included only [18F]-FDG-PET-based radiomics in the ML models.

In our study, to limit overfitting, 3 different ML models (clinical, radiomic, and combined) were constructed using the top 10 characteristics ranked by the decrease in Gini impurity to predict disease recurrence. In the training cohorts, all ML algorithms except the logistic regression algorithm in the radiomic ML model exhibited good classification performance, with AUC values of >0.80 for the clinical, radiomic, and combined ML models. However, in the testing cohorts, only 3 ML algorithms (RF, naive Bayes, and SVM) based on the clinical ML model and 3 ML algorithms (RF, kNN, and naive Bayes) based on the radiomic ML model achieved AUC values of >0.80, which we attribute to overfitting. In contrast, five (RF, neural network, kNN, naive Bayes, and logistic regression) of 7 ML algorithms achieved AUC values of >0.80 for the combined ML model. Although neither the highest AUC nor the highest accuracy of each representative ML algorithm was significantly different among these 3 ML models, the RF algorithm of the combined ML model was the best-performing classifier, achieving the highest diagnostic accuracy (95.7% [22/23]). Moreover, this RF algorithm exhibited similar classification performance between the training and testing cohorts (AUC: training cohort, 0.999; testing cohort, 0.992), and the important characteristics in the modeling process of this RF algorithm were GLZLM_ZLNU and AJCC stage. GLZLM_ZLNU is a higher-order texture feature that represents the non-uniformity of the length of the homogeneous zones [38]. Some reports have examined the characteristics of GLZLM_ZLNU with [18F]-FDG PET/CT in oncology [38, 39]; Cheng et al. [38] evaluated pretreatment [18F]-FDG PET/CT images of patients with oropharyngeal cancer and concluded that GLZLM_ZLNU is an independent prognostic factor. These findings indicate that combining clinical and [18F]-FDG-PET-based radiomic characteristics might provide more robust ML models than using clinical or radiomic characteristics alone for predicting disease recurrence. Thus, the ML approach combining clinical and [18F]-FDG-PET-based radiomic characteristics, especially GLZLM_ZLNU and AJCC stage, might predict disease recurrence in patients with breast cancer who underwent surgery.

This study has several limitations. First, this was a retrospective study with a relatively small study population. Moreover, more than half of the patients were excluded according to the exclusion criteria. However, excluding these patients was necessary to achieve the study purpose, and case selection bias was therefore inevitable. Thus, a prospective study of a much larger population is needed to validate and confirm our findings. In this regard, among the 48 excluded patients with non-[18F]-FDG-avid lesions, 8 patients had incomplete follow-up after surgery and one patient exhibited lung metastases due to oral carcinoma, while the remaining 39 patients showed no disease recurrence. Among the 30 excluded patients with [18F]-FDG-avid small (VOI <64 voxels) breast cancers, 26 patients showed no disease recurrence and one patient had incomplete follow-up after surgery, but 3 patients showed disease recurrence (2 local recurrences and 1 lymph node metastasis); these recurrences occurred in IDCs of AJCC stages I, II, and III, respectively. Thus, an alternative approach to predict recurrence would be necessary in patients with such [18F]-FDG-avid small breast cancers. Second, the use of two different PET/CT scanners may have affected the results of the [18F]-FDG-PET-based radiomic analyses. However, post-reconstruction harmonization using ComBat was conducted to mitigate this issue. Third, only the 40 radiomic characteristics extracted with the LIFEx software were used for the ML analyses, and other radiomic characteristics were not examined. However, the LIFEx software has been widely used for radiomic analyses in PET/CT studies [39, 40]. Fourth, only 7 classification ML algorithms (decision tree, RF, neural network, kNN, naive Bayes, logistic regression, and SVM) were used for the ML analyses, and other ML algorithms such as Fisher linear discriminant analysis [41, 42] were not used. Thus, it might be necessary to perform the ML analyses using alternative ML algorithms. However, the 7 ML algorithms applied have been reported as popular classification ML algorithms [28, 29]. We used only the ML algorithms that could be implemented in Orange software. Orange is a popular open-source tool that provides a visual approach to ML for interactive data analysis and enables easy construction and configuration of workflows for ML studies [35]. Finally, although internal and external validation showed high classification performance in the combined clinical and [18F]-FDG-PET-based radiomic characteristic models, a training–test scheme with a larger population might be preferred for model validation.

Conclusion

ML analyses using both clinical and [18F]-FDG-PET-based radiomic characteristics may be useful for predicting disease recurrence in patients with breast cancer who underwent surgery.