In a recent bibliometric analysis of artificial intelligence in radiology, the article entitled “A survey on deep learning in medical image analysis” was found to be the most cited publication, both overall and per year, confirming the overwhelming enthusiasm of the scientific community for deep learning applications in medical imaging [1, 2]. The authors of this insightful paper keenly draw attention to a number of unique challenges for successful deep learning modeling in radiology, underlining how lesion annotation faces the issues of both object detection and substructure segmentation. Among others, data-related problems are presented, such as the lack of large expert-labeled datasets, class imbalance, and the simplification behind dichotomous classifications (e.g., a retention cyst and an ectopic benign prostatic hyperplasia nodule in the peripheral zone have characteristic appearances at prostate MRI and differ from the normal benign background, but a model needs to recognize them as belonging to the same class). However, the terms “validation,” “reproducibility,” “generalizability,” and “transferability” are, quite surprisingly, never mentioned in the entire document.

Broadly, the ability to reproduce results is pivotal in research but often problematic in radiology, including in deep learning applications [3]. Despite its importance, there is significant inconsistency in the use of this terminology, at least partly due to the different backgrounds of the professionals involved (e.g., artificial intelligence experts use the term “validation” to refer to the tuning step, whereas physicians/radiologists commonly use it to indicate the final model testing) [4]. To simplify, we could say that deep learning modeling (the model building stage) ends once training and tuning (validation) are completed; to ensure that the model’s performance is dependable and to assess its generalizability, testing on external, independent datasets is required (a schematic sketch of this distinction is provided below). In this light, to gain trust in deep learning models and facilitate their adoption into clinical practice, we need to address the question “Is this model reproducible, or does it detect patterns merely valid in the training dataset?”

In the overabundant sea of newly trained and developed deep learning models proposed for a plethora of different tasks, the paper published in this issue of European Radiology joins a small group of refreshing and much-needed generalizability studies. Indeed, in the study by Netzer and colleagues, one can identify several points of strength that should serve as an example for future external test studies [5]. The authors assessed the performance of a previously trained model (UNETM) in the task of automatically detecting and annotating clinically significant prostate cancer (csPCa) lesions on biparametric prostate MRI scans [6]. Of note, the test dataset can be considered of high quality both in terms of overall size (640 cases in total) and diversity of sources (two separate institutions and a public dataset, multiple scanner vendors). Furthermore, acquisition protocols were not homogeneous across all sites, which better reflects real-world practice in prostate MRI [7]. In particular, the inclusion of a public dataset, from the PROSTATEx challenge, is especially important to allow comparability of results with similar future research efforts.
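To make the distinction between tuning (“validation”) and external testing concrete, the following is a minimal, purely illustrative sketch in Python. The data, the “institution” splits, and the generic scikit-learn classifier standing in for a deep learning network are all placeholders of our own, and the snippet is not meant to reproduce the pipeline used by Netzer and colleagues.

```python
# Illustrative only: "validation" here is the tuning split used during model
# building, while the external dataset is touched only once, for the final
# assessment of generalizability. All data and the model are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # stand-in for a deep learning model
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_internal, y_internal = rng.normal(size=(500, 16)), rng.integers(0, 2, 500)  # hypothetical institution A
X_external, y_external = rng.normal(size=(200, 16)), rng.integers(0, 2, 200)  # hypothetical other institutions

# Model building: training plus tuning ("validation" in the machine learning sense)
X_train, X_val, y_train, y_val = train_test_split(
    X_internal, y_internal, test_size=0.2, random_state=0, stratify=y_internal)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Tuning (validation) AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# External testing: performed once, on independent data, to assess generalizability
print("External test AUC:", roc_auc_score(y_external, model.predict_proba(X_external)[:, 1]))
```

The essential point is that the external dataset plays no role during model building and is reserved for a single, final measurement of generalizability.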

On the other hand, readers should also be aware of some limitations that need to be accounted for when interpreting the reported findings. The PROSTATEx data is fairly dated and not perfectly in line with current guidelines [8]. In addition, the highly curated nature of the external test data in the study may affect the translation of the reported results into clinical practice. Specifically, the distribution of csPCa-positive and csPCa-negative exams was balanced to remove potential confounding factors in the analysis of generalizability across data sources. While understandable from an analytical point of view, as discussed by the authors, this does not accurately reflect clinical practice, where csPCa prevalence may vary on a geographical basis as well as in relation to the nature of the medical institution where prostate MRI is performed (e.g., referral centers or less-specialized clinics). Ideally, computer-assisted diagnosis tools should perform robustly in all of these contexts.

Nevertheless, the experiment conducted by Netzer’s group provides useful insights. In our opinion, the most interesting finding is the need to tune the probability threshold applied to the model output in order to optimize its performance on new data. This indeed allowed the authors to obtain performance consistent with expectations from previous studies, without the need to fully or partially retrain the neural network [5, 6]. Such a solution may be a practical approach to address changes in data distribution across sites, as in the present study, or at the same site over time (i.e., data drift). It remains to be seen whether upcoming regulations on the use of artificial intelligence in healthcare, such as the European Commission’s AI Act, will consider this type of tuning akin to retraining [9]. Current laws tend to require recertification of medical software after any retraining, which may represent an insurmountable obstacle for radiologists who wish to maximize performance on their own hospital’s data. It is no surprise, after all, that regulatory compliance remains a central aspect of the clinical implementation of artificial intelligence [10].
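As a rough illustration of how such threshold tuning can be decoupled from retraining, the sketch below calibrates a decision cutoff on a small, site-specific set of model outputs while leaving the network weights untouched. The calibration data, the Youden-index criterion, and the exam-level framing are assumptions made for the example and do not reflect the exact lesion-level procedure of the original study.

```python
# Simplified sketch of post hoc threshold tuning on a small, site-specific
# calibration set, as opposed to retraining the network. The criterion used
# here (maximizing sensitivity + specificity - 1) is illustrative only.
import numpy as np
from sklearn.metrics import roc_curve

def tune_threshold(y_true, y_prob):
    """Pick the probability cutoff that maximizes Youden's J (tpr - fpr)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical calibration cases from the new site: labels and frozen-model probabilities
y_calib = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
p_calib = np.array([0.12, 0.30, 0.55, 0.41, 0.78, 0.62, 0.25, 0.90, 0.08, 0.47])

threshold = tune_threshold(y_calib, p_calib)

# At inference time the network weights stay untouched; only the cutoff changes.
p_new = np.array([0.35, 0.66, 0.15])
predictions = (p_new >= threshold).astype(int)
print(f"Site-specific threshold: {threshold:.2f}, predictions: {predictions}")
```

Whether regulators would treat this kind of operating-point adjustment as a software modification requiring recertification is, as noted above, an open question.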

In conclusion, high-quality efforts to externally validate machine learning models need to continue, for both research and commercially available solutions. Supporting this type of research is essential to the ongoing effort to improve the clinical applicability of the latest technologies and, ultimately, patient outcomes.