A simplified cluster model and a tool adapted for collaborative labeling of lung cancer CT scans
Introduction
Lung cancer, a highly invasive and rapidly metastasizing disease, is the most common type of cancer associated with a poor prognosis [1]. According to Bray et al. [2], in 2018 it was the leading cause of cancer deaths worldwide (18.4%), followed by stomach (8.2%), liver (8.2%), and breast (6.6%) cancer. Although a number of new targeted agents and immunotherapies are being developed, early detection and treatment are still the best options for the long-term survival of lung cancer patients [3], [4]. This requires routine monitoring of high-risk subjects using computed tomography (CT) and involves examination of a huge number of CT scans by radiologists [5]. Computer-aided diagnosis (CAD) tools based on machine learning (ML) models are intended to assist radiologists by marking suspicious features on chest images, thereby aiding human inspection.
The cornerstone of developing and improving accurate and computationally efficient ML models is the availability of high-quality training and testing datasets. The main datasets used in lung cancer research are the combined database of the Lung Image Database Consortium and the Image Database Resource Initiative (LIDC/IDRI) [6], the LUNA16 subset of the LIDC/IDRI database [7], the dataset provided by the LUNGx challenge organized by SPIE, the American Association of Physicists in Medicine (AAPM), and the National Cancer Institute (NCI) [8], the Lung Test Images from Motol Environment (Lung TIME) database [9], the ANODE09 database [10], the database of Lung CT Imaging Signs (LISS) [11], and the data from the National Lung Screening Trial (NLST) of the NCI [12].
Currently, most large datasets for lung cancer research are created from images acquired in screening trials and therefore consist of low-dose CT (LDCT) scans. Unsurprisingly, the most notable achievements in the performance of ML models have been made in this area [13], [14]. However, LDCT has its limitations [15], and for some scenarios, the use of standard-dose CT is preferable. Several studies report that standard-dose CT images provide data for radiomics analysis that can be used for early detection of metastases [16], [17], [18]. Unfortunately, these studies rely on non-public datasets of limited size, which does not allow fine-tuning of the proposed methods. The lack of large amounts of accurately annotated training data is currently a bottleneck for this line of research.
There is a variety of software tools developed for medical image annotation [19], [20], [21], [22], [23]. They enable partial or full automation of the labeling process, but the interpretation of radiological data still depends on human intelligence. Crowdsourcing platforms have performed well in cost-effective large-scale image annotation [24]; however, they have limitations as the correct reading of CT scans requires special training and experience [25]. Weak labeling approaches (for example, free-text radiology reports [26], bounding boxes, or outlier correction with the use of a weakly labeled atlas [27]) are proposed to reduce the workload of medical experts.
We propose an open-source tool adapted for collaborative multitenant annotation of CT scan datasets, available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The tool is based on a cluster model for nodule localization. The model’s main features are the tolerance to slight differences in interpretations of individual readers and the ability to describe complex-shaped lesions with low effort. Using the double-reading approach with arbitration for ground truth annotation, we have created CTLungCa-500, a publicly available “proof-of-concept” dataset of thoracic standard-dose CT scans, consisting of 536 cases of patients with a high risk of lung cancer.
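The grouping idea behind such a cluster model can be illustrated with a minimal sketch. It is not the FAnTom implementation: the `Mark` structure, the millimeter coordinates, and the rule that two marks belong to the same cluster when their spheres intersect are all assumptions made for illustration. Each mark is treated as a sphere defined by a center and a maximum transverse diameter, and clusters are the connected components of the overlap graph, so that slightly shifted marks from different readers, or several marks chained along a complex-shaped lesion, end up in one group.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class Mark:
    reader: str      # hypothetical reader identifier
    x: float         # center coordinates, assumed to be in mm
    y: float
    z: float
    diameter: float  # maximum transverse diameter, assumed to be in mm


def overlaps(a: Mark, b: Mark) -> bool:
    """Assumed rule: two marks overlap if their spheres intersect."""
    center_distance = dist((a.x, a.y, a.z), (b.x, b.y, b.z))
    return center_distance <= (a.diameter + b.diameter) / 2


def cluster_marks(marks: list[Mark]) -> list[list[Mark]]:
    """Group marks into connected components of the overlap graph."""
    parent = list(range(len(marks)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(marks)), 2):
        if overlaps(marks[i], marks[j]):
            parent[find(i)] = find(j)

    groups: dict[int, list[Mark]] = {}
    for i, mark in enumerate(marks):
        groups.setdefault(find(i), []).append(mark)
    return list(groups.values())


marks = [
    Mark("reader_a", 10.0, 10.0, 10.0, 8.0),
    Mark("reader_b", 12.0, 11.0, 10.0, 6.0),  # same lesion, slightly shifted
    Mark("reader_a", 60.0, 40.0, 25.0, 5.0),  # a separate lesion
]
clusters = cluster_marks(marks)
for cluster in clusters:
    print(sorted({m.reader for m in cluster}), len(cluster))
```

In this sketch, a cluster containing marks from only one reader could be a natural candidate for arbitration in a double-reading workflow, although the source does not state how disagreements are selected for review.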
Patient data
The Mandatory Health Insurance System of the Russian Federation provides free health services for everyone who resides permanently or temporarily in Russia. Federal Law No. 326-FZ regulates the collection of personal data relating to all diagnoses, outcomes, forms, duration, and scope of medical care. Clinical data are stored in the Unified Medical Information Analysis System (UMIAS), and the corresponding medical images are stored in the Unified Radiology Information System (URIS). For some
Results
The participants providing their data for the dataset were 60% females and 40% males aged 50 to 75 years ( and for females and males, respectively). For 72 subjects, radiologists did not find any lung abnormalities. The remaining 464 CT scans contained 3151 nodules marked by the readers.
Of these, 1761 (55.8%) nodules were recognized as malignant, and 445 (14.1%) nodules were assigned to the category of benign lesions. There were also 926 (29.4%) abnormalities of non-nodular
Discussion
The size and quality of a training dataset are the key factors for ML model performance in any application, including medical imaging [44]. Unfortunately, there are no standardized rules and guidelines on how to annotate medical image data properly. Almost every available collection of clinical CT scans has its own information organization design, with its advantages and limitations. Creating new datasets is a time-consuming and challenging task that requires human experts to provide a ground
Conclusion
We present a new simplified cluster model for nodule localization, which minimizes the inaccuracy of tumor shape approximation while utilizing the efficient maximum transverse diameter approach. The model is best suited for collaborative and crowdsourced projects using the double reading approach for CT scan interpretations. It automatically groups the marks made by different readers for the same nodule, even if there is some disagreement on the shape and size of the lesion and location of its
Availability of data and materials
The FAnTom software is available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The dataset supporting the conclusions of this article is available at https://mosmed.ai/en/datasets/ct_lungcancer_500/.
CRediT authorship contribution statement
S.P. Morozov: Project administration. V.A. Gombolevskiy: Conceptualization, Methodology, Formal analysis, Writing - review & editing. A.B. Elizarov: Software. M.A. Gusev: Software. V.P. Novik: Software. S.B. Prokudaylo: Software. A.S. Bardin: Data curation. E.V. Popov: Data curation. N.V. Ledikhova: Conceptualization, Methodology. V.Y. Chernina: Investigation. I.A. Blokhin: Investigation. A.E. Nikolaev: Investigation. R.V. Reshetnikov: Methodology, Formal analysis, Writing - review
Declaration of Competing Interest
The authors declare that they have no conflict of interest.
Acknowledgements
The authors thank Olga Korchazhkina, Senior Research Fellow of Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, for valuable discussions of the project. The authors are also grateful to Ekaterina Korepina and Nikolay Pavlov for their active and valuable participation in creating the web interface. The authors acknowledge Marina Vlasova for proof-reading the manuscript. Ethical approval no. 2 (1-II-2020) was granted by the regional ethics board,
References (48)
- et al. Lung cancer: biology and treatment options. Biochim. Biophys. Acta (2015)
- et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. (2017)
- et al. Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method. Lancet Digit. Health (2019)
- et al. Radiomics-based features for pattern recognition of lung cancer histopathology and metastases. Comput. Methods Prog. Biomed. (2018)
- et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage (2006)
- et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn. Reson. Imaging (2012)
- et al. Missed lung cancer: when, where, and why? Diagn. Interv. Radiol. (2017)
- et al. Medical image data and datasets in the era of machine learning: whitepaper from the 2016 C-MIMI meeting dataset session. J. Digit. Imaging (2017)
- et al. Computed tomography data collection of the complete human mandible and valid clinical ground truth models. Sci. Data (2019)
- et al. The global burden of cancer 2013. JAMA Oncol. (2015)
- Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin.
- Adjuvant therapy in non-small cell lung cancer: current and future directions. Oncologist
- Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med.
- The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys.
- LUNGx challenge for computerized lung nodule classification. J. Med. Imaging (Bellingham)
- Comparing and combining algorithms for computer-aided detection of pulmonary nodules in computed tomography scans: the ANODE09 study. Med. Image Anal.
- The LISS a public database of common imaging signs of lung diseases for computer-aided detection and diagnosis research and medical education. IEEE Trans. Biomed. Eng.
- End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med.
- Limitations of screening for lung cancer with low-dose spiral computed tomography. Clin. Cancer Res.
- Radiomics analysis of lung CT image for the early detection of metastases in patients with breast cancer: preliminary findings from a retrospective cohort study. Eur. Radiol.
- Computer-aided diagnosis for early-stage lung cancer based on longitudinal and balanced data. PLoS One
- ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinform.