A simplified cluster model and a tool adapted for collaborative labeling of lung cancer CT scans

doi:10.1016/j.cmpb.2021.106111

Computer Methods and Programs in Biomedicine

Volume 206, July 2021, 106111

Highlights

• We have developed a simplified cluster model and an open-source tool for annotating lung cancer CT scans.
• The model accelerates the nodule markup process and enhances its efficiency.
• Using the model, we have created the publicly available CTLungCa-500 dataset of lung CT images.
• The dataset contains crowd-sourced markup of healthy and tumorous tissues.

Abstract

Background and objective: Lung cancer is the most common type of cancer with a high mortality rate. Early detection using medical imaging is critically important for the long-term survival of the patients. Computer-aided diagnosis (CAD) tools can potentially reduce the number of incorrect interpretations of medical image data by radiologists. Datasets with adequate sample size, annotation, and truth are the dominant factors in developing and training effective CAD algorithms. The objective of this study was to produce a practical approach and a tool for the creation of medical image datasets.

Methods: The proposed model uses a modified maximum transverse diameter approach to mark a putative lung nodule. The modification allows a set of overlapping spheres of appropriate size to be used to approximate the shape of the nodule. The algorithm embedded in the model also groups the marks made by different readers for the same lesion. We used the data of 536 randomly selected patients from Moscow outpatient clinics to create a dataset of standard-dose chest computed tomography (CT) scans utilizing the double-reading approach with arbitration. Six volunteer radiologists independently produced a report for each scan using the proposed model, with the main focus on the detection of lesions with sizes ranging from 3 to 30 mm. An arbitrator then reviewed their marks and annotations.
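The grouping step described above can be illustrated with a minimal sketch. It assumes that two marks belong to the same lesion whenever their spheres touch or intersect; this criterion, along with the names `Mark`, `spheres_overlap`, and `group_marks`, is hypothetical and is not taken from the FAnTom source code, which may use a different rule.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple


@dataclass(frozen=True)
class Mark:
    """One spherical mark placed by a reader (hypothetical structure)."""
    reader: str
    center: Tuple[float, float, float]  # position in mm
    diameter: float                     # maximum transverse diameter in mm


def spheres_overlap(a: Mark, b: Mark) -> bool:
    # Assumed criterion: spheres intersect if the distance between centers
    # does not exceed the sum of their radii.
    return dist(a.center, b.center) <= (a.diameter + b.diameter) / 2


def group_marks(marks: List[Mark]) -> List[List[Mark]]:
    """Group marks into clusters of transitively overlapping spheres (union-find)."""
    parent = list(range(len(marks)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(marks)):
        for j in range(i + 1, len(marks)):
            if spheres_overlap(marks[i], marks[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, mark in enumerate(marks):
        clusters.setdefault(find(i), []).append(mark)
    return list(clusters.values())


marks = [
    Mark("reader_1", (10.0, 10.0, 10.0), 8.0),
    Mark("reader_2", (12.0, 10.0, 10.0), 6.0),  # overlaps the first mark
    Mark("reader_1", (50.0, 50.0, 50.0), 5.0),  # a separate lesion
]
clusters = group_marks(marks)
```

Because overlap is treated transitively, several spheres chained along an elongated lesion end up in one cluster even when the first and last spheres do not intersect directly.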

Results: The maximum transverse diameter approach outperformed the alternative methods (3D box, ellipsoid, and complete outline construction) in a study of 10,000 computer-generated tumor models of different shapes in terms of accuracy and speed of nodule shape approximation. The markup and annotation of the CTLungCa-500 dataset revealed 72 studies containing no lung nodules. The remaining 464 CT scans contained 3151 lesions marked by at least one radiologist: 56%, 14%, and 29% of the lesions were malignant, benign, and non-nodular, respectively. A total of 2887 lesions had the target size of 3–30 mm. Only 70 nodules were uniformly identified by all six readers. An increase in the number of independent readers providing CT scan interpretations led to an accuracy increase associated with a decrease in agreement. The dataset markup process took three working weeks.
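Reader agreement of the kind reported above can be tallied once marks have been grouped per lesion. The following is only a sketch on toy data: the input format (one set of reader identifiers per lesion) and the function names `agreement_histogram` and `fraction_unanimous` are assumptions made for illustration, not part of the published tool or dataset format.

```python
from collections import Counter
from typing import Iterable, List, Set


def agreement_histogram(lesions: Iterable[Set[str]]) -> Counter:
    """Map 'number of distinct readers who marked a lesion' to 'number of lesions'."""
    return Counter(len(readers) for readers in lesions)


def fraction_unanimous(lesions: List[Set[str]], n_readers: int) -> float:
    """Share of lesions that were marked by every one of the n_readers readers."""
    hist = agreement_histogram(lesions)
    total = sum(hist.values())
    return hist[n_readers] / total if total else 0.0


# Toy example with three readers and four lesions (not data from the study).
toy_lesions = [
    {"r1", "r2", "r3"},
    {"r1"},
    {"r2", "r3"},
    {"r1", "r2", "r3"},
]
hist = agreement_histogram(toy_lesions)
unanimous = fraction_unanimous(toy_lesions, n_readers=3)
```

With more readers, the unanimous bucket tends to shrink because every additional reader is another chance for disagreement, which is consistent with the trend the abstract describes.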

Conclusions: The developed cluster model simplifies the collaborative and crowdsourced creation of image repositories and makes it time-efficient. Our proof-of-concept dataset provides a valuable source of annotated medical imaging data for training CAD algorithms aimed at early detection of lung nodules. The tool and the dataset are publicly available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git and https://mosmed.ai/en/datasets/ct_lungcancer_500/, respectively.

Introduction

Lung cancer, a highly invasive and rapidly metastasizing disease, is the most common type of cancer associated with a poor prognosis [1]. According to Bray et al. [2], in 2018 it was the leading cause of cancer deaths worldwide (18.4%), followed by stomach (8.2%), liver (8.2%), and breast (6.6%) cancer. Although a number of new targeted agents and immunotherapies are being developed, early detection and treatment are still the best options for the long-term survival of lung cancer patients [3], [4]. This requires routine monitoring of high-risk subjects using computed tomography (CT) and involves examination of a huge number of CT scans by radiologists [5]. Computer-aided diagnosis (CAD) tools based on machine learning (ML) models are intended to assist radiologists by marking suspicious features on chest images to aid human inspection.

The cornerstone of developing and improving accurate and computationally efficient ML models is the availability of high-quality training and testing datasets. The main datasets used in lung cancer research are the combined database of the Lung Image Database Consortium and the Image Database Resource Initiative (LIDC/IDRI) [6], the LUNA16 subset of the LIDC/IDRI database [7], the dataset provided by the LUNGx challenge organized by SPIE, the American Association of Physicists in Medicine (AAPM), and the National Cancer Institute (NCI) [8], the Lung Test Images from Motol Environment (Lung TIME) database [9], the ANODE09 database [10], the database of Lung CT Imaging Signs (LISS) [11], and the data from the National Lung Screening Trial (NLST) of the NCI [12].

Currently, most large datasets for lung cancer research are created from images acquired in screening trials and therefore consist of low-dose CT (LDCT) scans. Unsurprisingly, the most notable achievements in the performance of ML models have been made in this area [13], [14]. However, LDCT has its limitations [15], and for some scenarios, the use of standard-dose CT is preferable. Several studies report that standard-dose CT images provide data for radiomics analysis that can be used for early detection of metastases [16], [17], [18]. Unfortunately, these studies rely on non-public datasets of limited size, which does not allow the fine-tuning of the proposed methods. The insufficient availability of large amounts of accurately annotated training data is currently a bottleneck for this line of research.

There is a variety of software tools developed for medical image annotation [19], [20], [21], [22], [23]. They enable partial or full automation of the labeling process, but the interpretation of radiological data still depends on human intelligence. Crowdsourcing platforms have performed well in cost-effective large-scale image annotation [24]; however, they have limitations as the correct reading of CT scans requires special training and experience [25]. Weak labeling approaches (for example, free-text radiology reports [26], bounding boxes, or outlier correction with the use of a weakly labeled atlas [27]) are proposed to reduce the workload of medical experts.

We propose an open-source tool adapted for collaborative multitenant annotation of CT scan datasets, available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The tool is based on a cluster model for nodule localization. The model’s main features are the tolerance to slight differences in interpretations of individual readers and the ability to describe complex-shaped lesions with low effort. Using the double-reading approach with arbitration for ground truth annotation, we have created CTLungCa-500, a publicly available “proof-of-concept” dataset of thoracic standard-dose CT scans, consisting of 536 cases of patients with a high risk of lung cancer.
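To make the idea of describing a complex-shaped lesion with a few overlapping spheres more concrete, the sketch below estimates the volume covered by a union of spheres on a regular grid. It is an illustration under stated assumptions, not the authors' implementation: the function name `union_volume`, the tuple layout `(cx, cy, cz, radius)` in millimetres, and the grid step are all hypothetical.

```python
from itertools import product
from math import pi
from typing import List, Tuple

Sphere = Tuple[float, float, float, float]  # (cx, cy, cz, radius), all in mm


def union_volume(spheres: List[Sphere], step: float = 0.5) -> float:
    """Estimate the volume (mm^3) of a union of spheres by counting grid cells
    whose centers fall inside at least one sphere."""
    # Axis-aligned bounding box that encloses all spheres.
    lo = [min(s[k] - s[3] for s in spheres) for k in range(3)]
    hi = [max(s[k] + s[3] for s in spheres) for k in range(3)]
    counts = [int((hi[k] - lo[k]) / step) + 1 for k in range(3)]

    inside = 0
    for i, j, k in product(*(range(n) for n in counts)):
        # Center of the current grid cell.
        x = lo[0] + (i + 0.5) * step
        y = lo[1] + (j + 0.5) * step
        z = lo[2] + (k + 0.5) * step
        if any((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= r * r
               for cx, cy, cz, r in spheres):
            inside += 1
    return inside * step ** 3


# One 10 mm sphere versus two overlapping 10 mm spheres along an elongated lesion.
single = union_volume([(0.0, 0.0, 0.0, 5.0)])
elongated = union_volume([(0.0, 0.0, 0.0, 5.0), (6.0, 0.0, 0.0, 5.0)])
exact_single = 4.0 / 3.0 * pi * 5.0 ** 3
```

The overlapping region is counted only once, so the union volume lies between that of one sphere and the sum of both; a finer `step` tightens the estimate at the cost of more grid cells.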

Patient data

The Mandatory Health Insurance System of the Russian Federation provides free health services for everyone who resides permanently or temporarily in Russia. Federal Law No. 326-FZ regulates the collection of personal data relating to all diagnoses, outcomes, forms, duration, and scope of medical care. Clinical data are stored in the Unified Medical Information Analysis System (UMIAS), and the corresponding medical images are stored in the Unified Radiology Information System (URIS). For some

Results

The participants providing their data for the dataset were 60% females and 40% males aged 50 to 75 years ($62.3 \pm 6.2$ and $62.4 \pm 8.7$ for females and males, respectively). For 72 subjects, radiologists did not find any lung abnormalities. The remaining 464 CT scans contained 3151 nodules marked by the readers.

Of these, 1761 (55.8%) nodules were recognized as malignant, and 445 (14.1%) nodules were assigned to the category of benign lesions. There were also 926 (29.4%) abnormalities of non-nodular

Discussion

The size and quality of a training dataset are the key factors for ML model performance in any application, including medical imaging [44]. Unfortunately, there are no standardized rules and guidelines on how to annotate medical image data properly. Almost every available collection of clinical CT scans has its own information organization design, with its advantages and limitations. Creating new datasets is a time-consuming and challenging task that requires human experts to provide a ground

Conclusion

We present a new simplified cluster model for nodule localization, which minimizes the inaccuracy of tumor shape approximation while utilizing the efficient maximum transverse diameter approach. The model is best suited for collaborative and crowdsourced projects using the double reading approach for CT scan interpretations. It automatically groups the marks made by different readers for the same nodule, even if there is some disagreement on the shape and size of the lesion and location of its

Availability of data and materials

The FAnTom software is available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The dataset supporting the conclusions of this article is available at https://mosmed.ai/en/datasets/ct_lungcancer_500/.

CRediT authorship contribution statement

S.P. Morozov: Project administration. V.A. Gombolevskiy: Conceptualization, Methodology, Formal analysis, Writing - review & editing. A.B. Elizarov: Software. M.A. Gusev: Software. V.P. Novik: Software. S.B. Prokudaylo: Software. A.S. Bardin: Data curation. E.V. Popov: Data curation. N.V. Ledikhova: Conceptualization, Methodology. V.Y. Chernina: Investigation. I.A. Blokhin: Investigation. A.E. Nikolaev: Investigation. R.V. Reshetnikov: Methodology, Formal analysis, Writing - review

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgements

The authors thank Olga Korchazhkina, Senior Research Fellow of Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, for valuable discussions of the project. The authors are also grateful to Ekaterina Korepina and Nikolay Pavlov for their active and valuable participation in creating the web interface. The authors acknowledge Marina Vlasova for proof-reading the manuscript. Ethical approval no. 2 (1-II-2020) was granted by the regional ethics board,

References (48)

H. Lemjabbar-Alaoui et al. Lung cancer: biology and treatment options. Biochim. Biophys. Acta (2015)
A.A.A. Setio et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. (2017)
P. Huang et al. Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method. Lancet Digit. Health (2019)
J.R. Ferreira Junior et al. Radiomics-based features for pattern recognition of lung cancer histopathology and metastases. Comput. Methods Prog. Biomed. (2018)
P.A. Yushkevich et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage (2006)
A. Fedorov et al. 3D Slicer as an image computing platform for the quantitative imaging network. Magn. Reson. Imaging (2012)
A.D. Ciello et al. Missed lung cancer: when, where, and why? Diagn. Interv. Radiol. (2017)
M.D. Kohli et al. Medical image data and datasets in the era of machine learning-whitepaper from the 2016 C-MIMI meeting dataset session. J. Digit. Imaging (2017)
J. Wallner et al. Computed tomography data collection of the complete human mandible and valid clinical ground truth models. Sci. Data (2019)
C. Fitzmaurice et al. The global burden of cancer 2013. JAMA Oncol. (2015)
F. Bray et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. (2018)
R. Sangha et al. Adjuvant therapy in non-small cell lung cancer: current and future directions. Oncologist (2010)
D. Aberle et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. (2011)
S.G. Armato et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. (2011)
S.G. Armato et al. LUNGx challenge for computerized lung nodule classification. J. Med. Imaging (Bellingham) (2016)
M. Dolejsi, J. Kybic, M. Polovincak, et al., The lung time: annotated lung nodule dataset and nodule detection...
B.V. Ginneken et al. Comparing and combining algorithms for computer-aided detection of pulmonary nodules in computed tomography scans: the ANODE09 study. Med. Image Anal. (2010)
G. Han et al. The LISS a public database of common imaging signs of lung diseases for computer-aided detection and diagnosis research and medical education. IEEE Trans. Biomed. Eng. (2014)
National Cancer Institute. National Lung Screening Trial, 2018, (https://www.cancer.gov/types/lung/research/nlst),...
D. Ardila et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. (2019)
J.R. Jett. Limitations of screening for lung cancer with low-dose spiral computed tomography. Clin. Cancer Res. (2005)
Y. Qi et al. Radiomics analysis of lung CT image for the early detection of metastases in patients with breast cancer: preliminary findings from a retrospective cohort study. Eur. Radiol. (2020)
T. Sun et al. Computer-aided diagnosis for early-stage lung cancer based on longitudinal and balanced data. PLoS One (2013)
C.T. Rueden et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinform. (2017)
