A simplified cluster model and a tool adapted for collaborative labeling of lung cancer CT scans

https://doi.org/10.1016/j.cmpb.2021.106111

Highlights

  • We have developed a simplified cluster model and an open-source tool for the annotation of lung cancer CT scans.

  • The model accelerates the nodule markup process and enhances its efficiency.

  • Using the model, we have created the publicly available CTLungCa-500 dataset of lung CT images.

  • The dataset contains crowd-sourced markup of healthy and tumorous tissues.

Abstract

Background and objective: Lung cancer is the most common type of cancer with a high mortality rate. Early detection using medical imaging is critically important for the long-term survival of the patients. Computer-aided diagnosis (CAD) tools can potentially reduce the number of incorrect interpretations of medical image data by radiologists. Datasets with adequate sample size, annotation, and truth are the dominant factors in developing and training effective CAD algorithms. The objective of this study was to produce a practical approach and a tool for the creation of medical image datasets.

Methods: The proposed model uses a modified maximum transverse diameter approach to mark a putative lung nodule. The modification allows a set of overlapping spheres of appropriate size to be used to approximate the shape of the nodule. The algorithm embedded in the model also groups the marks made by different readers for the same lesion. We used the data of 536 randomly selected patients of Moscow outpatient clinics to create a dataset of standard-dose chest computed tomography (CT) scans utilizing the double-reading approach with arbitration. Six volunteer radiologists independently produced a report for each scan using the proposed model, with the main focus on the detection of lesions with sizes ranging from 3 to 30 mm. After this, an arbitrator reviewed their marks and annotations.
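The idea of approximating a nodule with overlapping spheres can be illustrated with a minimal sketch. It is not the FAnTom implementation: the `(x, y, z, radius)` tuple layout, the function names, and the grid-sampling volume estimate are all assumptions made for illustration. The sketch compares the volume covered by three overlapping spheres placed along an elongated lesion with the volume of one sphere whose diameter spans the whole lesion.

```python
import math
from typing import List, Tuple

# Hypothetical mark layout: (x, y, z, radius), all in millimetres.
Sphere = Tuple[float, float, float, float]


def inside_any(point: Tuple[float, float, float], spheres: List[Sphere]) -> bool:
    """Return True if the point lies inside at least one sphere."""
    px, py, pz = point
    return any(
        (px - x) ** 2 + (py - y) ** 2 + (pz - z) ** 2 <= r * r
        for x, y, z, r in spheres
    )


def union_volume(spheres: List[Sphere], step: float = 0.5) -> float:
    """Estimate the volume of a union of spheres on a regular grid.

    A cubic cell of side `step` is counted when its centre is covered,
    so the result is an approximation that improves as `step` shrinks.
    """
    lo = [min(s[i] - s[3] for s in spheres) for i in range(3)]
    hi = [max(s[i] + s[3] for s in spheres) for i in range(3)]
    counts = [math.ceil((hi[i] - lo[i]) / step) for i in range(3)]
    covered = 0
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                centre = (
                    lo[0] + (i + 0.5) * step,
                    lo[1] + (j + 0.5) * step,
                    lo[2] + (k + 0.5) * step,
                )
                if inside_any(centre, spheres):
                    covered += 1
    return covered * step ** 3


# Illustrative (made-up) elongated lesion, 22 mm long and 10 mm wide.
three_spheres = [(0.0, 0.0, 0.0, 5.0), (6.0, 0.0, 0.0, 5.0), (12.0, 0.0, 0.0, 5.0)]
one_sphere = [(6.0, 0.0, 0.0, 11.0)]  # single sphere spanning the full length

print(f"three overlapping spheres: {union_volume(three_spheres):.0f} mm^3")
print(f"single enclosing sphere:   {union_volume(one_sphere):.0f} mm^3")
```

For an elongated shape, the single sphere encloses far more volume than the set of small spheres, which is the kind of over-estimation the overlapping-sphere modification is meant to reduce.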

Results: The maximum transverse diameter approach outperformed the alternative methods (3D box, ellipsoid, and complete outline construction) in accuracy and speed of nodule shape approximation in a study of 10,000 computer-generated tumor models of different shapes. The markup and annotation of the CTLungCa-500 dataset revealed 72 studies containing no lung nodules. The remaining 464 CT scans contained 3151 lesions marked by at least one radiologist: 56%, 14%, and 29% of the lesions were malignant, benign, and non-nodular, respectively. A total of 2887 lesions had the target size of 3–30 mm. Only 70 nodules were uniformly identified by all six readers. An increase in the number of independent readers providing CT scan interpretations led to an increase in accuracy accompanied by a decrease in agreement. The dataset markup process took three working weeks.
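The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below uses the per-category counts quoted in the Results section of this article (1761 malignant, 445 benign, 926 non-nodular); the variable names are illustrative only.

```python
# Scan-level counts from the abstract.
total_scans = 536
scans_without_nodules = 72
scans_with_lesions = total_scans - scans_without_nodules

# Lesion-level counts; category counts are those quoted in the Results section.
total_lesions = 3151
counts = {"malignant": 1761, "benign": 445, "non-nodular": 926}

# Share of each category relative to all marked lesions, in percent.
shares = {label: 100 * n / total_lesions for label, n in counts.items()}

# Lesions in the 3-30 mm target range and lesions found by all six readers.
target_share = 100 * 2887 / total_lesions
unanimous_share = 100 * 70 / total_lesions

# The three quoted categories do not cover every lesion; the excerpt does
# not say how the remainder was classified.
uncategorised = total_lesions - sum(counts.values())

print(scans_with_lesions, {k: round(v) for k, v in shares.items()})
print(round(target_share, 1), round(unanimous_share, 1), uncategorised)
```

Rounded to whole percent, the shares reproduce the 56%, 14%, and 29% stated in the abstract.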

Conclusions: The developed cluster model simplifies the collaborative and crowdsourced creation of image repositories and makes it time-efficient. Our proof-of-concept dataset provides a valuable source of annotated medical imaging data for training CAD algorithms aimed at early detection of lung nodules. The tool and the dataset are publicly available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git and https://mosmed.ai/en/datasets/ct_lungcancer_500/, respectively.

Introduction

Lung cancer, a highly invasive and rapidly metastasizing disease, is the most common type of cancer associated with a poor prognosis [1]. According to Bray et al. [2], in 2018 it was the leading cause of cancer deaths worldwide (18.4%), followed by stomach (8.2%), liver (8.2%), and breast (6.6%) cancer. Although a number of new targeted agents and immunotherapies are being developed, early detection and treatment are still the best options for the long-term survival of lung cancer patients [3], [4]. This requires routine monitoring of high-risk subjects using computed tomography (CT) and involves examination of a huge number of CT scans by radiologists [5]. Computer-aided diagnosis (CAD) tools based on machine learning (ML) models are intended to assist radiologists by marking suspicious features on chest images to aid human inspection.

The cornerstone of developing and improving accurate and computationally efficient ML models is the availability of high-quality training and testing datasets. The main datasets used in lung cancer research are the combined database of the Lung Image Database Consortium and the Image Database Resource Initiative (LIDC/IDRI) [6], the LUNA16 subset of the LIDC/IDRI database [7], the dataset provided by the LUNGx challenge organized by SPIE, the American Association of Physicists in Medicine (AAPM), and the National Cancer Institute (NCI) [8], the Lung Test Images from Motol Environment (Lung TIME) database [9], the ANODE09 database [10], the database of Lung CT Imaging Signs (LISS) [11], and the data from the National Lung Screening Trial (NLST) of the NCI [12].

Currently, most large datasets for lung cancer research are created from images acquired in screening trials and therefore consist of low-dose CT (LDCT) scans. Unsurprisingly, the most notable achievements in the performance of ML models have been made in this area [13], [14]. However, LDCT has its limitations [15], and for some scenarios, the use of standard-dose CT is preferable. Several studies report that standard-dose CT images provide data for radiomics analysis that can be used for early detection of metastases [16], [17], [18]. Unfortunately, these studies rely on non-public datasets of limited size, which does not allow the fine-tuning of the proposed methods. The insufficient availability of large amounts of accurately annotated training data is currently a bottleneck for this line of research.

There is a variety of software tools developed for medical image annotation [19], [20], [21], [22], [23]. They enable partial or full automation of the labeling process, but the interpretation of radiological data still depends on human intelligence. Crowdsourcing platforms have performed well in cost-effective large-scale image annotation [24]; however, they have limitations as the correct reading of CT scans requires special training and experience [25]. Weak labeling approaches (for example, free-text radiology reports [26], bounding boxes, or outlier correction with the use of a weakly labeled atlas [27]) are proposed to reduce the workload of medical experts.

We propose an open-source tool adapted for collaborative multitenant annotation of CT scan datasets, available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The tool is based on a cluster model for nodule localization. The model’s main features are the tolerance to slight differences in interpretations of individual readers and the ability to describe complex-shaped lesions with low effort. Using the double-reading approach with arbitration for ground truth annotation, we have created CTLungCa-500, a publicly available “proof-of-concept” dataset of thoracic standard-dose CT scans, consisting of 536 cases of patients with a high risk of lung cancer.
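The tolerance to differences between readers can be pictured as a grouping step over spherical marks. The sketch below is an assumption-laden illustration rather than the algorithm used in FAnTom: the `Mark` fields, the function names, and the linking rule (two marks belong to the same lesion when their spheres overlap) are all hypothetical. Marks are merged transitively with a small union-find structure, and each resulting cluster reports which readers contributed to it.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Mark:
    """One spherical mark placed by a reader (hypothetical layout, mm)."""
    reader: str
    x: float
    y: float
    z: float
    radius: float


def overlaps(a: Mark, b: Mark) -> bool:
    """Assumed linking rule: the two spheres intersect."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) < a.radius + b.radius


def group_marks(marks: List[Mark]) -> List[List[Mark]]:
    """Group marks into clusters of transitively overlapping spheres."""
    parent = list(range(len(marks)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(marks)):
        for j in range(i + 1, len(marks)):
            if overlaps(marks[i], marks[j]):
                parent[find(i)] = find(j)

    clusters: Dict[int, List[Mark]] = {}
    for i, mark in enumerate(marks):
        clusters.setdefault(find(i), []).append(mark)
    return list(clusters.values())


def readers_in(cluster: List[Mark]) -> Set[str]:
    """Readers who contributed at least one mark to the cluster."""
    return {m.reader for m in cluster}


# Made-up example: readers A and B mark the same lesion slightly differently,
# reader C marks a separate lesion elsewhere in the scan.
marks = [
    Mark("A", 10.0, 10.0, 10.0, 4.0),
    Mark("B", 11.0, 10.0, 9.0, 5.0),
    Mark("C", 60.0, 40.0, 20.0, 3.0),
]
for cluster in group_marks(marks):
    print(sorted(readers_in(cluster)), len(cluster))
```

In a double-reading workflow, the number of readers per cluster gives an arbitrator a simple signal of where the readers agreed and where only one of them marked a finding.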

Patient data

The Mandatory Health Insurance System of the Russian Federation provides free health services for everyone who resides permanently or temporarily in Russia. Federal Law No. 326-FZ regulates the collection of personal data relating to all diagnoses, outcomes, forms, duration, and scope of medical care. Clinical data are stored in the Unified Medical Information Analysis System (UMIAS), and the corresponding medical images are stored in the Unified Radiology Information System (URIS). For some

Results

The participants providing their data for the dataset were 60% females and 40% males aged 50 to 75 years (62.3±6.2 and 62.4±8.7 for females and males, respectively). For 72 subjects, radiologists did not find any lung abnormalities. The remaining 464 CT scans contained 3151 nodules marked by the readers.

Of these, 1761 (55.8%) nodules were recognized as malignant, and 445 (14.1%) nodules were assigned to the category of benign lesions. There were also 926 (29.4%) abnormalities of non-nodular

Discussion

The size and quality of a training dataset are the key factors for ML model performance in any application, including medical imaging [44]. Unfortunately, there are no standardized rules and guidelines on how to annotate medical image data properly. Almost every available collection of clinical CT scans has its own information organization design, with its advantages and limitations. Creating new datasets is a time-consuming and challenging task that requires human experts to provide a ground

Conclusion

We present a new simplified cluster model for nodule localization, which minimizes the inaccuracy of tumor shape approximation while utilizing the efficient maximum transverse diameter approach. The model is best suited for collaborative and crowdsourced projects using the double reading approach for CT scan interpretations. It automatically groups the marks made by different readers for the same nodule, even if there is some disagreement on the shape and size of the lesion and location of its

Availability of data and materials

The FAnTom software is available at https://github.com/Center-of-Diagnostics-and-Telemedicine/FAnTom.git. The dataset supporting the conclusions of this article is available at https://mosmed.ai/en/datasets/ct_lungcancer_500/.

CRediT authorship contribution statement

S.P. Morozov: Project administration. V.A. Gombolevskiy: Conceptualization, Methodology, Formal analysis, Writing - review & editing. A.B. Elizarov: Software. M.A. Gusev: Software. V.P. Novik: Software. S.B. Prokudaylo: Software. A.S. Bardin: Data curation. E.V. Popov: Data curation. N.V. Ledikhova: Conceptualization, Methodology. V.Y. Chernina: Investigation. I.A. Blokhin: Investigation. A.E. Nikolaev: Investigation. R.V. Reshetnikov: Methodology, Formal analysis, Writing - review

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgements

The authors thank Olga Korchazhkina, Senior Research Fellow of Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, for valuable discussions of the project. The authors are also grateful to Ekaterina Korepina and Nikolay Pavlov for their active and valuable participation in creating the web interface. The authors acknowledge Marina Vlasova for proof-reading the manuscript. Ethical approval no. 2 (1-II-2020) was granted by the regional ethics board,

References (48)

  • F. Bray et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. (2018)
  • R. Sangha et al. Adjuvant therapy in non-small cell lung cancer: current and future directions. Oncologist (2010)
  • D. Aberle et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. (2011)
  • S.G. Armato et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. (2011)
  • S.G. Armato et al. LUNGx challenge for computerized lung nodule classification. J. Med. Imaging (Bellingham) (2016)
  • M. Dolejsi, J. Kybic, M. Polovincak, et al., The lung time: annotated lung nodule dataset and nodule detection...
  • B.V. Ginneken et al. Comparing and combining algorithms for computer-aided detection of pulmonary nodules in computed tomography scans: the ANODE09 study. Med. Image Anal. (2010)
  • G. Han et al. The LISS a public database of common imaging signs of lung diseases for computer-aided detection and diagnosis research and medical education. IEEE Trans. Biomed. Eng. (2014)
  • National Cancer Institute. National Lung Screening Trial, 2018, (https://www.cancer.gov/types/lung/research/nlst),...
  • D. Ardila et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. (2019)
  • J.R. Jett. Limitations of screening for lung cancer with low-dose spiral computed tomography. Clin. Cancer Res. (2005)
  • Y. Qi et al. Radiomics analysis of lung CT image for the early detection of metastases in patients with breast cancer: preliminary findings from a retrospective cohort study. Eur. Radiol. (2020)
  • T. Sun et al. Computer-aided diagnosis for early-stage lung cancer based on longitudinal and balanced data. PLoS One (2013)
  • C.T. Rueden et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinform. (2017)