1 Introduction

Unlike other types of tumor, osteosarcoma has a high degree of heterogeneity, as illustrated in Fig. 1, which makes it difficult in some cases to reach a common diagnosis among pathologists [4, 13]. Therefore, automating the analysis of different types of tumor can help avoid observer bias, reduce diagnosis time, and explore various options for treatment.

The majority of tumor studies rely on Haematoxylin and Eosin (H&E) stained images [6], in which the stain dyes the nuclei blue and the background tissues pink in a histology slide. Currently, pathologists must manually evaluate these slides under a microscope to assess the extent of tumor and tumor necrosis. A study on renal cell carcinoma [5] found large disagreement between pathologists evaluating the same data samples.

This manual analysis by pathologists is a labor-intensive process and subject to observer bias. Hence, it is desirable to develop an automatic approach for histopathological slide classification of osteosarcoma. Whole slide scanning systems provide the opportunity to automate the analysis process. These systems digitize glass slides with the stained tissue at high resolution (up to 40x). The resulting digital whole slide images (WSIs) enable image processing and analysis techniques that utilize the morphological and contextual clues present in the WSI as features for tissue classification [7, 8].

However, there are several roadblocks towards a fully automatic system. The digital image quality is affected by slide preparation and poor staining response, which can cause many tissue and cellular regions to be under-represented.

Diverse cellular morphology, which results in variability within the same type of cells (Fig. 1a) and similarity between different cellular structures (Fig. 1b), can make classification of tumor slides challenging. In osteosarcoma in particular, both the tumor cells and some types of normal cells (precursor cells) are stained the same blue color, but the tumor cells are irregular in shape whereas the precursor cells are rounder, closer together, and more regular (Fig. 1c). Moreover, each tumor type is significantly different from the others, which makes it difficult to apply a method developed for one tumor type to another. Osteosarcoma has a particularly high degree of intra-tumor histological variability, and thus methods developed for lung or renal tumor types [5, 17] do not work well for it.

Fig. 1. Examples showing the complexity of the dataset. (a) Intra-class variance for the tumor class. (b) Inter-class similarity between the tumor and necrosis classes. (c) Similarity in color between tumor cells and precursor cells. (Color figure online)

In this paper, we propose a convolutional neural network (CNN) architecture to classify the H&E stained histopathology slides of osteosarcoma. The typical CNN architecture for image processing consists of a series of layers of convolution filters, interspersed with pooling layers. The convolution filters are applied to small patches of the input image to detect and extract image features. Our neural network architecture combines features of AlexNet [9] and LeNet [10] to develop a fast and accurate slide classification system. The proposed system does not require nuclei segmentation, which can be a difficult task due to the morphological and system limitations mentioned above. Instead, it works with the annotated image label to generate features at the class level. As the paper does not aim to calculate nuclei properties, we can focus on accurate and efficient class label identification.

Our Contribution. Our main contribution is a new, practical, and efficient application of CNNs that gives promising results in osteosarcoma image classification. We developed an efficient CNN architecture that classifies the input images into tumor classes, using data augmentation techniques that save time and space. We also provide comparative results of our proposed architecture against the existing AlexNet and LeNet architectures to show that the proposed architecture performs better in tumor classification.

1.1 Background

Osteosarcoma is a type of bone cancer. The tumor usually arises in the metaphyses of the long bones of the extremities, next to the growth plates. In order to gauge the extent of treatment response and accurately calculate the percentage of tumor necrosis, it is necessary to consider different histological regions such as clusters of nuclei, fibrous tissues, blood cells, calcified bone segments, marrow cells, adipocytes, osteoblasts, osteoclasts, haemorrhagic tumor, cartilage, precursors, growth plates, and osteoid (tumor osteoid and reactive osteoid) with and without cellular material. The goal of this paper is to utilize a CNN to identify the four regions of interest (Fig. 2), namely, (1) viable tumor, (2) coagulative necrosis, (3) fibrosis or osteoid, and (4) non-tumor (bone, cartilage). These four regions are used to extract information about the three main classes of interest: viable tumor, necrosis (coagulative necrosis, osteoid, and fibrosis), and other tissue (bone, blood vessels, cartilage, etc.).

Fig. 2. Different regions of interest in a slide: viable tumor, coagulative necrosis, osteoid, fibrosis, and non-tumor (bone) regions.

1.2 Related Work

Most of the existing work on tumor classification involves thresholding with region growing, k-means clustering, Otsu thresholding, and morphological features like area and shape. Arunachalam et al. [2] presented multi-level Otsu thresholding followed by shape segmentation to identify viable tumor, necrosis, and non-tumor regions in osteosarcoma histology slides. Malon et al. [12] trained a convolutional neural network to classify mitotic and non-mitotic cells using morphological features like color, texture, and shape.

In recent years, machine learning approaches like neural networks have been used for image classification and segmentation, but the majority of tumor studies focus on identifying a super-set of features, although not all features are relevant. A recent study on non-small cell lung cancer [17] isolated more than 9,000 features from images, consisting of parameters extracted from color, texture, object identification, granularity, density, etc. Ciresan et al. [3] pioneered the use of convolutional neural networks (CNNs) for mitosis counting in primary breast cancer grading. Litjens et al. [11] applied CNNs to identifying breast cancer metastases in sentinel lymph nodes and to prostate cancer detection.

Many of these methods are focused on nuclei segmentation rather than on classifying images as tumor or non-tumor. Recent studies have shown that deep learning methods are successful in nuclei segmentation and give promising results for image classification. Su et al. [16] used a fast-scanning deep convolutional neural network for region segmentation and classification in breast cancer, and Spanhol et al. [15] built on the existing AlexNet for different segmentation and classification tasks in breast cancer.

In summary, deep learning algorithms have been successfully applied in the past to tumor detection in breast and prostate cancer, but the work mentioned above focuses on nuclei segmentation, whereas evaluation of classification into tumor classes is limited.

In this paper, we propose a deep learning approach capable of assigning tumor classes (viable tumor, necrosis) vs. non-tumor directly to input slides in osteosarcoma, a type of cancer with significantly more variability in tumor appearance. We extend the successful AlexNet architecture proposed by Krizhevsky (see [9]) and the LeNet architecture introduced by LeCun (see [10]), which use gradient-based learning with the backpropagation algorithm.

2 Our Approach

2.1 Convolutional Neural Network

Convolutional neural networks (CNNs) are powerful tools in deep learning with a high success rate in image classification. The typical CNN architecture for image classification consists of a series of convolution filters paired with pooling layers. The convolution filters are applied to small patches of the input image and detect increasingly relevant image features like edges, shapes, and texture. The output of the CNN is one or more probabilities or class labels. According to Sirinukunwattana et al. [14], “Mathematically, CNN can be defined as a feed forward artificial neural network C which is composed of L layers \(\left( C_{1},C_{2},...,C_{L}\right) \) which maps an input vector x to an output vector y i.e.

$$\begin{aligned} y = f\left( x;w_{1},w_{2},...,w_{L}\right) = f_{L}\left( \cdot \,;w_{L}\right) \circ f_{L-1}\left( \cdot \,;w_{L-1}\right) \circ \dots \circ f_{1}\left( x;w_{1}\right) \end{aligned}$$
(1)

where \(w_{l}\) is the weight and bias vector for the lth layer \(f_{l}\).”

Our approach is conceptually simple. It operates directly on raw RGB data sampled from the source and is trained to classify patches into three bins: viable tumor, necrosis (coagulative necrosis, osteoid, fibrosis), and non-tumor. Classification in unseen images is done by applying the learned classifier as a sliding window over the data, as sketched below. Because the CNN operates on raw pixel values, no human input is needed besides the initial annotation of slides for training data, a significant advantage over previous attempts [2]. The CNN automatically learns a set of visual features from the training data.
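To make the sliding-window step concrete, the sketch below shows how a trained dl4j model (the library used in Sect. 3.2) might be slid over an unseen tile in non-overlapping 128\(\,\times \,\)128 steps. This is an illustrative reconstruction, not our exact code; in particular, `patchToINDArray` is a hypothetical helper for the image-to-tensor conversion (e.g., via DataVec's NativeImageLoader), and the stride and class ordering are assumptions.

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.awt.image.BufferedImage;

public class SlidingWindowClassifier {

    /**
     * Classify every non-overlapping 128x128 patch of a tile and return a
     * grid of predicted class indices (0 = viable tumor, 1 = coagulative
     * necrosis, 2 = osteoid/fibrosis, 3 = non-tumor).
     */
    static int[][] classifyTile(MultiLayerNetwork model, BufferedImage tile) {
        final int p = 128;
        int rows = tile.getHeight() / p, cols = tile.getWidth() / p;
        int[][] labels = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                BufferedImage patch = tile.getSubimage(c * p, r * p, p, p);
                INDArray x = patchToINDArray(patch);            // hypothetical helper: [1, 3, 128, 128]
                INDArray probs = model.output(x);               // softmax over the 4 classes
                labels[r][c] = Nd4j.argMax(probs, 1).getInt(0); // most probable class
            }
        }
        return labels;
    }

    /** Hypothetical conversion, e.g. via DataVec's NativeImageLoader. */
    static INDArray patchToINDArray(BufferedImage patch) {
        throw new UnsupportedOperationException("image-to-tensor conversion omitted");
    }
}
```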

We build on the existing, proven LeNet and AlexNet networks because finding a successful network configuration for a given problem can be a difficult challenge, given the total number of possible configurations that can be defined. The LeNet architectures [10] have been prototypes for many successful applications in image processing, particularly handwriting recognition and face detection. The data augmentation methods for reducing over-fitting on image data described by Krizhevsky [9] have been credited for their success in various object recognition applications.

2.2 CNN Architecture

CNN Design. Designing the architecture of a neural network is a complex task. We start with a simple 3-layer network [INPUT - CONVOLUTION - MAX POOL - MLP], described below; a minimal configuration sketch follows the layer list.

Fig. 3. Architecture of the convolutional neural network for the classification of osteosarcoma. The network consists of 3 convolution layers (C), 3 sub-sampling layers (P), and 2 fully connected multi-level perceptron layers (M).

(1) INPUT [128\(\,\times \,\)128\(\,\times \,\)3] will hold the raw pixel values of the image, i.e., an image of width 128, height 128, and three color channels (R, G, B).

(2) CONVOLUTION layer will compute the output of neurons that are connected to local regions in the input image. Each neuron computes the dot product between its weights and the small region it is connected to in the input volume. With 5\(\,\times \,\)5 filters and no padding, this results in a volume such as [124\(\,\times \,\)124\(\,\times \,\)4] for 4 filters.

(3) MAX POOL layer will down-sample along the spatial dimensions (width, height), resulting in volume [62\(\,\times \,\)62\(\,\times \,\)4].

(4) MLP layer will compute the class scores, resulting in a volume of size [1\(\,\times \,\)1\(\,\times \,\)4], where each of the 4 numbers corresponds to a class score for one of the 4 tumor regions. This simple neural network is not able to identify all the features, and its classification accuracy is very low. This leads to the requirement of increasing the number of hidden layers in the network. But including many hidden layers can increase the training time and memory requirements, making the network impractical. Hence a trade-off is needed between efficiency and accuracy. We experimented with different numbers of hidden layers to find the best output in terms of tumor identification and the computational resources needed (see Table 1).
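For concreteness, the following is a minimal dl4j sketch of this baseline network. It is an illustrative reconstruction under assumptions, not our exact code: the seed, ReLU activation, and loss function are placeholders, and the 0.9-era dl4j builder API is assumed.

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class SimpleBaselineNet {
    static MultiLayerNetwork build() {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .seed(42)
            .list()
            // CONVOLUTION: 4 filters of 5x5 on the 128x128x3 input -> 124x124x4
            .layer(0, new ConvolutionLayer.Builder(5, 5)
                .nIn(3).nOut(4).activation(Activation.RELU).build())
            // MAX POOL: 2x2 down-sampling -> 62x62x4
            .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2).build())
            // MLP output: softmax scores for the 4 tumor regions -> 1x1x4
            .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(4).activation(Activation.SOFTMAX).build())
            .setInputType(InputType.convolutional(128, 128, 3))
            .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```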

Table 1. Comparison of accuracy and running time for 3 different implementations of the neural network with different numbers of hidden layers

The detailed architecture of the five-level CNN for tumor classification is shown in Fig. 3. Our architecture combines the simplicity of the LeNet architecture with the data augmentation methods used by the AlexNet architecture. The lower 3 levels are comprised of alternating convolution and max-pooling layers. The first convolution layer has filter size 5\(\,\times \,\)5, used to detect low-level features like edges, and is followed by a max-pooling layer of scale 2 to down-sample the data. This data is then sent to a second layer of 5\(\,\times \,\)5 filters to detect higher-order features like texture and spatial connectivity, followed by another max-pooling layer. The last convolution layer uses a filter of size 3\(\,\times \,\)3 and a max-pooling layer of scale 2 for down-sampling to generate still higher-order features. The upper 2 layers are a fully-connected multi-level perceptron (MLP) neural network (hidden layer + logistic regression). The second layer of the MLP is the output layer, consisting of four neurons (see Table 2). The input to the first MLP layer is the set of all feature maps at the layer below, and the output is a class probability distribution from the four neurons (\(p_{1}\), \(p_{2}\), \(p_{3}\), \(p_{4}\)) for each image, where \(p_{1}\), \(p_{2}\), \(p_{3}\), \(p_{4}\) are the probabilities for viable tumor, coagulative necrosis, osteoid or fibrosis, and non-tumor, respectively. The sum of the output probabilities from the MLP is 1, ensured by the use of the softmax function as the activation in the output layer of the MLP. The convolution and max-pooling layers are feature extractors, and the MLP is the classifier.
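A hedged dl4j sketch of this five-level architecture (again assuming the 0.9-era builder API) follows. The filter and hidden-unit counts (16, 32, 64, 128), activation choice, and learning rate are placeholders, since Table 2's exact values are not reproduced here; the filter sizes, pooling scales, softmax output, and Gaussian weight initialization follow the text (see also "Initialization and Training" below).

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.Updater;
import org.deeplearning4j.nn.conf.distribution.NormalDistribution;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.*;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class OsteosarcomaNet {
    static MultiLayerNetwork build() {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .seed(123)
            // zero-mean Gaussian init with sigma = 10^-2 (see Initialization and Training)
            .weightInit(WeightInit.DISTRIBUTION)
            .dist(new NormalDistribution(0, 1e-2))
            .learningRate(0.01)           // illustrative value, not from the paper
            .updater(Updater.SGD)         // stochastic gradient descent
            .list()
            // C1: 5x5 filters for low-level features (edges)
            .layer(0, new ConvolutionLayer.Builder(5, 5).nIn(3).nOut(16)
                .activation(Activation.RELU).build())
            // P1: max-pooling, scale 2
            .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2).build())
            // C2: 5x5 filters for texture / spatial connectivity
            .layer(2, new ConvolutionLayer.Builder(5, 5).nOut(32)
                .activation(Activation.RELU).build())
            // P2: max-pooling, scale 2
            .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2).build())
            // C3: 3x3 filters for higher-order features
            .layer(4, new ConvolutionLayer.Builder(3, 3).nOut(64)
                .activation(Activation.RELU).build())
            // P3: max-pooling, scale 2
            .layer(5, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2).build())
            // M1: fully connected hidden layer
            .layer(6, new DenseLayer.Builder().nOut(128).activation(Activation.RELU).build())
            // M2: softmax output, one neuron per class (p1..p4 sum to 1)
            .layer(7, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(4).activation(Activation.SOFTMAX).build())
            .setInputType(InputType.convolutional(128, 128, 3))
            .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```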

Table 2. Architecture of the proposed convolutional neural network for osteosarcoma classification. The network is built of Input (I), Convolution (C), Max-Pooling (P) and fully connected (M) layers

Data Augmentation. The easiest and most common method to reduce over-fitting of data is to artificially augment the dataset using label-preserving transformations. We use two distinct data augmentation techniques, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. This is a significant saving in both space and time, since WSIs are huge and disk read/write is a time-consuming process. First, we randomly rotate the training images by (\(0^{\circ }\), \(90^{\circ }\), \(180^{\circ }\), \(270^{\circ }\)) and flip them along the vertical and horizontal axes to ensure that the network does not learn any rotation-dependent features. The second technique alters the intensities of the RGB channels in the training images [9]. We perform principal component analysis (PCA) on the set of RGB pixel values throughout the training set and then, for each training image, add the following quantity to each RGB image pixel (i.e., \(I_{xy} = [I_{xy}^R,I_{xy}^G,I_{xy}^B]^T\)): \([p_{1},p_{2},p_{3}][\alpha _1 \lambda _1,\alpha _2 \lambda _2,\alpha _3 \lambda _3]^T\), where \(p_{i}\) and \(\lambda _i\) are the i-th eigenvector and eigenvalue of the 3\(\,\times \,\)3 covariance matrix of RGB pixel values, respectively, and \(\alpha _i\) is a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. Data augmentation helps alleviate over-fitting by considerably increasing the amount of training data, removing rotation dependency, and making the training images invariant to changes in color brightness and intensity through PCA.
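The PCA color perturbation can be implemented in a few lines of plain Java; the sketch below is an illustrative reading of the formula above. It assumes the eigenvectors and eigenvalues of the 3\(\,\times \,\)3 RGB covariance matrix have already been computed offline (by any linear algebra routine) and that one set of \(\alpha _i\) values is drawn per image.

```java
import java.util.Random;

public class PcaColorJitter {
    private static final Random RNG = new Random();

    /**
     * Adds the AlexNet-style PCA color perturbation [9] to one image.
     * pixels:  float[n][3] RGB values of the image;
     * eigVecs: eigVecs[i] is the i-th eigenvector (length 3) of the 3x3
     *          covariance matrix of RGB values over the training set;
     * eigVals: the corresponding eigenvalues.
     */
    static void jitter(float[][] pixels, float[][] eigVecs, float[] eigVals) {
        // delta = [p1, p2, p3] [alpha1*lambda1, alpha2*lambda2, alpha3*lambda3]^T,
        // with one alpha_i ~ N(0, 0.1) drawn per image
        float[] delta = new float[3];
        for (int i = 0; i < 3; i++) {
            float alpha = (float) (RNG.nextGaussian() * 0.1);
            for (int ch = 0; ch < 3; ch++)
                delta[ch] += eigVecs[i][ch] * alpha * eigVals[i];
        }
        // Add the same perturbation to every pixel of the image
        for (float[] px : pixels)
            for (int ch = 0; ch < 3; ch++)
                px[ch] += delta[ch];
    }
}
```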

Initialization and Training. The network is trained with stochastic gradient descent. We initialized all weights by drawing small, random, unique values from a zero-mean Gaussian with standard deviation \(10^{-2}\), so that each layer calculates distinct updates and the layers integrate as different units of the full network.

3 Experimental Setup

3.1 Data

In digital histopathology, the H&E stained microscopic slides are scanned using powerful slide scanner systems, such as Aperio, and converted to digital whole slide images (WSIs). Each WSI supports up to 40X magnification, capturing bones, tissues, and cellular and sub-cellular structures such as nuclei and cytoplasm. After digitization, the digital slides were partitioned into smaller tiles that were evaluated by pathologists to identify patient cases that capture the variability in osteosarcoma. Each case consists of an average of 25 individual svs images representing different sections of the microscopic slide. Three patient cases were identified for training and testing purposes. The dataset used includes three random svs slides from each of the three patient cases. From these 9 svs slides, 81 random tiles of size 1024\(\,\times \,\)1024 that represent different tissue and cellular regions, containing both normal and malignant regions, were used. For the network to learn the correct representation of tumor, it is important that the training data contain enough information to allow discrimination between the different tissue and cellular structures present in the tiles. As such, the correct resolution for tile generation was determined through discussions with senior pathologists and was fixed at 20x, which was then used to generate the 81 random tiles.

The pathologists then used an in-house tool that we developed to annotate these 81 tiles as viable tumor, necrosis, non-viable tumor, and non-tumor. As it is difficult to feed 1024\(\,\times \,\)1024 images to the neural network, we extracted small patches from the tiles for training. The patch size was determined through initial trial runs on the network: 256\(\,\times \,\)256 patches limited the CNN due to memory issues, and the 64\(\,\times \,\)64 patch size gave very low accuracy. Hence we decided on a 128\(\,\times \,\)128 patch size. This resulted in about 5000 image patches in the dataset. 60% of the patches were used for training, 20% as the validation set, and the remaining 20% as the test set. Figure 4 shows some example patches in the training set.
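A patch-extraction sketch under these numbers follows. It assumes non-overlapping patches, which is consistent with the reported counts (81 tiles \(\times \) 64 patches per 1024\(\,\times \,\)1024 tile \(\approx \) 5000 patches), though the exact sampling scheme is not spelled out above.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class PatchExtractor {

    /**
     * Cut a 1024x1024 annotated tile into non-overlapping 128x128 patches.
     * One tile yields an 8x8 grid, i.e. 64 patches; 81 tiles then give
     * roughly 5000 patches, matching the dataset size above.
     */
    static List<BufferedImage> extractPatches(File tileFile) throws Exception {
        BufferedImage tile = ImageIO.read(tileFile);
        List<BufferedImage> patches = new ArrayList<>();
        final int p = 128;
        for (int y = 0; y + p <= tile.getHeight(); y += p)
            for (int x = 0; x + p <= tile.getWidth(); x += p)
                patches.add(tile.getSubimage(x, y, p, p));
        return patches;
    }
}
```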

Fig. 4. Example patches of the different types of regions found in the dataset.

3.2 Implementation

We used existing open source libraries to implement the neural network architecture. The architecture was developed in Java using the dl4j (Deeplearning4j) library [1]. The training data was fed to the network in batches of 100 to exploit parallelism and improve the network's efficiency.
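A minimal sketch of this batch-training loop in dl4j might look as follows; the `ListDataSetIterator` wrapping and the fixed 10-epoch loop (anticipating the stopping point reported in Sect. 3.3) are illustrative assumptions.

```java
import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

import java.util.List;

public class Trainer {

    /** Mini-batch training: batches of 100, up to 10 epochs (Sect. 3.3). */
    static void train(MultiLayerNetwork model, List<DataSet> trainPatches) {
        DataSetIterator iter = new ListDataSetIterator(trainPatches, 100); // batch size 100
        for (int epoch = 0; epoch < 10; epoch++) {
            iter.reset();
            model.fit(iter);   // one pass over all mini-batches
        }
    }
}
```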

3.3 Results

Evaluation. The objective of the network was to classify the input image patches into one of the four regions (viable tumor, coagulative necrosis, osteoid or fibrosis, non-tumor) mentioned before. The output of the neural network is a probability distribution summing to 1; the output class is the class with the highest probability. The regions coagulative necrosis, osteoid, and fibrosis fall into the necrosis class. The performance of the neural network was monitored by assessing the error rate on the validation set; once the error rate saturated, after 10 epochs, training was stopped. The total training time for our implementation of the network was around 7 min.

We evaluate the proposed method quantitatively using accuracy A = (True Positives + True Negatives)/(Total Sample Size), precision P = (True Positives)/(True Positives + False Positives), recall R = (True Positives)/(True Positives + False Negatives), and F1-score \(F_{1}\) = (2PR)/(P+R). Our implementation gives an F1-score of 0.86 and an accuracy of 0.84.
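These metrics can be computed with dl4j's built-in `Evaluation` class, as in the hedged sketch below; it accumulates a confusion matrix over the test set and reports accuracy, precision, recall, and F1 following the formulas above (averaged across the four classes).

```java
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class Evaluator {

    /** Report accuracy, precision, recall and F1 on the held-out test set. */
    static void evaluate(MultiLayerNetwork model, DataSetIterator testIter) {
        Evaluation eval = new Evaluation(4);        // 4 output classes
        while (testIter.hasNext()) {
            DataSet ds = testIter.next();
            INDArray predicted = model.output(ds.getFeatures());
            eval.eval(ds.getLabels(), predicted);   // accumulate confusion counts
        }
        System.out.println(eval.stats());           // prints A, P, R and F1
    }
}
```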

Comparative Results. The output of a neural network depends on the architecture of the network. Different architectures, with different depths and/or numbers of units in the hidden layers, produce different outputs. Shallower networks with fewer hidden units are more resistant to over-fitting, require less training data, and train faster per example, but can lose precision due to the lack of higher-order features. A deeper network with more hidden units may be able to learn patterns from the training data more precisely, but could over-fit the data and lose efficiency. In this section we present and compare the output of three architectures: AlexNet, LeNet, and our proposed architecture. We find that LeNet has the fastest running time, but the accuracy and precision of our proposed architecture are better than both AlexNet and LeNet (see Table 3).

We then compared these results with a recent study that used color-based multi-level segmentation [2] and found our results to be comparable in both efficiency and accuracy. Arunachalam et al. [2] used multi-level Otsu thresholding and clustering algorithms to segment out viable tumor, necrosis, and non-viable tumor regions. The accuracy of that method is around 90%, which is close to the accuracy of the neural network (see Table 4).

Results Discrepancy. The method proposed by Arunachalam et al. [2] depends on a threshold value derived through Otsu segmentation, which biases the results toward the training data. It can be argued that those results are prone to over-fitting and may not generalize well to other datasets, whereas the neural network learns its features from the input images, can thereby avoid over-fitting, and improves as more data is fed in.

Table 3. Comparison of accuracy, precision, recall, F1-score and running time for 3 different architectures
Table 4. Comparison of accuracy of our method with multi-level Otsu thresholding

4 Future Work

The architecture of the CNN proposed in this paper was chosen on the basis of the datasets and resources available. Justifying an architecture through theory is an ongoing research problem; currently this is done only through experiments and their output results. A deeper network architecture would allow for more variation in the input but would cost more resources. We can continue to explore different architectures and strategies for training the neural network by changing the hyper-parameters, by pre-processing the input data (e.g., using the LAB color space instead of the RGB space), or by augmenting the input data with the results of an initial segmentation (e.g., Otsu segmentation). These strategies may improve the output results.

The next step in the development of a fully automatic classification system is to map the output of the CNN onto the whole slide images. This can be done by applying the full convolutional neural network to generate color-coded likelihood maps for the pathologists. Such a fully automated system could then be used for clinical diagnosis.

5 Conclusion

In this paper, we proposed a deep learning approach using a convolutional neural network for tumor classification in osteosarcoma. The proposed method is efficient and accurate, and focuses on class-level identification instead of nuclei-level analysis. The training and evaluation were done on a dataset manually annotated by senior pathologists. As far as the authors are aware, this is the first paper describing the applicability of convolutional neural networks to diagnostic analysis of osteosarcoma. We have shown that the technique has high potential to improve the diagnostic process and to be used as a clinical tool in osteosarcoma analysis.