Introduction

The nuclear protein Ki-67 was first detected in a Hodgkin lymphoma cell line and introduced as a proliferation marker1. It was later confirmed that the antigen recognized by the monoclonal antibody Ki-67 is present during all cell cycle phases except G02, although its extent of expression varies between phases, with G1 showing the lowest expression3. Since excessive cellular proliferation correlates with the progression of malignancy, precise estimation of this protein marker can help physicians identify high-grade tumors and also carries prognostic value for tumor management4,5,6. Moreover, although tumoral cells may suppress the body's immune mechanisms, tumor infiltrating lymphocytes (TILs), an immune component acting against tumor progression, have been found to improve the outcome of breast cancer patients7,8,9,10. On the other hand, the heterogeneity of breast cancer complicates its treatment; accurate quantitative determination of markers such as Ki-67 and TILs can simplify this approach to some extent. The established method for Ki-67 detection is immunohistochemical (IHC) analysis using MIB-1 or SP6 monoclonal antibodies in a staining process performed on paraffin-embedded tissue11,12. TILs are scored on the same tissue blocks following the recommendations of the international TILs working group8. Because the scoring of both markers rests on an expert pathologist's judgment, inter-observer variation is inevitable. To increase the accuracy of estimation, it has been suggested to count all tumor cells in different fields of a breast tissue section; if this is not feasible, the pathologist is recommended to count at least 500–1000 cells in representative areas of the whole section12. This, in turn, can be very time consuming for large numbers of samples. These limitations prompt the need for an exact and consistent calculation of both markers, which could be provided by means of artificial intelligence (AI).

AI has significantly improved the speed and precision of clinical diagnosis and is becoming an inseparable part of many aspects of medicine13. Before the introduction of deep learning, conventional AI algorithms were commonly used; however, designing a generalized and robust method required field experts to extract handcrafted features. With the advent of deep neural networks, which automatically learn the most useful features from the input data, this issue has largely been resolved. In addition, when these algorithms are developed and trained with diverse and adequate data, they are generalizable and robust. Convolutional neural networks (CNNs) are a class of deep neural networks widely used in areas such as robotics, bioinformatics, and computer vision, and they have proven highly effective in image processing in particular14,15,16.

Although the Neocognitron was introduced by Fukushima et al. in 1979, and CNNs were later introduced by LeCun et al. in 1989 for handwritten digit classification, CNNs initially failed to achieve widespread success owing to computational barriers and a lack of sufficient data. In 2012, the improved computational capability of CNNs was demonstrated by AlexNet17, which won the ImageNet competition using a CNN. To overcome the shortcomings of manual assessment of Ki-67 and TILs, and to take advantage of the probable favorable role of both markers in the management of breast cancer, in this experimental study we designed AI-assisted methods, with emphasis on CNNs, for more accurate detection of tumoral cells along with Ki-67 and TILs. The contributions of this study can be summarized in four categories. First, we introduce a dataset with detection and classification annotations that provides a benchmark for Ki-67-stained cell detection, classification, proliferation index estimation, and tumor infiltrating lymphocyte (TILs) estimation. Second, we propose a novel pipeline for cell detection and classification and evaluate it on our benchmark. Third, we present a deep network, named PathoNet, that outperforms state-of-the-art backends within the proposed pipeline in detecting and classifying Ki-67 immunopositive cells, immunonegative cells, and lymphocytes. Lastly, we introduce a residual dilated inception module that provides higher accuracy without causing vanishing gradient or overfitting issues.

Literature review

The literature contains studies on the detection and estimation of Ki-67 by means of both deep learning and conventional machine learning algorithms. Several conventional methods have been suggested in this regard. A study on neuroendocrine tumors (NETs) presented a framework for Ki-67 assessment of NET samples that differentiates tumoral from non-tumoral cells (such as lymphocytes) and further classifies immunopositive and immunonegative tumor cells to achieve automatic Ki-67 scoring18. For tumor biopsies of meningiomas and oligodendrogliomas based on immunohistochemical (IHC) Ki-67-stained images, Swiderska et al. introduced a combination of morphological methods, texture analysis, classification, and thresholding19. Shi et al. carried out a study based on morphological methods to address the color-distribution inconsistency of different cell types in IHC Ki-67-stained nasopharyngeal carcinoma images; they suggested classifying image pixels using local pixel correlations taken from specific color spaces20. Geread et al. proposed a robust unsupervised method for discriminating between brown and blue colors21.

Despite these improvements, conventional methods not only lack generalization and accuracy compared to direct interpretation by pathologists, they are also complex to develop because they rely on handcrafted features. Deep learning methods, in contrast, have been studied and reported for different tasks on histopathological images, including image classification, cell detection, nuclei detection, and Ki-67 estimation. Xu et al. suggested using deep learning features in a multiple instance learning (MIL) framework for colon cancer classification22. Weidi et al. proposed estimating a cell spatial density map with CNNs to overcome the interference of cell clumping and overlapping in automated cell counting and detection23.

On the other hand, Cohen et al. suggested redundant counting instead of proposing a density map; moreover, they introduced a network derived from Inception networks, called Count-ception, for cell counting24. Spanhol et al. compared conventional methods with deep features in breast cancer evaluation25. In another study on breast cancer Ki-67 scoring, Saha et al. used decision layers to detect hotspots with gamma-mixture-model-assisted deep learning26. Zhang et al. used a CNN to classify images as benign or malignant and a single-shot multibox detector to assess the Ki-67 proliferation score in breast biopsies27. Sornapudi et al. extracted localized features by taking advantage of superpixels generated using clustering algorithms and then applied a CNN to the extracted features for nuclei detection28. Because manually labeled Ki-67 datasets are scarce, Jiang et al. proposed a model consisting of residual modules and Squeeze-and-Excitation (SE) blocks, named small SE-ResNet, which has fewer parameters in order to prevent over-fitting; they reported classification accuracy similar to ResNet when classifying samples as benign or malignant29. Liu et al. addressed cell counting as a regression problem by producing a cell density map in a preprocessing step and then utilized a stacked deep CNN model for counting30.

Publicly available benchmarks can be divided into benign-malignant classification and cell counting categories. For benign-malignant image classification, Spanhol et al. introduced BreakHis, which consists of breast cancer histopathological images obtained from partial mastectomy specimens of 82 patients at four different magnifications31. Diverse cell counting and nuclei detection datasets, such as the synthetically generated VGG-Cells32, real samples of human bone marrow by Kainz et al.33, the Modified Bone Marrow (MBM) and human subcutaneous adipose tissue (ADI) datasets by Cohen et al.24, and the Dublin Cell Counting (DCC) dataset proposed by Marsden et al.34, are among the many examples of presented datasets. However, none of the mentioned benchmarks supports both cell detection and classification. To the best of our knowledge, SHIDC-B-Ki-67 is the first benchmark of IHC-marked breast cancer specimens with cell annotations in three classes: immunopositive, immunonegative, and tumor infiltrating lymphocytes.

Dataset

The critical role of accurate data in developing deep learning models is evident to experts in the field. In this study, the unavailability of a comprehensive Ki-67-marked dataset led us to gather SHIDC-B-Ki-67, a large and varied collection of images labeled by expert pathologists. This dataset contains microscopic tru-cut biopsy images of malignant breast tumors, exclusively of the invasive ductal carcinoma type (Table 1). Images were taken from biopsy specimens gathered during a clinical study from 2017 to 2020. SHIDC-B-Ki-67 contains 1656 training and 701 test images. Detailed statistics of the annotated cells are given in Table 2. All patients who participated in this study had a pathologically confirmed diagnosis of breast cancer, and their breast tru-cut biopsies were taken at the pathology laboratories of hospitals affiliated with Shiraz University of Medical Sciences in Shiraz, Iran. The institutional review and ethics board committee of Shiraz University of Medical Sciences approved the study (ethics approval ID: IR.SUMS.REC.1399.756), and written informed consent was obtained from all patients willing to take part in the study. All procedures, including slide preparation, staining, and image acquisition, were performed according to institutional policies and regulations. Moreover, all data were anonymized.

Table 1 Tumor characteristics of breast cancer patients enrolled for sample collection.
Table 2 Statistics of the annotated cells.

Images were taken from slides prepared from breast mass tru-cut biopsies, which were stained for Ki-67 by the IHC method. Specific monoclonal antibodies (clone SP6) were obtained from Biocare Medical, CA, USA. The adjuvant detection kit, the Master Polymer Plus Detection System (peroxidase), was obtained from Master Diagnostica, Granada. Dimethylbenzene (xylene 99.5%) was obtained from Samchun Chemical Co., Ltd, South Korea; ethanol 100% from JATA Co., Iran, and 96% from Kimia Alcohol Zanjan Co., Iran; EDTA and Tris (molecular biology grade) from Pars Tous Biotechnology, Iran. Phosphate buffered saline (PBS \(1{\times}\)), 0.01 M, pH 7.4, was prepared. First, paraffin-embedded blocks of breast tissue were sectioned (4–5 microns) and fixed on glass slides. The prepared slides were then immunostained. Hematoxylin was used as a counterstain to perform nuclear staining and to allow semi-quantification of the extent of immunostaining in the subsequent evaluation. Accordingly, the expert pathologists identified the tumoral areas in each slide by visual analysis of tissue sections under a light microscope. The final diagnosis of each case was also approved by two experienced pathologists and confirmed by ancillary tests such as immunostaining for additional markers. An Olympus BX-51 system microscope with a \(10{\times}\) relay lens, coupled to an OMAX A35180U3 microscope digital color camera, was used to acquire digital images of the tumoral tissue slides. Complementary details of the camera and setup are provided in the supplementary information document. Images were acquired in RGB color space (24-bit color depth, 8 bits per channel) at \(400\times\) magnification, corresponding to a \(40\times\) objective lens. The stepwise acquisition of images is as follows: first, the pathologist identifies the tumor and defines a region of interest (ROI). To cover the whole ROI, several, possibly overlapping, images are captured. The pathologist preferentially selects images of the tumoral area, but some images also include transitional parts, e.g., mixed tumoral/non-tumoral areas. A final visual (i.e., manual) inspection discards out-of-focus images. The stained images were then labeled by expert pathologists as Ki-67 positive tumor cells, Ki-67 negative tumor cells, and tumor infiltrating lymphocytes. Figure 1 depicts some samples of the SHIDC-B-Ki-67 dataset.

Figure 1

SHIDC-B-Ki-67 dataset samples.

Labeling. Precision and quality of expert labels play a crucial role in the correctness of the learning process and the methods' accuracy. However, labeling real-world data is a challenging and labor-intensive task. In SHIDC-B-Ki-67, each image contains 69 cells on average, with a total of 162,998 cells. Manually labeling all cells requires time, effort, and precision from experts and can be tiresome and error-prone for large numbers of samples. Another major challenge in the process is choosing the label type. In a segmentation task, labeling requires assigning a class to each pixel; however, owing to overlapping pixels in many cells and the infeasibility of annotating each pixel at the scale of histopathological images, this approach is not applicable in our case. Detection tasks, on the other hand, are usually annotated with a bounding box around each object of interest; using this type of annotation in our case, where cells are small, abundant, and of different sizes, would complicate the network design. To overcome these issues, the cell center plus the cell type is selected as the annotation. Figure 2 demonstrates the labels in this study.

Figure 2

SHIDC-B-Ki-67 labeling process. (a) Capturing and cropping raw images, (b) specifying cell centers along with cell types by experts, (c) generating density maps from cell centers.

Although this type of annotation speeds up the labeling procedure, it is not without limitations. Since just one pixel is picked as the center of each cell, many pixels remain unlabeled. This makes the data unbalanced and, in turn, makes the learning process more laborious. Furthermore, this labeling approach is not appropriate for most ordinary neural network loss functions, because they cannot suitably represent the loss for this task. To clarify, consider a network that predicts the center of a cell with either a 2-pixel or a 200-pixel drift from the center pixel picked by the experts: a standard loss function cannot discriminate between these two predictions and scores them equally. Experts' annotations are also error-prone, which causes the same problem. These issues can be addressed by attaching an uncertainty to the center pixels annotated by the experts. The uncertainty is modeled as a Gaussian distribution with the labeled pixel as the center and an n-pixel variance for each cell. As a result, instead of having a 3-channel pixel as the label, a density map for each class is used. Consequently, the nature of the problem is converted into a density map estimation problem.
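To make this concrete, the following is a minimal sketch, written by us for illustration rather than taken from the original work, of how point annotations can be turned into per-class density maps; the (x, y, class_id) label format and the Gaussian width sigma are assumptions.

```python
# Illustrative sketch: convert cell-center annotations into per-class
# density maps. The (x, y, class_id) format and sigma are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_maps(labels, height=256, width=256, n_classes=3, sigma=3.0):
    """Build one density map per class from cell-center annotations."""
    maps = np.zeros((height, width, n_classes), dtype=np.float32)
    for x, y, cls in labels:
        maps[int(y), int(x), cls] = 1.0  # impulse at the labeled center
    for c in range(n_classes):
        # Spread each impulse into a Gaussian to model annotation uncertainty.
        maps[..., c] = gaussian_filter(maps[..., c], sigma=sigma)
    return maps

# Example: an immunopositive cell at (120, 80) and a lymphocyte at (30, 40).
density = make_density_maps([(120, 80, 0), (30, 40, 2)])
```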

Methodology

In the following sections, we explain our suggested pipeline for the detection and classification of Ki-67-stained cells and TILs. The pipeline takes advantage of a CNN to extract features and estimate density maps from an input RGB image.

U-Net is one of the most commonly used architectures in biomedical image segmentation35. It consists of a symmetric U-shaped architecture with two paths, named the encoder and the decoder. In each layer of the decoder, an up-sampling layer increases the feature map dimensions until they reach the input image size. U-Net is a fully convolutional model with 19 layers. The novelty of this method lies in the skip connections between corresponding encoder and decoder layers, which transmit high-detail features from encoder layers to the same-sized layers of the decoder; this leads to accurate localization results. Similar to many studies motivated by and designed on top of U-Net36,37, we propose PathoNet, a backend for the proposed pipeline based on the U-Net architecture.

Residual dilated inception module. Cell detection, classification, and counting in histopathological images is a specialized and error-prone task, and results may show inter-individual differences owing to the nature of the tissue, with its variety of cell types and high likelihood of overlapping cells, unless pathologists are very well experienced in the field. Nevertheless, detection of the aforementioned tumor features is crucial for an accurate diagnosis and for the physician's approach to disease management. This explains the need for accuracy in such networks, which is usually pursued by designing a deeply structured network with many parameters, yet doing so often causes vanishing gradient issues.

On the other hand, cell size may vary from image to image, and since cells of the same type usually cluster together, picking a suitable kernel size is crucial. Szegedy et al.38 proposed the inception module to provide a wider field of view without exponential parameter growth. In the inception module, instead of building a deeper network by stacking convolutional layers, parallel convolutional layers are added to make the network wider. Therefore, by increasing the number of kernels in a layer, higher accuracy can be achieved without facing the vanishing gradient problem. Also, by utilizing different kernel sizes in one module, the problem of choosing a fixed kernel size was solved, and as a result the field of view increased. In the inception module proposed by Szegedy et al., three parallel kernels of sizes 5 × 5, 3 × 3, and 1 × 1 were used alongside a max-pooling path. Also, to prevent immense growth in computation, input tensor channels were reduced using 1 × 1 kernels before the 5 × 5 and 3 × 3 kernels. Still, this method increases the number of network parameters, which means the network needs more data to train and is more likely to over-fit. To overcome this issue, Yang et al.39 employed dilated convolutions instead of regular convolutional layers inside the inception module. A d-dilated convolution has a distance of d between each pair of kernel elements and can therefore cover a wider region. Figure 3 shows this operator with different dilation rates. Equation (1) shows a dilated convolution with kernel size K and dilation rate d, written in one dimension for clarity.

$$\begin{aligned} output[i] = \sum _{n=1}^K Input[i+d\cdot n]\cdot Kernel[n] \end{aligned}$$
(1)

In other words, an ordinary convolution is a dilated convolution with a dilation rate of one. By using dilated convolutions in the inception module, Yang et al.39 maintained accuracy while reducing the number of model parameters. Also, instead of using multiple 1 × 1 kernels before the other kernels, a single 1 × 1 kernel whose output is shared with the subsequent kernels was used. In their parallel convolutions, the outputs are summed, while in the inception module proposed by Szegedy et al. the outputs are concatenated, which consumes more memory. In a study by He et al.40, the authors used ResNet blocks to design deeper architectures without facing overfitting or the vanishing gradient effect; in a ResNet block, the activation output of a layer is summed with the output of the previous layers. Motivated by these studies, in this article we use a new inception module called the residual dilated inception module (RDIM). The RDIM consists of two parallel paths: the first path has two convolution layers with kernel size 3 × 3, and the second is built by stacking two 3 × 3 dilated convolution layers with a dilation rate of 4. Finally, the outputs of the two paths are summed with the module input. However, since the number of input channels must match the output channels to perform the summation, two variants of the module are used, one in the encoder and one in the decoder. In the encoder, where the input channels are half of the output channels, the duplicated input is summed with the result of the two paths. In the decoder, a 1 × 1 convolution with the same number of kernels as the two paths is applied to the input and then summed with the results of the parallel routes. Figure 4 shows the inception and residual dilated inception module structures.
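For clarity, the description above can be translated into the following Keras sketch. This is our reading of the module, not the authors' released implementation: the filter counts, ReLU activations, and exact residual wiring are assumptions.

```python
# A sketch of the residual dilated inception module (RDIM) as described in
# the text; activations and exact wiring are our assumptions.
from tensorflow.keras import layers

def rdim(x, filters, decoder=False):
    # Path 1: two stacked 3x3 convolutions.
    p1 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    p1 = layers.Conv2D(filters, 3, padding="same", activation="relu")(p1)
    # Path 2: two stacked 3x3 dilated convolutions, dilation rate 4.
    p2 = layers.Conv2D(filters, 3, padding="same", dilation_rate=4,
                       activation="relu")(x)
    p2 = layers.Conv2D(filters, 3, padding="same", dilation_rate=4,
                       activation="relu")(p2)
    if decoder:
        # Decoder variant: a 1x1 convolution matches the input channels.
        shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    else:
        # Encoder variant: the input has half the output channels, so the
        # duplicated input serves as the residual connection.
        shortcut = layers.Concatenate()([x, x])
    # Both paths and the (adapted) input are summed.
    return layers.Add()([p1, p2, shortcut])
```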

Figure 3

Dilated convolution visualization. Dilated convolution kernels presented with dilation rates (a) 1, (b) 2, and (c) 4. Increasing the dilation rate increases the kernel’s field of view.

Figure 4

Inception modules. Comparison of the conventional and proposed inception modules: (a) conventional inception module, (b) residual dilated inception module (encoder path), (c) residual dilated inception module (decoder path).

PathoNet. PathoNet first extracts features from input images and then predicts candidate pixels for Ki-67 immunopositive cells, immunonegative cells, and lymphocytes, together with their corresponding density values. The proposed backend utilizes a U-Net-like backbone in which, except for the first layer, the convolutional layers are replaced by RDIMs. In PathoNet, the input first passes through two convolutional layers; then three RDIMs are used in the encoder and four in the decoder. Finally, a layer consisting of three 1 × 1 convolution layers with a linear activation function produces the three-channel output of the model. Figure 5 demonstrates the PathoNet architecture. The network outputs a three-channel 256 × 256 matrix in which each channel corresponds to the density map of the Ki-67 immunopositive, immunonegative, or lymphocyte class.
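One plausible assembly of these pieces, using the rdim() sketch above, is shown below. The layer widths, pooling positions, and the placement of the fourth decoder RDIM at the bottleneck are our assumptions; only the overall structure (two initial convolutions, three encoder RDIMs, four decoder RDIMs, and a three-channel 1 × 1 linear output) follows the text.

```python
# A structural sketch of PathoNet, assuming the rdim() block defined above.
# Filter widths and skip-connection details are assumptions.
from tensorflow.keras import Model, layers

def pathonet(input_shape=(256, 256, 3), base_filters=16):
    inputs = layers.Input(input_shape)
    # First block: the only plain (non-RDIM) convolutional layers.
    x = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(x)
    skips, f = [], base_filters
    # Encoder: three RDIMs separated by max-pooling; channels double each time.
    for _ in range(3):
        skips.append(x)
        x = layers.MaxPooling2D()(x)
        f *= 2
        x = rdim(x, f)
    # Decoder: four RDIMs (one read as a bottleneck block) with upsampling
    # and U-Net-style skip connections.
    x = rdim(x, f, decoder=True)
    for skip in reversed(skips):
        f //= 2
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skip])
        x = rdim(x, f, decoder=True)
    # Output head: 1x1 convolutions with linear activation, one channel per
    # class (immunopositive, immunonegative, lymphocyte).
    outputs = layers.Conv2D(3, 1, activation="linear")(x)
    return Model(inputs, outputs)
```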

Figure 5

PathoNet architecture.

Watershed. The Watershed algorithm is a conventional method widely used in medical and materials science image segmentation41. Watershed was first introduced in 197842, but over the past decades different versions of the method have been proposed43. The algorithm maps a grayscale image to a topographic relief space in which each point corresponds to a pixel of the input image, with height equal to the pixel intensity. This relief space consists of different regions, namely low-lying valleys (minima), high-altitude ridges (watershed lines), and slopes (catchment basins); these regions are demonstrated in Fig. 6. The Watershed algorithm's objective is to find the catchment basins and watershed lines: watershed lines delineate the segment borders, while catchment basins form the image segments. Algorithm 1 describes a simple Watershed algorithm.

Figure 6

The Watershed algorithm maps grayscale images to a topographic relief space.

Algorithm 1. A simple Watershed algorithm.

Proposed pipeline. The proposed pipeline consists of three components: (1) the PathoNet network, (2) post-processing, and (3) the Watershed algorithm. The Watershed and post-processing components contain no trainable elements; therefore, in the training phase, only the PathoNet component is trained. During the test phase, PathoNet generates a 3-channel density map from an input image, where each channel corresponds to the density map of one class. Since multiple pixels in a small region of the map may have close or equal densities, choosing a single pixel as the center is ambiguous; besides, noise and low-density points increase the number of false-positive predictions. Hence, a post-processing stage is added to the pipeline. In this stage, first, points below a specified threshold are removed, and points above it are mapped to 255. Second, a distance transformation is applied to each channel, producing a grayscale image in which the value of each point in a continuous region is its distance from the region border. After the distance transformation, most regions contain a single maximum point, but non-circular regions and overlapping cells may yield multiple maxima. Finally, we apply the Watershed algorithm to segment these regions and the overlapping cells, which produces the cell center coordinates. Note that because Watershed finds minimum values, an inverse operation is applied before performing the Watershed method in the third step of the post-processing phase. Figure 7 presents the proposed pipeline.
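A minimal OpenCV realization of this post-processing stage might look as follows. This is our sketch, not the authors' code: the threshold value, the peak criterion used to seed the markers, and the use of OpenCV's marker-based watershed (which seeds from maxima rather than explicitly inverting the map) are all assumptions.

```python
# Sketch of the post-processing stage: threshold -> distance transform ->
# watershed -> cell centers. Threshold and peak criterion are assumptions.
import cv2
import numpy as np

def centers_from_density(density, thresh=0.3):
    """Extract cell-center coordinates from one density-map channel."""
    # 1. Threshold: drop noise and low-density points, map the rest to 255.
    binary = np.where(density > thresh, 255, 0).astype(np.uint8)
    # 2. Distance transform: interior pixels score by distance to the border.
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    # 3. Seed watershed markers from distance-map peaks (marker-based
    #    watershed plays the role of the inversion step described above).
    peaks = (dist > 0.5 * dist.max()).astype(np.uint8)
    n, markers = cv2.connectedComponents(peaks)
    markers = cv2.watershed(cv2.cvtColor(binary, cv2.COLOR_GRAY2BGR),
                            markers.astype(np.int32))
    # 4. Report each segmented region's centroid as a cell center.
    centers = []
    for label in range(1, n):
        ys, xs = np.where(markers == label)
        if len(xs):
            centers.append((int(xs.mean()), int(ys.mean())))
    return centers
```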

Figure 7

The proposed pipeline. First, a backend, a CNN-based density map estimator, predicts a density map for each class. Then thresholding is applied to the density maps, producing binary images. Next, the inverse distance transform scores region centers with low values and borders with high values. Finally, the Watershed algorithm predicts cell centers, and the pipeline outputs the cell coordinates.

Experimental setup. To obtain balanced training and test sets, 70% of each patient's images were randomly selected for the training set and 30% for the test set. The resulting splits are included in the dataset files. All methods were trained on the training set, and results are reported on the test set. The MSE loss function with the ADAM optimizer was used for training. The learning rate was empirically set to 0.0001 and decreased by a factor of 0.1 every ten epochs. The Keras framework44 was used to train the networks on two NVIDIA GeForce GTX 1060 GPUs and an Intel Core i5-6400 processor.
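The training configuration described above can be reproduced in Keras roughly as follows; the batch size and number of epochs are our placeholders, not values from the paper.

```python
# Training setup per the text: MSE loss, ADAM, learning rate 1e-4 decreased
# by a factor of 0.1 every ten epochs. Batch size/epochs are placeholders.
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import Adam

def schedule(epoch):
    # Multiply the initial rate by 0.1 for every ten completed epochs.
    return 1e-4 * (0.1 ** (epoch // 10))

model = pathonet()  # the sketch from the PathoNet section
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse")
# model.fit(train_images, train_density_maps, batch_size=8, epochs=100,
#           callbacks=[LearningRateScheduler(schedule)])
```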

Results and discussion

To the best of our knowledge, most methods introduced for the automated detection of Ki-67 have reported their results on datasets that are not publicly available. Therefore, in this study we not only present a publicly available dataset, but also introduce a backend for more precise estimation of the Ki-67 index. Our method also surpasses others by generating pixel coordinates in addition to cell class types, whereas other studies report only the prediction of cell nuclei or the Ki-67 score27,29. Since no previously reported method can be applied directly to our classification and detection benchmark, we compared the presented method with state-of-the-art methods used as backends in our pipeline and report the results accordingly. DeepLabV345, which has the best results on PASCAL VOC 201246, was used; its last layer was replaced with a 3-channel convolution layer with a linear activation function so that its outputs are comparable to PathoNet's. DeepLab has MobileNet and Xception implementations, and we provide results for both. Similar to PathoNet, FCRN-A and FCRN-B47 were designed for density estimation, but for a single class; to evaluate them as pipeline backends, their last layer was changed from one kernel to three kernels. Since the PathoNet backbone is similar to U-Net, we trained U-Net in the same setting as the other methods; however, U-Net proved to underfit and therefore could not be evaluated. Instead, following the approach presented by Zhou et al.48, a modified version of U-Net with batch normalization49 layers was evaluated. Evaluation results of the proposed pipeline with different backends are provided in Table 3.

Table 3 Cell detection and classification results.

Measurements. To evaluate our pipeline, we first define true and false predictions. An estimation counts as a true positive (TP) when the predicted center lies within R pixels of the corresponding ground-truth center; otherwise, it is marked as a false positive (FP). If more than one detected center of the same cell type lies within an R-pixel distance of the same ground-truth cell, the estimation with the lower distance counts as a TP and the others as FPs. Finally, cells that are defined in the ground truth but have no matching prediction count as false negatives (FN). With these definitions, the precision and recall formulas are shown in Eq. (2).

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \quad Recall = \frac{TP}{TP+FN} \end{aligned}$$
(2)

A model with a high recall and low precision detects most pixels as cell centers, but most detections are FPs. Conversely, a model with low recall and high precision detects few cells, but most detected cells are TPs. Neither case is useful; therefore, our goal is to develop a model that holds a trade-off between precision and recall. The F1 score, or harmonic mean, is an appropriate measure for this evaluation and can be calculated using Eq. (3).

$$\begin{aligned} \text{F1 score} = 2 \cdot \frac{Precision \cdot Recall}{ Precision + Recall} \end{aligned}$$
(3)
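The matching rule and metrics above can be implemented as in the following sketch; the greedy nearest-first matching is our interpretation of the rule, and R is left as a parameter.

```python
# Sketch of the R-pixel matching evaluation; greedy matching is our
# interpretation of the rule described in the text.
import numpy as np

def match_centers(pred, gt, radius):
    """pred, gt: lists of (x, y) centers of one class. Returns TP, FP, FN."""
    gt_used = [False] * len(gt)
    tp = 0
    for px, py in pred:
        dists = [np.hypot(px - gx, py - gy) for gx, gy in gt]
        # Try ground-truth centers from nearest to farthest.
        for i in np.argsort(dists):
            if dists[i] <= radius and not gt_used[i]:
                gt_used[i] = True
                tp += 1
                break
    return tp, len(pred) - tp, len(gt) - tp

tp, fp, fn = match_centers([(10, 12), (50, 50)], [(11, 11)], radius=5)
precision = tp / (tp + fp)                           # Eq. (2)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # Eq. (3)
```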

Considering the importance of precise estimation of both markers, we also evaluated the introduced pipeline with different backends in terms of TILs and Ki-67 index calculation, using the RMSE of the scores defined in Eqs. (4) and (5).

$$\begin{aligned}&\text{Ki-67 score}=\frac{Immunopositive}{Immunopositive+Immunonegative} \end{aligned}$$
(4)
$$\begin{aligned}&\text{TILs score} = \frac{Lymphocyte}{Lymphocyte + Immunopositive+Immunonegative} \end{aligned}$$
(5)

Also, since each raw image is cropped into smaller ones, we grouped all images belonging to a patient and classified the Ki-67 and TILs estimates into the cut-off categories presented in previous studies (Fig. 8). These aggregated-image classification results are presented under the TILs and Ki-67 cut-off accuracy columns of Table 4. As suggested by Saha et al.26, cases with Ki-67 scores below 16 percent are counted as less proliferative, between 16 and 30 as having an average proliferation rate, and higher than 30 as highly proliferative. For the TILs score, the cut-off ranges presented in the literature are 0–10%, 11–39%, and 40% or higher8. Table 4 compares the RMSE and accuracy of different backends in the proposed pipeline for Ki-67 and TILs scoring.
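Equations (4) and (5) and the cut-offs above reduce to a few lines of code; the category names are our labels for the ranges cited in the text.

```python
# Ki-67 and TILs scores (Eqs. 4 and 5, in percent) and the cited cut-offs;
# the category names ("low"/"average"/...) are our labels.
def ki67_score(pos, neg):
    return 100 * pos / (pos + neg)

def til_score(lym, pos, neg):
    return 100 * lym / (lym + pos + neg)

def ki67_category(score):
    # <16%: low; 16-30%: average; >30%: high proliferation (Saha et al.26).
    return "low" if score < 16 else "average" if score <= 30 else "high"

def til_category(score):
    # 0-10%, 11-39%, and >=40% ranges from the TILs literature8.
    return "low" if score <= 10 else "intermediate" if score < 40 else "high"

# Example: 40 immunopositive, 60 immunonegative tumor cells, 25 lymphocytes.
print(ki67_category(ki67_score(40, 60)))    # 40.0% -> "high"
print(til_category(til_score(25, 40, 60)))  # 20.0% -> "intermediate"
```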

Table 4 RMSE and aggregated cut-off accuracy.
Figure 8

From whole mount slide image preparation to Ki-67 and TILs estimation. The process starts by taking pictures (b) of a patient's tumor section (a) and cropping them into 256 × 256-pixel sub-images. Next, each cropped image (c) is fed to the proposed pipeline (d), resulting in predicted cell centers and corresponding class labels (e). Then, the Ki-67 index and TILs score can easily be calculated. Figure 10 depicts a prediction sample image of \(4912\times 3684\) pixels.

Quantitative results. As shown in Table 3, DeepLabv3-Xception performed best in Ki-67 immunopositive cell detection. However, our model outperforms the others in the detection of Ki-67 immunonegative cells in terms of precision and harmonic mean (F1 score). The pipeline using FCRN-B achieves better precision and harmonic mean for the lymphocyte class; in contrast, our model has a better recall rate, meaning that the pipeline using PathoNet detects more of the labeled lymphocytes than the other methods. Overall, the proposed backend performed best in terms of precision, having the fewest FPs, and outperformed the others in overall F1. Although DeepLabv3-Xception comes close to our method in overall F1, as shown in Table 5, it has 12 times more parameters than the proposed method. Thus, not only does DeepLabv3-Xception need more computational resources for training, but the proposed method can also be processed faster, providing a higher FPS while maintaining a better F1 score.

Table 5 Inference speed and model parameters.

Qualitative results. Qualitative results on the SHIDC-B-Ki-67 test set are elaborated in Fig. 9, which compares the detection and classification results of the proposed pipeline with different backends, alongside the ground truth and input images. As shown in Fig. 9, the proposed pipeline detects highly overlapping cells with different colors, sizes, and lighting conditions. Compared to the clinically required ROI, 256 × 256 images are small; therefore, in line with the quantitative aggregated patient scores presented in Table 4 and following the process in Fig. 8, qualitative results for the aggregated sub-images of a patient are presented in Fig. 10.

Figure 9

Qualitative results. Samples from the SHIDC-B-Ki-67 test set. Red points show Ki-67 negative tumor cells, blue points show Ki-67 positive tumor cells, and cyan dots show tumor infiltrating lymphocytes. The ground truth column shows the expert annotation.

Figure 10

Aggregated sub-image qualitative results. The input raw image is cropped into smaller sub-images which, after prediction, are aggregated to rebuild the input image. This sample is from the test set.

Limitations. Variation in tumoral cell size and color from patient to patient complicates the labeling and annotation of cells. Also, using only Ki-67-marked images for TILs scoring leads to the masking of Ki-67 positive TILs. This issue could be resolved by dual IHC staining, which was unavailable at our center. However, we believe the powerful deep feature extractors presented here have shown promising results in overcoming this issue. It should be noted that the recommended staining for the evaluation of TILs is hematoxylin and eosin (H&E); but since a counterstaining step with hematoxylin is performed during IHC staining, we expected the lymphocyte nuclei to be stained sufficiently, as recommended by guidelines.

Conclusion

In this article, we introduced a new benchmark for cell detection, classification, proliferation index estimation, and TILs scoring using histopathological images. We further proposed a new pipeline for cell detection and classification that can utilize different deep models for feature extraction, and evaluated it on immunonegative, immunopositive, and TILs cell detection in Ki-67-stained images. Also, we proposed a residual dilated inception module and designed a new light-weight backbone, called PathoNet, that achieved state-of-the-art results on the proposed dataset. The suggested pipeline showed high Ki-67 and TILs cut-off classification accuracy compared with all backends. Finally, we showed that designing PathoNet with RDIMs provides high accuracy while only slightly increasing the number of model parameters.