1 Introduction

Deep learning is a class of machine learning in which neural networks composed of multiple hierarchically arranged layers learn to process data and draw conclusions from it. It is used in applications such as speech and image recognition, bioinformatics, military systems and, most importantly, medical image analysis. It has the potential to transform the landscape of healthcare, and its application in healthcare is expected to grow. Deep learning is used alongside medical imaging for health checks and monitoring, and for the diagnosis and treatment of diseases and injuries. Medical image segmentation is another application of deep learning; it is used to identify organs or lesions in different modalities of medical images such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI) and ultrasound.

Fig. 1. Evolution of CNN architectures

Initially, edge detection filters and mathematical methods were used, after which deep learning came into predominant use alongside transfer learning. Later, 2.5-dimensional (2.5D) CNNs were introduced, offering a good balance between performance and computational cost. 3-dimensional (3D) CNNs then came into use and proved superior to 2.5D CNNs in terms of performance. Over time, various types of CNN architectures have evolved, as shown in Fig. 1.

2 Deep Learning Network Architectures and Related Work

2.1 Basic CNN (1989)

Fig. 2. Basic architecture of CNN [1]

The CNN is a well-known class of deep learning network. It is widely used in image segmentation, image classification, object detection and similar tasks. A CNN is also known as a ConvNet and is shift-invariant. CNNs are regularized versions of the Multi-Layer Perceptron (MLP); MLP networks are usually fully connected and are therefore prone to overfitting. The basic architecture of a CNN comprises five layers, as shown in Fig. 2.

Input Layer:

This layer is made up of artificial neurons that admit the initial data into the network for further processing.

Convolutional Layer:

The convolutional layer is composed of weights that must be trained for the application in which the network is used [2]. This layer is also responsible for feature extraction, which covers edges, objects, textures and scenes [3].

Pooling Layer:

The feature map dimensions obtained from the convolutional layer are reduced in the pooling layer.

Fully Connected Layer:

In this layer, each input from the previous layer is connected to every activation unit in the next layer.

Output Layer:

This layer is responsible for producing the final result of the segmentation, classification or relevant application.
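The five layers described above can be sketched as a single forward pass. This is a minimal illustration in NumPy, not any cited author's implementation: the 6 × 6 input, the fixed vertical-edge kernel, the ReLU non-linearity, the softmax output and all function names are assumptions chosen for the example (in a real CNN the kernel weights are learned).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def tiny_cnn_forward(image, kernel, fc_weights, fc_bias):
    """Input -> convolution (+ ReLU) -> pooling -> fully connected -> softmax output."""
    features = np.maximum(conv2d(image, kernel), 0.0)    # convolutional layer
    pooled = max_pool2d(features)                        # pooling layer
    logits = pooled.reshape(-1) @ fc_weights + fc_bias   # fully connected layer
    exp = np.exp(logits - logits.max())                  # numerically stable softmax
    return exp / exp.sum()                               # output layer

rng = np.random.default_rng(0)
image = rng.random((6, 6))                               # input layer: a toy 6x6 image
kernel = np.array([[1.0, 0.0, -1.0]] * 3)                # hypothetical vertical-edge filter
fc_weights = rng.normal(size=(4, 2))                     # 6x6 -> 4x4 -> 2x2 = 4 features, 2 classes
fc_bias = np.zeros(2)
probs = tiny_cnn_forward(image, kernel, fc_weights, fc_bias)
```

The pooling step halves each spatial dimension of the feature map, which is the dimensionality reduction attributed to the pooling layer above.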

The following authors have implemented variants of CNN for segmentation of tumors in the brain. R. Thillaikkarasi et al. [4] presented a novel kernel-based CNN combined with a modified Support Vector Machine (SVM) for efficient and automatic segmentation of brain tumors. In this work, spectrum mixing was included along with the kernel to increase flexibility during segmentation. Sidra Sajid et al. [5] introduced a patch-based hybrid CNN approach for detecting tumors in the brain that considers both local and contextual information. This method addressed the overfitting problem by using a dropout regulariser together with batch normalization. It also handled data imbalance using a two-phase training procedure and resulted in a Dice Score Coefficient (DSC) of 86%. Farheen Ramzan et al. [6] proposed a 3D CNN with residual learning and dilated convolutions to learn the end-to-end mapping from MRI volumes to brain segments at the voxel level. A DSC of 87 ± 3% was achieved across different datasets. Kumar et al. [7] also implemented a 3D CNN for brain tumor segmentation. The 3D CNN was preferred over a 2D CNN because the latter suffers a loss of quality in the input image due to compressed 2D image processing. The proposed 3D CNN consisted of five max-pooling layers, two fully connected layers and a softmax layer. Mostefa Ben Naceur et al. [8] proposed a CNN inspired by the occipito-temporal pathway, which includes a function known as selective attention that operates on the receptive field sizes to identify the crucial objects in the scene. This method was used for segmentation of brain tumors and yielded a DSC of 90% for tumor, 83% for tumor core and 83% for enhanced tumor. Nai Qin Feng et al. [9] proposed a deep CNN framework in a cascaded structure with a Conditional Random Field (CRF) for post-processing, which resolves the conflict between segmentation accuracy, network depth and the number of pooling layers in conventional CNNs. This method was used to segment tumors in the brain and yielded a DSC of 86%. Zhaohan Xiong et al. [10] proposed a dual FCN with 16 layers, called AtriaNet, for segmentation of the left atrium. This method yielded a DSC of 94%. Mamta Mittal et al. [11] proposed a combination of a Growing CNN (GCNN) and SVM for segmentation of brain tumors. The GCNN encodes properties of the inputs to improve the next step and reduce the number of parameters. This method yielded a PSNR of 96%. W. V. Deng et al. [12] proposed a fusion of a Heterogeneous CNN (HCNN) and a CRF, with the CRF developed as a Recurrent Regression NN (RRNN). This method divides the brain images into several slices and yielded a precision and recall of 96.5% and 97.8% respectively, by far the highest among the methods discussed.
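Most of the results in this review are reported as a DSC, and some as an Intersection over Union (IoU, also called the Jaccard index). As a hedged illustration of how these overlap scores are usually computed for binary masks, the sketch below uses the standard definitions DSC = 2|A ∩ B| / (|A| + |B|) and IoU = |A ∩ B| / |A ∪ B|. The function names, the toy masks and the convention of returning 1.0 for two empty masks are assumptions of this example; individual papers may average over classes or cases differently.

```python
import numpy as np

def dice_score(pred, target):
    """Dice score coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def jaccard_index(pred, target):
    """Intersection over Union (Jaccard index) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # same assumed convention as above
    return float(np.logical_and(pred, target).sum() / union)

# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dsc = dice_score(pred, target)      # 2*2 / (3+3)
iou = jaccard_index(pred, target)   # 2 / 4
```

The two scores are monotonically related by DSC = 2·IoU / (1 + IoU), so a DSC is always at least as large as the IoU for the same pair of masks; this is worth keeping in mind when comparing methods that report different metrics.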

Segmentation of Breast Tumors

Variants of CNN have also been used successfully for segmentation of breast masses. Ademola Enitan Ilesanmi et al. [13] proposed a DL-based segmentation technique specifically for breast tumors using VEU-Net (Variant Enhanced Block). This method yielded a DSC varying from 74% to 91% across datasets. Later, Mughad A. Al-Antari et al. [14] proposed a DL-based segmentation technique for mammograms using a full-resolution CNN, which yielded a DSC of 92% and an accuracy of 92.97%.

Segmentation of Thyroid Nodules

Researchers have recently applied DL to the segmentation of thyroid nodules as well. Jeremy M. Webb et al. [15] implemented a combination of a recurrent FCNN and DeepLab V3 for segmenting thyroid nodules from ultrasound images. This method yielded an Intersection over Union (IoU) of 42% for cysts, 53% for nodules and 73% for thyroid nodules. Viksit Kumar et al. [16] proposed a segmentation technique based on a prong CNN for segmenting the thyroid nodule, gland and cystic components. "Prong" refers to the network shape that results from splitting the architecture to generate multiple outputs. This method yielded a detection rate of 82% and 44% for thyroid nodules and cystic components respectively. Ngoc-Quang Nguyen et al. [17] proposed a deep CNN for segmenting boundaries in medical images. This network identifies boundaries using multiscale effective decoders. This method yielded an accuracy of 95 ± 3% for segmenting boundaries in different datasets, thus proving superior to the other variants.

Segmentation of Parenchyma

Research has also been carried out using CNN variants for segmentation in various other diseases, such as the 3D patch-based CNN implemented by Al-Louzi et al. [18] for parenchymal segmentation from MRI images of the brain. This network not only produced robust and accurate measurements of brain atrophy and segmentation of lesions in PML, but also proved clinically valuable and a step towards including standard forms of quantitative MRI measures in clinical therapies. The architecture of this CNN variant consisted of multiview feature pyramid networks and hierarchical residual blocks with embedded batch normalization and non-linear activation functions. Ying Chen et al. [19] introduced a dense deep CNN that incorporates popular optimization methods, namely dense blocks, batch normalization and dropout. This method was implemented to segment lung parenchyma and yielded an accuracy of 95%. J. Ramya et al. [20] introduced a technique for segmentation of the optic cup that combines a DNN with a hybrid particle swarm optimization technique and achieved superior performance, with a DSC of 98%.

Segmentation of Prostate

The following authors have used CNN models for segmentation of prostate carcinoma. Davood Karimi et al. [21] proposed a variant of CNN that involves two strategies to segment the prostate. The first is an adaptive sampling strategy, and the second is to use the disagreement within a CNN ensemble to identify uncertain segmentations and estimate a segmentation uncertainty map. Ke Yan et al. [22] proposed a Propagation DNN (P-DNN) for prostate segmentation. This method incorporates an optimal combination of multi-level feature extraction in a single model and yielded a DSC of 89.9 ± 2%. Massimo Salvi et al. [23] proposed a CNN-based method called RINGS (Rapid Identification of Glandular Structures) for effective detection and segmentation of glands in prostate histopathological images. This method yielded a DSC of 90%.

Segmentation of Cardiac Tissues

CNNs have proved successful in segmentation of cardiac images as well. Huaifei Hu et al. [24] proposed a combination of an FCN and a 3D Active Shape Model (ASM) to segment the right and left ventricles in cardiac MRI. The method yielded a Jaccard index (JI) of 89%. Hisham Abdeltawab et al. [25] proposed a segmentation technique for the left and right ventricles using a dual FCN, in which FC1 and FC2 were concatenated at the final output. This method achieved a DSC of 88% to 96% for different datasets.

Other Related ROI Segmentations

Furthermore, Tang et al. [26] implemented a multi-scale CNN for Selective Internal Radiation Therapy (SIRT) patients. The trained model was not efficient enough on SIRT data, which had low contrast due to reduced dosage as well as lesions differing greatly in density from their surroundings and abnormal liver shape or positioning. Ryu et al. [27] introduced a CNN made up of an encoder and inference branches that was used jointly for segmentation and classification. This network takes as its input data stream the combination of an input image and the corresponding Euclidean distance maps of the foreground and background. However, several drawbacks were reported: the method does not cover all kinds of machines available in clinical trials, and hence results cannot be obtained for the machines left out. This resulted in a low Jaccard score index (JSI) of 68%, and the method did not prove very efficient on hepatic lesions. Terapap Apiparakoon et al. [28] proposed a modified CNN model with an FPN for bone lesion segmentation. A ladder FPN was introduced into the top-down pathway to semi-supervise the network training, and an additional layer was included to extract global features. This method yielded an F1 score of 84% and a precision of 85%. Kurnianingsih et al. [29] proposed a Mask R-CNN based technique for segmentation of cervical cells. This method used Resnet-10 as a backbone and yielded a precision of 92% and a recall of 91%. Lee et al. [30] also used a 3D CNN for the detection of plaque in major calcifications and obtained an F1 score of 92%. Nudrat Nida et al. [31] proposed a Region-based CNN (RCNN) in combination with fuzzy C-means clustering (FCM) for detecting and segmenting lesions in melanoma. In this method, the CNN resolved the insufficient sample problem and FCM extracted affected patches with variable boundaries that aid in disease recognition. This method achieved an F1 score of 95% and an accuracy of 94%.
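Several of the works above report precision, recall and F1 score rather than a DSC. As a brief reminder of how these are related, the following sketch computes them from true-positive, false-positive and false-negative counts using the standard definitions. The function name and the counts are hypothetical and are not taken from any of the cited papers.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw detection or pixel counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 8 correct detections, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

When the counts are taken over the pixels of a binary mask, the F1 score equals 2·TP / (2·TP + FP + FN), which is the same quantity as the DSC, so pixel-level F1 and DSC values are directly comparable.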

Tariq Mahmood Khan et al. [32] proposed a CNN-based network for retinal vessel segmentation. This network was a residual connection based encoder-decoder network, and the architecture is capable of retaining and exploiting low-level semantic edge information for robust vessel segmentation. The method yielded an accuracy of 96 ± 1% for different datasets. Veena et al. [33] introduced an optic disc and cup segmentation technique for the diagnosis of glaucoma, which yielded an accuracy of 97% using a modified CNN consisting of 39 layers, including nineteen convolutional layers, four max-pooling layers, eleven dropout layers and a single merger layer.

Each of the CNN variants discussed above has its own advantages and disadvantages. The most common advantage of CNN is that, unlike its predecessors, it does not require intense human supervision for feature detection. Its drawbacks include the need for a large training dataset and the inability to encode the position and orientation of the object.

2.2 Alexnet (2012)

Fig. 3. Architecture of Alexnet [34]

The architecture of AlexNet consists of eight layers, comprising five convolutional layers and three fully connected layers, as displayed in Fig. 3. It is not the layers that make AlexNet special, however. Instead, AlexNet introduced a series of features that became new approaches in CNN frameworks and made it stand out from the rest. AlexNet replaced the tanh function, which was standard at the time, with Rectified Linear Units (ReLU). It was designed to run on multiple GPUs, which enabled the training of bigger models in less time. Data augmentation and dropout were also deployed in AlexNet to counter overfitting.
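Two of the features mentioned above, ReLU and dropout, can be illustrated in a few lines of NumPy. This is a sketch under stated assumptions rather than AlexNet's actual code: the function names are hypothetical, and the dropout shown is the "inverted" variant that rescales the surviving units during training, which is common in current frameworks but is a choice made for this example.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative inputs are clipped to zero."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
relu_out = relu(x)

# Gradients with respect to the input: tanh saturates for large |x|,
# whereas ReLU keeps a gradient of 1 for every positive input.
tanh_grad = 1.0 - np.tanh(x) ** 2
relu_grad = (x > 0).astype(float)

rng = np.random.default_rng(0)
dropped = dropout(np.ones(10_000), rate=0.5, rng=rng)  # about half the units are zeroed
```

The non-saturating gradient is the usual explanation for why ReLU networks train faster than tanh networks, and randomly silencing units during training discourages co-adaptation, which is how dropout reduces overfitting.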

Lu et al. [35] proposed an improved AlexNet for detection and segmentation of abnormalities in the brain from magnetic resonance images. The last few layers of this improved AlexNet were replaced with an extreme learning machine, which in turn was enhanced by a modified chaotic bat algorithm to attain improved generalization. Chen et al. [36] implemented a 3D framework based on the classic AlexNet to segment and reconstruct prostate tumor medical images with adaptive improvement. This method yielded an accuracy of 92%. In comparison with conventional segmentation and depth segmentation methods, the proposed network also performed very well with respect to training time, the number of parameters and network performance. AlexNet has mostly been preferred for feature extraction rather than for segmentation, and has yielded significant feature extraction results [37]. The disadvantage of AlexNet is that, because the model is not particularly deep, it has difficulty capturing all attributes, which limits the performance of the resulting models.

2.3 Resnet (2015)

Fig. 4. Residual block [38]

ResNets are made up of residual blocks, as shown in Fig. 4. The ResNet was originally introduced to rectify the vanishing gradient issue faced by earlier neural networks. It uses a 34-layer basic network architecture, influenced by VGG-19, into which shortcut connections are introduced as in Fig. 5. These shortcut connections turn the plain network into a residual network. They are called skip connections and are the core of the residual block. Where the dimensions increase, the skip connections are padded with extra zeros so that their dimensions match. The skip connections in ResNet help to overcome vanishing gradients by providing an additional path through which the gradient can flow. Residual networks are widely preferred in classification applications but have been used in certain segmentation applications as well.
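The residual block and its zero-padded skip connection can be sketched as follows. This is a simplified illustration that uses dense (matrix) layers instead of convolutions and omits batch normalization; the function names, shapes and weights are assumptions of the example, and it only handles the case where the output has at least as many features as the input.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + shortcut(x)), where F is two linear layers with a ReLU between."""
    fx = relu(x @ w1) @ w2                       # residual function F(x)
    shortcut = x                                 # identity skip connection
    extra = fx.shape[-1] - x.shape[-1]
    if extra > 0:
        # Pad the skip path with zeros so its feature dimension matches F(x).
        pad_width = [(0, 0)] * (x.ndim - 1) + [(0, extra)]
        shortcut = np.pad(x, pad_width)
    return relu(fx + shortcut)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, 0.5, 0.5, 0.5]])             # batch of 2 samples, 4 features
w1 = np.zeros((4, 4))                            # with zero weights, F(x) = 0 ...
w2 = np.zeros((4, 6))                            # ... and the output has 6 features
y = residual_block(x, w1, w2)                    # the block reduces to the padded identity
```

With all weights at zero the block passes its (non-negative) input through unchanged, which shows why the skip path gives gradients a direct route through the network: each block only has to learn a correction to the identity.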

A ResNet-50 based Mask R-CNN was implemented by Jeevakala et al. [39] to segment the internal auditory canals and their nerves. The localization results yielded an accuracy of 79%. Song Guo et al. [40] proposed a combination of ResNet-101 and VGG-16 for segmentation of retinal vessels. This was a multi-scale network and yielded an F1 score of 82 ± 2%. Similarly, ResNet-101 was selected as the foundation of the Mask R-CNN proposed by Zhao et al. [41], where an identity mapping block was used to rectify the degradation issue and facilitate training of the deeper network.

In terms of performance measures, ResNet-based deep learning models have produced better results in classification applications than in segmentation applications. Liu et al. [43] proposed a feature pyramid Mask R-CNN based on ResNet to segment cervical nuclei. In this method, pixel-level information was used beforehand to provide supervisory information for training the Mask R-CNN. The precision and recall were both 96%. One significant disadvantage of ResNet is that deeper networks usually require more training time.

Fig. 5. Resnet architecture [38]

2.4 U-net (2015)

Fig. 6. Basic architecture of U-net [43]

The U-net architecture evolved from the traditional CNN in 2015 for biomedical images. The network is symmetric, as represented in Fig. 6. Its two major parts are the contracting path on the left, which follows the general convolutional process, and the expansive path of 2D convolutional layers on the right. In the expansive path, pooling operations are replaced by upsampling operators with multiple feature channels to increase the resolution of the output. Different variants of U-net have been deployed by researchers around the world for various medical segmentation tasks, as discussed below.
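The interplay between the contracting and expansive paths can be illustrated with array shapes alone. The sketch below is an assumption-laden simplification, not the original U-net: it omits the convolutions entirely, uses nearest-neighbour upsampling in place of a learned up-convolution, and uses hypothetical function names. It shows one level of downsampling, the matching upsampling, and the channel-wise concatenation with the encoder features of the same resolution.

```python
import numpy as np

def max_pool(x, size=2):
    """Max pooling over a (channels, height, width) array; H and W must divide by `size`."""
    c, h, w = x.shape
    return x.reshape(c, h // size, size, w // size, size).max(axis=(2, 4))

def upsample(x, factor=2):
    """Nearest-neighbour upsampling (a stand-in for a learned up-convolution)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def skip_merge(encoder_features, decoder_features):
    """Upsample the decoder features and concatenate them with the encoder features."""
    up = upsample(decoder_features)
    return np.concatenate([encoder_features, up], axis=0)

enc = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # encoder output: 2 channels, 4x4
bottleneck = max_pool(enc)                                # contracting path: 2 channels, 2x2
merged = skip_merge(enc, bottleneck)                      # expansive path: 4 channels, 4x4
```

After the merge, the decoder sees both the coarse, upsampled features and the fine-grained encoder features, which is what allows U-net style networks to localise boundaries precisely at full resolution.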

Segmentation of Optic Regions

U-net has been used for segmentation of various regions of the eye. Pan et al. [44] introduced a modified U-net for retinal vessel segmentation in fundus images. As the traditional U-net was not deep enough, this network was proposed to combine the output of the convolutional layer with that of the deep CNN in the residual network under extreme depth conditions. Zheng qiang Jiang et al. [45] proposed a coronary vessel segmentation network based on U-net. This network used multiple resolutions and multiple scales, because the traditional U-net applies only a single convolution operation to a single-scale image and hence does not provide accurate segmentation. This method yielded a DSC of 79%. Xioming Liu et al. [46] proposed a modified U-net for segmentation of fluids in retinal optical coherence tomography (OCT) images. This variant has an automated attention mechanism that locates the fluid region, avoiding the excessive computation of multi-stage methods. In addition, dense skip connections combine the high-level and low-level features, making the segmentation results more precise. This method achieved a DSC of 80 ± 2% for images from different devices. Sang Yoon Han et al. [47] proposed a segmentation technique inspired by U-net for detection of the pupil centre. In this model, the complexity of the U-net is reduced by decreasing the number of channels and levels. This network achieved a detection rate of 87.3%. Bilel Daoud et al. [48] proposed a segmentation technique for nasopharyngeal carcinoma that was inspired by U-net. The proposed method consisted of two CNN-based systems using overlapping patches, one with fixed sizes and one with different sizes, and yielded a DSC of 85% to 91% for axial, coronal and sagittal sections. Khaled Alsaih et al. [49] proposed a method for retinal fluid segmentation using SegNet, which resembles the U-net architecture except that the encoder path is replaced by a VGG-16 network. This method yielded a DSC of 92%. Mangipudi et al. [50] proposed an improved U-net to segment the optic disc and cup in glaucomatous images. In contrast to the original U-net architecture, this network used only half the number of filters in each convolutional layer, and the input size was kept small to reduce the number of parameters used during training. As a result, the computational time required for training was reduced significantly, and the method yielded a DSC of 93% and 95% for cup and disc segmentation respectively. Bhargav J. Bhatkalkar et al. [51] proposed a segmentation technique for the optic disc in fundus images. This method combines DeepLab V3+ and U-net by incorporating an attention module between the encoder and decoder to improve accuracy. It achieved a DSC of 95 ± 2% for different datasets. Monsumi et al. [52] implemented an iris segmentation method using an interactive variant of U-net that includes squeeze and expand modules, with the aim of reducing training time and storage by reducing the number of parameters. This method resulted in a DSC of 98%. Shuang Yu et al. proposed a robust optic disc segmentation network based on U-net with ResNet-34 encoding layers. This method yielded an accuracy between 84% and 97% for different datasets.

Segmentation of Various Tumours

Researchers have also used U-net for segmentation of various tumours. A deeper 14-layer U-net model consisting of 26 blocks of VGG19 encoders with ImageNet pretraining was implemented by Lu et al. [53]. This method resulted in a DSC of 86% for segmenting the tumour mask, 76% for segmenting the contour of the tumour and 66% for segmenting the contour of the tumour after Gaussian smoothing. Yong Zhou Lu et al. [54] implemented a U-net based DL model with a VGG-16 encoder pretrained on ImageNet to segment tumours in PET images. This network yielded a DSC of 86% for the tumour mask and 76% for the tumour contour. Manhoor Ali et al. [55] introduced a model combining a 3D CNN and U-net to segment brain tumours from MRI images. This method replaced the ReLU activation function with leaky ReLU and produced a DSC of 75%, 90% and 84% for enhancing tumour, whole tumour and tumour core respectively. In this method, 3D asymmetric kernels were used for convolution and a flat stride was used for pooling to tackle anisotropic spacing. Mohamed A. Naser et al. [56] implemented a U-net model with one transposed convolutional layer instead of max-pooling in the decoding part for segmentation of brain tumours, which yielded a DSC of 92%. Tran et al. [57] proposed the combined use of U-net and Un-net for segmenting liver tumours. The Un-net was designed so that the skip connection path, pooling path and up-convolutional path are replaced in the node structure. In the redesigned structure, the features in the node of the output layer are joined with the next node as well as with the encoder node at the same level. This method yielded a DSC of 96.5% and 73% for liver and liver tumour segmentation respectively. Zhenxi Zhang et al. [58] proposed a U-net like model for segmenting 3D MRI images of the brain. Later, Tao Lei et al. [59] proposed an enhanced U-net model named DefED-Net (Deformable Encoder-Decoder Network) for liver and liver tumour segmentation. This method avoids the loss of spatial contextual information by employing deformable convolution with residual structures to generate feature maps. The model yielded a DSC of 96%, which compares favourably with the other methods discussed.

Segmentation of Blood Vessels

U-net has also proved efficient in segmentation of blood vessels. Manuel E. Gegundez-Arias et al. [60] proposed a simplified U-net combined with residual blocks and batch normalisation in the up-scaling and down-scaling phases. This model was used to segment blood vessels in retinal images and achieved an accuracy of 95% ± 1% on different datasets. Enda Boudegga et al. [61] proposed a DL network for segmenting blood vessels by extending the well-known U-net. In this method, the standard convolutional layers were replaced with Lightweight Convolutional Modules (LCM) in order to reduce computation. This method yielded a superior accuracy of 97%.

Segmentation of Cardiac Diseases

Various parts and diseases of the heart have been effectively detected and segmented using U-net. Gurpreet Sing et al. [62] proposed the use of the traditional U-net for automatic segmentation of cardiac CT images and achieved an overall accuracy of 73%. Can Xiao et al. [63] implemented an improved 3D U-net based on FCN for coronary artery segmentation. The upper part of the FCN was modified to enable propagation of information to higher-resolution layers. This method yielded a DSC of 82%. Lohendran Baskaran et al. [64] proposed a U-net based segmentation of cardiovascular structures from cardiac CT images and also achieved a DSC of 82%. Lu et al. [65] proposed a ringed residual U-net for pancreatic segmentation. With the ring residual module, this method yielded excellent results via deep convolution and can consolidate the characteristics of traditional deep learning networks. This network yielded a DSC of 88.32 ± 2.84%. Tao Liu et al. [66] proposed a U-net based RCNN to segment heart diseases from cardiac images efficiently and obtained a DSC of 86% to 95% for different sections of the heart.

Segmentation of Brain Tissues and Tumours

U-net has also been used successfully in brain tissue segmentation tasks. Sil C Van De Leemput et al. [67] proposed an FCNN-based U-net model for brain tissue segmentation. In addition to the traditional U-net structure, shortcuts were added over every two convolutional layers, as they speed up convergence and increase overall performance; the model achieved a DSC of 87%. Nagaraj Yamanakumar et al. [68] proposed a brain tissue segmentation model known as M-net, which was inspired by U-net. M-net consists of two side paths and two main encoding and decoding paths, which aid feature learning. This method produced an accuracy of 94 ± 2%. Fan Zhang et al. [69] proposed a brain tissue segmentation technique using a 2D U-net with a novel augmented target loss function to increase accuracy at tissue boundaries. This method yielded a high accuracy of 95 ± 2%.

Other Related ROI Segmentations

Variants of U-net have also been employed effectively in several other medical segmentation tasks.

A combination of U-net and its U-net++ variant was proposed by Jonmohamadi et al. [70] to automatically segment multiple structures in knee arthroscopy. In U-net++, the plain skip connections are replaced by nested, dense skip connections to obtain a more efficient architecture. This modification was made to reduce the semantic gap between the feature maps of the encoder and decoder. The U-net++ model yielded a DSC of 0.79, 0.50, 0.51 and 0.48 for segmentation of the femur, tibia, anterior cruciate ligament and meniscus respectively. Yuli Sun Hariyani et al. [71] proposed a dual attention based U-net variant for nailfold capillary segmentation. This model, named DA-CapNet, improved the U-net architecture by including a dual-attention module that captures feature maps more efficiently, yielding an IoU of 64% and a precision of 77%. Chen et al. [72] introduced the U-net plus variant, which was used to segment the esophagus and esophageal cancer. In this variant, two blocks were introduced to optimise feature extraction of complex and abstract information and to resolve irregular, vague boundaries. The DSC obtained using U-net plus was 79%. A dense U-net model was introduced by Li et al. [73] for segmentation of mammographic masses. This variant of U-net combines a densely connected CNN with attention gates: the encoder is a densely connected CNN, whereas the attention gates are placed at the decoder end. This network produced an F1 score of 82.24%.

Sebastin Stenman et al. [74] introduced a U-net with a ResNet backbone pretrained on ImageNet to segment leukocytes, which yielded an IoU of 82%. Shyam Lal et al. [75] proposed a nuclei segmentation method for liver cancer detection using a modified U-net. This model, known as NucleiSegNet, includes a residual block of convolutional layers that aims to obtain high-level semantic features. This method yielded an F1 score of 83% and a JSI of 72%. Yesenia Gonzalez et al. [76] proposed a sigmoid colon segmentation network based on U-net. The proposed network combined 2D and 3D operations. The DSC obtained using this network was 82% ± 6%. Xieli Li et al. [77] proposed a dual U-net based network for segmentation of overlapping nuclei. This method uses a multi-task learning network in which the boundary and region information helps to improve the segmentation accuracy of glioma nuclei, especially overlapping ones, and yielded an F1 score of 82%. Junlong Chen et al. [78] proposed a variant of U-net for aortic dissection with an encoder block that has an enhanced feature representation capability compared with the traditional one. This method achieved an accuracy of 85%.

Chanbo Huang et al. [79] employed a modified U-net for segmentation of cell images. This variant combined the advantages of U-net and ResNet in one module and yielded an accuracy of 97% and an IoU of 84%. Bing Bing Zheng et al. [80] introduced the Multi-scale Discriminative Net (MSD-Net), inspired by the U-net model. This variant was used to segment lung infections through four stages of operation. Its components include feature map scaling, a global average pooling layer to extract semantic consistency from the encoder and a pyramid convolutional block to capture multi-scale information. This method achieved a sensitivity of 82% to 86% for three different infections. Amine Amyar et al. [81] proposed a variant of U-net that uses convolutional layers with stride = 2 to replace pooling and maintain spatial information. The number of filters was also increased from 64 to 1024. This method yielded a DSC of 88%.

Catherine P. Jeyapandian et al. [82] proposed a network for segmenting histologic structures in the kidney cortex. This network was inspired by the conventional U-net, with slightly tweaked parameters. The F1 scores obtained varied from 81% to 91% for various structures. Duo Wang et al. [83] implemented a 3D U-net for segmentation of pulmonary nodules and achieved an accuracy between 72% and 91% for different tasks. Van-Truong Pham et al. [84] proposed a DL network for segmentation of tympanic membranes from otoscopic images. This network, known as Ear U-net, was based on three design choices: EfficientNet was used as the encoder, attention gates were used for the skip connections, and residual blocks were used for the decoder. The DSC achieved was 92%.

Zhang et al. [85] segmented epicardial fat using dual U-nets and a morphological processing layer. The function of the morphological layer is to identify the pericardium accurately. The first U-net focuses solely on the detection of the pericardium, and the second U-net locates and segments the epicardial fat. This dual network-based design yielded a DSC of 91.19%.

Qi-Zhang et al. [86] proposed an epicardial fat segmentation network using dual U-nets with a morphological processing layer. The first U-net was used to refine and obtain the inside region of the pericardium, and the second acted as a backbone for segmentation. This method yielded a DSC of 91%. Francesco Marzola et al. [87] proposed a segmentation technique for transverse musculoskeletal ultrasound images. This DL network was an ensemble NN that combined the predictions of U-net, U-net++, FPN and AttentionNet. This method yielded a precision of 88% and a recall of 92%. Lian Ding et al. [88] proposed a lightweight U-net variant for segmentation of pediatric hand bones. This model contained a reduced number of up-sampling and down-sampling operators as well as kernels. This method yielded a DSC of 92.9%.

A lightweight U-net model was introduced by Ding et al. [89] for segmentation of pediatric hand bones. Multiple filters with different kernel sizes were deployed along with two down-sampling operators and two up-sampling operators. This network yielded a DSC of 93.1% in the segmentation of pediatric bones. Javier Civit Mascot et al. [90] proposed a TPU (Tensor Processing Unit) cloud-based U-net model for segmentation of eye fundus images and achieved a DSC of 94%. Tawsifur Rahman et al. [91] proposed a DL network for detection and segmentation of tuberculosis in chest X-ray images using two U-net models. The modified U-net includes a bi-directional convolutional long short-term memory that combines feature maps. This method yielded an accuracy and F1 score of 96%. Guodong Zeng et al. [92] proposed an LP-Unet for segmentation of hip joints in MRI images. In this network, holistic decomposition convolution and dense up-sampling convolution were applied at the beginning of the 3D U-net. The main advantage of LP-Unet is that it reduces GPU memory consumption. This method obtained a DSC of 97 ± 2%. Al-Kofahi et al. [93] used a combination of U-net and the MXNet library to quantify pixel-level predictions for a number of classes.

U-net networks not only proved efficient in medical image segmentation but also produced significant results in image reconstruction and pixel regression. The disadvantage of U-net topologies is that learning may slow down in the middle layers of deeper models, putting the network at risk of ignoring the layers that represent abstract features.

2.5 Volumetric Convolution Network (V-net, 2016)

Although the V-net was inspired by the U-net, the two architectures differ. The left portion consists of the compression path and the right portion consists of the de-compression path, which restores the signal to its original size. Each portion is divided into stages that operate at different resolutions. Each stage contains one to three convolutional layers and learns a residual function, and pooling operations are replaced by convolutions. The convolutional layers use volumetric kernels of 5 × 5 × 5 voxels, as displayed in Fig. 7. The PReLU non-linear activation function is used in the left portion, and down-sampling is performed to increase the receptive field. The right portion, on the other hand, performs deconvolution operations to increase the size of its input. Some features are shared by both portions, such as the number of convolutional layers, and the last convolutional layer produces an output of the same size as the input. There are very few implementations of V-net by researchers; these are discussed below, and more works are expected in the future.
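
The way the compression and de-compression paths change the volume size can be traced with simple shape arithmetic. The sketch below is illustrative only. The 5 × 5 × 5 kernel comes from the description above, while the padding of 2, the 2 × 2 × 2 stride-2 down-sampling and deconvolution kernels, the 128 × 128 × 64 input and the five stages are assumptions chosen for the example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Output length of an unpadded transposed convolution along one axis."""
    return (size - 1) * stride + kernel


def vnet_shapes(volume, stages=5):
    """Trace per-axis sizes through the compression and de-compression paths."""
    down = [tuple(volume)]
    for _ in range(stages - 1):
        # A 5x5x5 convolution with padding 2 (assumed) keeps the size unchanged.
        same = tuple(conv_out(s, kernel=5, padding=2) for s in down[-1])
        # A 2x2x2 convolution with stride 2 (assumed) halves each axis in place of pooling.
        down.append(tuple(conv_out(s, kernel=2, stride=2) for s in same))
    up = [down[-1]]
    for _ in range(stages - 1):
        # A 2x2x2 deconvolution with stride 2 (assumed) doubles each axis.
        up.append(tuple(deconv_out(s, kernel=2, stride=2) for s in up[-1]))
    return down, up


down, up = vnet_shapes((128, 128, 64))
for shape in down:
    print("compression   ", shape)
for shape in up[1:]:
    print("de-compression", shape)
```

Under these assumptions the volume shrinks from 128 × 128 × 64 to 8 × 8 × 4 along the compression path and returns to 128 × 128 × 64 at the end of the de-compression path, which is why the final output can match the input size.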

Fig. 7. V-net architecture [94]

Gibson et al. [95] introduced a dense V-net for segmentation of 8 organs in the abdominal region, such as the stomach, duodenum, left kidney, liver, spleen, gallbladder and pancreas. Dense V-net differs from the V-net in certain ways. The down-sampler consisted of three dense feature stacks connected by down-sampling strided convolutions. Every skip connection was a convolution of the associated stack output, and the up-sampler used bilinear up-sampling. The memory efficiency of the feature stacks and spatial dropout enable deep networks at high resolutions, which is an advantage when segmenting smaller structures. Caixia Dong et al. [96] proposed a V-net based 3D DL network known as Di-Vnet for segmenting coronary arteries. It operates in two stages: cardiac segmentation, followed by coronary artery segmentation (CAS). This method achieved a DSC of 90 ± 1% for different datasets. Zeng et al. [97] implemented the V-net architecture for fetal image segmentation. V-net was combined with a multi-scale loss function, where the V-net was used for the attention mechanism and the multi-scale loss function was used for deep supervision. This combination produced significant results and yielded a DSC of 97.93%.
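
To illustrate the general idea of a multi-scale loss used for deep supervision, the sketch below sums a Dice loss over predictions taken at several resolutions. It is a generic illustration and not the loss of Zeng et al. [97]. The soft Dice formulation, the nearest-neighbour down-sampling of the reference mask, the scale weights and all function names are assumptions made for the example.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice between predicted probabilities and a binary target."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def downsample(mask, factor):
    """Nearest-neighbour down-sampling of a flat mask (keeps every factor-th value)."""
    return mask[::factor]


def multi_scale_loss(outputs, target, weights):
    """Weighted sum of Dice losses; outputs[i] is predicted at 1/2**i resolution."""
    total = 0.0
    for i, (probs, w) in enumerate(zip(outputs, weights)):
        total += w * soft_dice_loss(probs, downsample(target, 2 ** i))
    return total


# Toy 1-D example: a full-resolution target and predictions at three scales.
target = [1, 1, 1, 1, 0, 0, 0, 0]
perfect = [[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [1.0, 0.0]]
blurry = [[0.9, 0.8, 0.6, 0.4, 0.3, 0.1, 0.0, 0.0], [0.8, 0.5, 0.3, 0.0], [0.6, 0.2]]
weights = [1.0, 0.5, 0.25]  # assumed weights: coarser scales contribute less

loss_perfect = multi_scale_loss(perfect, target, weights)
loss_blurry = multi_scale_loss(blurry, target, weights)
print(loss_perfect, loss_blurry)
```

Because every scale contributes its own loss term, intermediate layers receive a direct training signal, which is the motivation usually given for deep supervision.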

3 Discussion and Conclusion

Table 1. Summary of high-performance DL networks

Fig. 8. Distribution of papers

Medical image processing using deep learning is a vast, interesting and challenging research area that brings together the medical and computing fields. This survey covers recent works involving the widely used deep learning networks in medical image segmentation, distributed as shown in Fig. 8. Researchers around the world have been introducing and implementing variants of DL networks derived from the standard DL architectures, aiming to improve performance and rectify the drawbacks of existing networks. Such research contributes to the advancement of healthcare and assists radiologists in precise diagnosis. This paper summarises the standard network architectures of CNN, Alexnet, Resnet, U-net and V-net, and the related works that implement their variants, along with a performance study and a comparison chart (Table 1). From the study, we find that U-net is the most preferred and widely used network for segmentation of medical images due to its high performance measures. We conclude that, through collaborative research between computer vision techniques and DL techniques, the medical field can draw huge benefits.