Few-shot segmentation with duplex network and attention augmented module

Zeng, Sifu; Yang, Jie; Luo, Wang; Ruan, Yudi

doi:10.3389/fnbot.2023.1206189

ORIGINAL RESEARCH article

Front. Neurorobot., 21 June 2023
Volume 17 - 2023 | https://doi.org/10.3389/fnbot.2023.1206189

Few-shot segmentation with duplex network and attention augmented module

Sifu Zeng¹^*

Jie Yang²

Wang Luo³

Yudi Ruan²

¹School of Economics and Management, Chongqing Jiaotong University, Chongqing, China
²School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing, China
³College of River and Ocean Engineering, Chongqing Jiaotong University, Chongqing, China

Establishing the relationship between a limited number of samples and segmented objects in diverse scenarios is the primary challenge in few-shot segmentation. However, many previous works overlooked the crucial support-query set interaction and the deeper information that needs to be explored. This oversight can lead to model failure when confronted with complex scenarios, such as ambiguous boundaries. To solve this problem, a duplex network that utilizes the suppression and focus concept is proposed to effectively suppress the background and focus on the foreground. Our network includes dynamic convolution to enhance the support-query interaction and a prototype match structure to fully extract information from support and query. The proposed model is called dynamic prototype mixture convolutional networks (DPMC). To minimize the impact of redundant information, we have incorporated a hybrid attentional module called double-layer attention augmented convolutional module (DAAConv) into DPMC. This module enables the network to concentrate more on foreground information. Our experiments on PASCAL-5i and COCO-20i datasets suggested that DPMC and DAAConv outperform traditional prototype-based methods by up to 5–8% on average.

1. Introduction

Deep convolutional neural networks have made significant strides in semantic segmentation. However, most high-performing models require a large number of pixel-level annotated training images. This annotation process is not only expensive but is also cumbersome, thereby posing challenges in obtaining enough samples in some scenarios. Consequently, achieving generalization across different scenarios becomes challenging. In light of this, few-shot learning, which aligns more closely with cognitive learning, is likely to become the primary focus of deep learning in the future. Few-shot segmentation involves the use of a learned feature representation from training images to segment a query image. However, this task remains a challenge when the object category falls outside the sample range and a significant variation in appearance and pose exists between the objects in the support and query images.

Shaban et al. (2017) contributed an initial approach to semantic segmentation with few samples and introduced the concept of “prototype.” Prototype-based methods are currently considered advanced in few-shot learning. This approach emphasizes the weight vector, which is computed through global average pooling guided by the ground truth mask in the embedded feature map. This vector effectively condenses discriminative information across feature channels, making it easier to compare features between support and query images for semantic segmentation.

However, many challenges are still encountered in the research of few-shot segmentation. The use of a single prototype for few-shot learning can result in semantic ambiguity and deteriorate feature distribution. Relying solely on a single prototype and simple operations for prediction can result in loss of inherent object details in the query image. Additionally, when large variation in appearance or scale of the object in few-shot learning is observed, making predictions based solely on support information can become difficult. Furthermore, the segmentation failure of ambiguous boundaries is also an existing problem in the few-shot segmentation task at this stage.

Recent advancements in techniques, such as feature boosting, prototype alignment, and iterative mask refinement, have addressed the aforementioned challenges effectively. CANet (Zhang et al., 2019) employs an iterative optimization module to merge query and support features in an optimized manner. Prototype mixture models (PMMs) (Yang et al., 2020) combine prototype mixture and duplex manner to fully exploit channel semantic and spatial semantic information. SCL (Zhang et al., 2021) utilizes a self-guided mechanism to generate an auxiliary feature prototype. ASGNet (Li et al., 2021) is designed to adaptively partition the support features into multiple feature prototypes and subsequently select the most relevant prototype for matching with the query image. CRCNet (Liu et al., 2022) presents a solution to address semantic ambiguity and feature distribution issues by introducing cross reference. This approach involves multiple interactions between support sets and query sets to improve their overall performance. However, these approaches become extremely fragile in terms of segmentation capability when facing more complex situations, such as ambiguous boundaries in few-shot segmentation tasks. When solving problems in ambiguous boundaries, starting with just the foreground can be challenging. The effective utilization of the duplex network and thorough mining of information allows the model to establish stronger relationships between the support and query sets with minimal samples, ultimately leading to improved segmentation accuracy.

Our research draws inspiration from the foreground–background and duplex modes utilized in PMMs. By utilizing the duplex mode, we can effectively utilize channel semantic and spatial semantic information to its fullest potential, as shown in Figure 1. This approach can enhance the accuracy of the image segmentation process in complex scenarios where the foreground and background have similar characteristics. However, we observed that the duplex mode in PMM only utilizes features that are extracted from the backbone network, indicating that the full potential of this mode remains untapped. Additionally, in-depth research on this mode is lacking in current studies. To gain a deeper understanding of the duplex manner, we plan to develop a new attention model and enhance the existing duplex mode through further investigation and exploration.

FIGURE 1

Figure 1. Visualization of duplex networks and previous networks.

In this paper, we propose a novel approach called dynamic prototype mixture convolutional network (DPMC) inspired by the baseline method. Our method improves the duplex strategy used in the baseline by incorporating a prototype match structure to fully exploit the information in the support and query images. Additionally, we use channel information and spatial semantic information to segment the query image. To achieve sufficient support–query interaction, we introduce dynamic convolution in DPMC. Specifically, we apply kernel generation to produce different convolution kernels, which are applied with convolutions of different receptive fields to extract more image information. To enhance the segmentation performance of DPMC, we designed a double-layer attention augmented convolutional module (DAAConv). This module efficiently acquires contextual information, focuses on important regions, and removes redundant information. The attention module designed in this work effectively improves DPMC's ability to focus on the foreground, which results in enhanced segmentation performance. In conclusion, our experiments on the Pascal and COCO datasets have shown that the combination of DAAConv and DPMC significantly improves the baseline. Additionally, we conducted ablation experiments, which demonstrate that DAAConv enhances the duplex mode and DPMC outperforms the baseline.

The main contributions of our work are summarized as follows:

1. DPMC, which utilizes a duplex approach of suppressing the background and emphasizing the foreground, is presented in this study. Specifically, our proposed method is effective for addressing complex segmentation tasks with indistinct boundaries.

2. To improve the performance of duplex mode, DAAConv has been designed. This module can efficiently obtain contextual information and focus on important regions, ultimately enhancing the overall efficiency of the duplex mode.

3. The use of DAAConv and DPMC together fully maximizes the potential of the duplex concept. This approach achieves excellent performance in the classical dataset of few-shot learning, thereby significantly outperforming existing techniques.

The remainder of this paper is structured as follows: Section 2 reviews related works in semantic segmentation, attention and self-attention, and few-shot segmentation. Section 3 describes the DAAConv and DPMC models we constructed in detail. Section 4 demonstrates the superiority of our model through adequate experiments and proves the validity of our constructed model through multiple sets of ablation experiments. Section 5 summarizes our work and provides an outlook for the future.

2. Related work

In this section, we will discuss three aspects of work that are highly relevant to our work, including semantic segmentation, attention and self-attention mechanisms, and few-shot segmentation tasks.

2.1. Semantic segmentation

Semantic segmentation aims to divide an image into regions of different semantic categories. Classical methods, such as UNet (Ronneberger et al., 2015), correspond to fully convolutional networks with a U-shaped structure and symmetric encoding and decoding paths, as proposed by Ronneberger et al. It is not only known for its excellent segmentation accuracy but also for its decent speed. Other methods, such as PSPNet (Zhao et al., 2017) and DeepLab (Chen et al., 2017a,b), are also based on fully convolutional networks (FCN; Long et al., 2015). However, Their common shortcoming is limited ability to gain long-range context information, missing the global information. Recent research has focused on how to widen the visual field to simulate the remote context of an image. Inspired by non-local (Wang et al., 2018) approaches, some methods (Chen et al., 2016; Liu et al., 2017; Ding et al., 2018; Li et al., 2019; Hou et al., 2020; Pal et al., 2022) use attentional mechanisms to establish connections between image contexts. Transformer architectures also achieve good results in semantic segmentation, focusing on multi-scale feature fusion (Zhang et al., 2020; Chen et al., 2021; Wang et al., 2021; Xie et al., 2021; Jin et al., 2022a,b,c, 2023), and contextual feature aggregation (Liu et al., 2021; Strudel et al., 2021; Yan et al., 2022). For example, SETR (Zheng et al., 2021) uses the transformer framework to serialize images to achieve a fully attention-based feature representation encoder. In Cross ViT (Chen et al., 2021), a dual-branch transformer is used to group patches of different sizes in images, and multiple interactions with the attention mechanism are performed to integrate information better. FPANet (Wu et al., 2022) utilized a lightweight feature pyramid fusion module FPFM to reduce the number of feature channels. Additionally, SeBiFPN was employed to acquire semantic and spatial information from images and to merge features from various levels.

2.2. Attention and self-attention mechanisms

The introduction of the attention mechanism has shifted the attention to important areas and ignored irrelevant parts. The application of attention mechanism can be regarded as a dynamic selection process that adaptively achieves feature weighting processing based on the importance of the input. The superiority of the attention mechanism has been demonstrated in multiple visual tasks. For example, in semantic segmentation tasks, the classic channel attention module called SENet (Hu et al., 2018) improves the representation ability of the network by modeling the interdependence among convolutional feature channels. Classic spatial attention module (SAM) can also be utilized (Zhu et al., 2019). In recent years, many hybrid attention modules, such as the convolutional block attention module (CBAM; Woo et al., 2018), which contains the channel attention module (CAM) and the spatial attention module (SAM). For instance, DANET (Fu et al., 2019) employs two distinct attention modules in the spatial and channel dimensions and combines the outputs of these modules to enhance feature representation, thereby effectively improving segmentation accuracy. MANet (Wang et al., 2022) is used to alleviate the problem of excessive complexity of non-local networks by replacing the traditional single densely connected graph with two sparsely connected graphs. Attention mechanisms have many types, and excellent hybrid attention mechanisms similar to CBAM and DANET have not yet been fully developed.

Self-attention mechanisms and non-local neural networks have been proven to be highly successful in various tasks because of their effectiveness in modeling long-range contextual information. Particularly, within the realm of natural language processing tasks, self-attention mechanisms can automatically calculate and explore the relationships between the sentences themselves and finally obtain the connections among each variable in the sentence and all variables. For example, in transformer, self-attention helps to encode specific words while still obtaining information from other words in the sentence. However, in the field of imaging, the mechanisms for paying attention have not been sufficiently developed. In image classification tasks, Bello et al. (2019) developed a novel two-dimensional relative self-attention mechanism, which infuses relative positional information while maintaining translational equivariance, thereby making it very suitable for images. This attention mechanism is used to improve the convolutional operator to replace convolution by concatenating convolutional feature maps with a set of feature maps generated by the self-attention mechanism. The construction of the attention mechanism in this paper is also inspired by this.

2.3. Few-shot segmentation

Manual annotation is time consuming, laborious, expensive, and does not fit the learning style of humans. Therefore, been studied extensively in recent years. Existing few-shot learning updates these three components by incorporating two steps: first, associating the encoder's support set and query set image features, and second, minimizing the loss of the difference between the measurement prediction and the underlying facts of the query sample. A prototype learning or feature stitching approach is adopted when we need to solve the issue about how to associate support and query images. In OSLSM (Shaban et al., 2017), a two-branch one-time semantic image segmentation method is introduced to achieve few-shot segmentation. In this method, the first branch takes the labeled image as an input and produces a vector of parameters as an output. The second branch takes these parameters and a new image as input and produces a new class of image segmentation masks as output. PL (Dong and Xing, 2018) uses a prototype network to learn a prototype for each class. Then, it computes the cosine similarity between the test sample and each prototype to predict the class label. In CANet (Zhang et al., 2019), an iterative optimization module is used to iteratively optimize the results for merged queries and supporting features. In PMMs, the proposed model enhances the representations of semantic information in images through the association of image regions with multiple prototypes. Using an expectation-maximization (EM) algorithm, the model estimates the prototype-based semantic representation. Interestingly, PMMs use duplex mode to suppress the background region. Although the simple duplex network can partially address the problem in ambiguous boundaries, the limited interaction between the support set and query set, as well as the lower exploitation of the duplex network, can negatively influence the performance of PMM. In SSA-Net (wang et al., 2022), a spatial self attention network is introduced to broaden the sensory domain and enhance representation learning by extracting valuable contextual information from deeper layers through the use of a self-attention mechanism. CRCNet (Liu et al., 2022) explains the concept of cross reference, which involves predicting and cross-referencing query images and support images simultaneously. This approach helps mitigate issues related to semantic ambiguity and feature distribution that arise during few-shot learning. However, CRCNet ignores the deeper mining of both when pursuing a large number of interactions between support and query sets.

Our study is inspired by the duplex manner in PMMs, which can effectively depress background regions in few-shot segmentation tasks and improve the accuracy of segmentation. Features extracted through the backbone network, such as Resnet, contain a significant amount of redundant information. Despite their effectiveness in capturing local details, these features often fail to provide a global information of the input data. This limitation arises from the relatively narrow perceptual field of the network, which hinders the extraction of more comprehensive and meaningful information. In light of these observations, we believe that further exploration of feature selection and representation techniques is necessary to improve the performance of deep learning models in complex tasks. Therefore, the information extracted from the backbone network should be further processed before using the duplex method to maximize the effectiveness of the method. We have also made appropriate improvements to the duplex manner in PMMs to make the support–query interaction more adequate.

3. Method

3.1. Overview

To acquire more contextual information within the learning network, extract the target regions efficiently, and play the role of duplex mode efficiently, we design DAAConv, as shown in Figure 2.

FIGURE 2

Figure 2. Overall structure of our method with the double-layer attention augmented convolutional module (DAAConv) and the dynamic prototype mixture convolutional network (DPMC).

Our model also includes two network branches: the support branch and the query branch. Two weight-sharing CNNs are used as the backbone network for feature extraction in the support and query branches. The support image's feature set S is then fed into DAAConv. After being processed by the attention module, the feature set continues to be fed into the DPMC. In DPMC, the feature set is first divided into a positive (foreground) sample set, S+, and a negative (background) sample set, S-. Subsequently, a Prototype vector is generated using the EM algorithm before proceeding to the next step with duplexing. One side of the duplex mode uses PMS to activate query features, and dynamic convolution using custom convolution kernels learned from the support set by the kernel generator, which will effectively connect the support and query sets, while the other side generates probability maps by element-wise multiplication. Finally, the two sides are combined for semantic segmentation.

In summary, we construct a new hybrid attention module called DAAConv and a new duplex network called DPMC. The two modules combined in the network can effectively obtain contextual information, focus on important regions, improve the duplex model performance, fully mine the information in support and query, and increase the support–query interaction. The complementarity of the two modules effectively addresses the lack of support–query set interaction and deeper information mining in traditional few-shot segmentation. Next, we will explain each part mentioned above in detail.

3.2. DAAConv module

Next, we will formally introduce our DAAConv module. First, to obtain the channel information of the support set, we utilize the SE (Hu et al., 2018) attention module in the first layer of the attention mechanism, which mainly consists of squeeze and excitation, to effectively determine the meaning of each channel and weight the features according to the meaning, so as to highlight the important features and repress the non-important ones. The use of this module has successfully enabled the information to be used in various ways. The use of this module successfully focuses the information on the foreground part and weakens the background part.

Specifically, we refer to the height, width, and number of input filters of an activation map, given an input tensor S of shape (H, W, C). First, we pass X through the squeeze and excitation channel attention network. Then we obtain the output:

\begin{array}{l} D A A_{1} (S) = U = S E (S) . & (1) \end{array}

Next, we feed the output U ∈ ℝ^H′ × W′ × C′ into our second layer of attention, the self-attention mechanism. For the choice of the second layer of the attention mechanism, we draw on the multi-head-attention (MHA) part of a novel attention mechanism (AAConv; Bello et al., 2019). Self-attention is a recent advancement in capturing long-range interactions, but is mainly used in sequence modeling and generative modeling tasks. In contrast, AAConv preserves translational isomorphism while injecting relative position information, hence making it well suitable for images. We only selected the multihead-attention part as our second stage of the attention mechanism:

\begin{array}{l} D A A_{2} (U) = M H A (U) . & (2) \end{array}

The composition of DAA (Double-layer Attention Augmented networks) can effectively enable features to obtain contextual information and focus attention where we need it. DAA is only a part of our double-layer attention augmented convolutional module.

\begin{array}{l} D A A (S) = D A A_{2} (D A A_{1} (S)) . & (3) \end{array}

In our experiments, we found that the improvement of segmentation accuracy is more limited if we only use DAA. DAA can effectively capture the long-distance information of an image but ignores the local information. So, we introduce an additional feature mapping in the network or the second layer of our two-layer attention module. We achieve a balance between long-range and close-range information by concatenating the convolution module, which enhances localization, with the self-attention module, which captures long-range information.

We pass the support sets extracted through the backbone network sequentially through the ordinary convolution and SAM (Zhu et al., 2019).

\begin{array}{l} X^{'} = S A M (C o n v (S)) . & (4) \end{array}

Finally, we concatenate the additional feature map obtained and the attentional feature maps generated by DAA through the concatenating operation.

\begin{array}{l} D A A C o n v (S) = S^{'} = C o n c a t [S A M (C o n v (S)), D A A (S)] . & (5) \end{array}

We solve the high memory footprint of the self-attentive mechanism by using smaller batch sizes.

3.3. DPMC networks

3.3.1. Prototype generation

After the image features have passed through the DAAConv we have designed, more contextual information is effectively extracted, and important region features are automatically captured, which will be of good help for our next processing. We will then describe in detail the DPMC that we have designed.

We name the DAAConv(X) obtained above as S′ ∈ ℝ^H″ × W″ × C″. S is spatially divided into foreground samples S^′+ for object part learning and background samples S^′− for background part learning. In the prototyping section, DPMC relies on the idea of the probability mixture model (Yang et al., 2020), as

\begin{array}{l} p (s_{i}^{'} | θ) = \sum_{k = 1}^{K} w_{k} p_{k} (s_{i}^{'} | θ) . & (6) \end{array}

where w_k represents weight, and $p_{k} (s_{i}^{'} | θ)$ denotes the k^th base model.

Next, we obtain the prototype using the EM algorithm, which consists of iterative E-steps and M-steps. The expected value of the sample s_i is calculated in each E-step.

\begin{array}{l} E_{i k} = \frac{p_{k} (s_{i^{'}} | θ)}{\sum_{k = 1}^{k} p_{k} (s_{i^{'}} | θ)} . & (7) \end{array}

In each M-step, the mean vectors are updated using the expectation, as

\begin{array}{l} μ_{k} = \frac{\sum_{i = 1}^{N} E_{i k S_{i}^{'}}}{\sum_{i = 1}^{N} E_{i k}} & (8) \end{array}

We have successfully obtained the prototype using by EM algorithm. Then, we will use our duplex mode to process the prototype we obtained.

3.3.2. Job in duplex mode

The prototype vector that corresponds to S^′+ is $μ^{+} = {μ_{k}^{+}, k = 1, \dots K}$ , and the prototype vector corresponding to S^′− is $μ^{-} = {μ_{k}^{-}, k = 1, \dots, K}$ . In the baseline, the authors have conducted ablation experiments, which demonstrate that the effect is optimal when “K = 3.” Therefore, we will not perform additional experiments and will use “K = 3” as the default value.

3.3.2.1. PMS

Distinguishing from the P-Match in baseline, we redesigned a PMS, as shown in Figure 3. We perform the Matrix Multiplication of the processed support set with the foreground prototype. The feature fusion of support sets at different scales can mine more information in the support set. We then upsample the obtained results into the query set processed by the SE module.

\begin{array}{l} Q^{'} = P M S (μ_{k}^{+}, Q, S), k = 1, \dots, K . & (9) \end{array}

Compared with baseline, the PMS we designed accomplishes a deeper mining of support set information by fusing features from different scales of support sets.

FIGURE 3

Figure 3. Visual illustration of our proposed PMS.

3.3.2.2. Dynamic convolution

For more accurate segmentation, we innovatively introduce the dynamic convolution of the feature sets obtained from the EM algorithm and the PMS. The dynamic convolution generator based on the support set can enable more sufficient interaction between the support and query sets. Specifically, the support feature set S and its corresponding masks are inputted into a kernel generator that produces the dynamic convolution ker₁, ker₂ and ker₃ (i.e., one set of quadratic kernels and two sets of asymmetric kernels). Then, for each of the three prototypes, we perform convolution operations and summation using each of these three convolution kernels.

\begin{array}{l} Q_{k}^{″} = C o n v (k e r_{k}, P M S (μ_{k}^{+}, Q, S)), k = 1, \dots, K . & (10) \end{array}

More details about the kernel generator can be found in Liu et al. (2022).

3.3.3. Another job in duplex mode

In this section, we first multiply each prototype vector by the query feature Q using Element-wise Multiplication. Consequently, the resulting graph is converted into a probability map by using the softmax function on the channels and summing them to produce two probability maps, foreground, and background, $M_{p}^{+}$ , and $M_{p}^{-}$ .

To activate the object of interest, this is then further concatenated with the query function:

\begin{array}{l} Q^{‴} = C o n c a t (M_{p}^{+}, M_{P}^{-}, Q^{″}) . & (11) \end{array}

Finally, Q^‴ is passed to a decoder to generate a segmentation mask M_Q for the query image:

\begin{array}{l} M_{Q} = C o n v (A S P P (C o n v (Q^{‴}))) . & (12) \end{array}

4. Experiments

4.1. Experimental setting

4.1.1. Datasets

In our experiment, we validated the model on two classic few-shot segmentation datasets, namely, PASCAL-5ⁱ and COCO-20ⁱ. The first dataset is generated from PASCAL VOC 2012 (Everingham et al., 2009) with additional mask annotations from SDS (Hariharan et al., 2014) and consists of 20 semantic categories evenly divided into four-folds. The second dataset is built from MS COCO (Lin et al., 2014) and is composed of 80 semantic categories divided into four folds. Notably, COCO-20ⁱ includes 40,137 images (80 categories), which is higher than PASCAL-5ⁱ. Therefore, COCO-20ⁱ is a more challenging benchmark.

4.1.2. Evaluation indicators

In our experiments, we use mIoU as our evaluation metric. mIoU is a standard metric for semantic segmentation that measures the overlap ratio between the generated and original regions (i.e., the ratio of intersection to union). A higher mIoU indicates better segmentation results. mIoU can be calculated as follows

\begin{array}{l} m I o U = \frac{1}{C} \sum_{i = 1}^{C} I o U_{i}, & (13) \end{array}

\begin{array}{l} I o U = \frac{T P}{T P + F P + F N} . & (14) \end{array}

In predicted masks, TP (true-positives) are pixels that are truly predicted to be a part of the class, FP (false-positives) are pixels that are falsely predicted to be a part of the class, and FN (false-negatives) are pixels that are falsely predicted not to be a part of the class.

4.1.3. Implementation details

Our approach takes PMMs (Yang et al., 2020) as the baseline and employs VGG-16 and ResNet50 as the backbone. To obtain the prototype, we iterated the EM algorithm for 10 rounds. We use four data enhancement strategies (Zhang et al., 2019): normalization, horizontal flipping, random cropping, and random resizing. Although limited by computational resources, we used a learning rate of 0.0035 and a batch size of four to train both datasets, which did not affect our ability to demonstrate the effectiveness of our method. We ran a total of 200,000 steps. Our experiments were implemented using PyTorch 1.7 and ran on an NVIDIA RTX 3060 12g GPU.

4.2. Duplex mode analysis

Several existing studies have proposed models for solving the few-shot segmentation task using duplex networks. However, these models have only utilized duplex networks as a tool and have not delved into further exploration of their potential. This instance makes the performance of the duplex mode not fully developed. To demonstrate that duplex mode is a good solution for few-shot segmentation tasks, we visualize the segmentation results of DPMC with duplex mode, DPMC with only a single chain in the foreground, and DPCN with excellent performance without duplex mode, as shown in Figure 4. The single chain and DPCN can also perform the segmentation task well when segmenting images with a strong difference between the object and the background. However, when the background is more similar to the segmented objects, the duplex mode shows its superiority well, such as the chair and the cow. The much better-performing DPCN does not perform well with this tricky problem and show larger errors in two tasks, cow and chair, where the background is extremely similar to the segmentation target.

FIGURE 4

Figure 4. Segmentation results of DPCN, DPMC⁺, and DPMC. DPCN represents the method used in Liu et al. (2022). The method does not use duplex networks. DPMC⁺ represents the working path that only uses the foreground (i.e., the working path where the PMS is located). DPMC represents our complete duplex network.

The experimental results in Table 1 show that the use of duplex mode effectively improves the segmentation accuracy by 3.5%.

TABLE 1

Table 1. Duplex mode analysis of our DPMC on PASCAL-5ⁱ.

4.3. Performance

PASCAL-5ⁱ: We report the mIoU in the 1-shot and 5-shot settings in Table 2. In 1-shot and 5-shot settings, they outperform state-of-the-art methods, especially for the 5-shot setting, with a backbone of ResNet50, exceeds the baseline by 7% and exceeds the previous best model HSNet by 0.2%. Our model also performs well in the 1-shot setting, thereby outperforming the baseline by 5.5%, HSNet by 2.2%, and MMNet by 0.1%. Our experimental results show that our model effectively improves the baseline and enhances the performance of the duplex mode.

TABLE 2

Table 2. Comparison with state-of-the-arts on PASCAL-5ⁱ dataset under 1-shot and 5-shot settings.

We visualized several random segmentation results in the PASCAL-5ⁱ dataset, as shown in Figure 5. Our network shows a significant improvement in segmentation compared with the baseline. We can also observed from the figure that our network can dig into finer details compared with baseline, as seen in places, such as stool legs and airplane wings. Our network can effectively distinguish and segment similar objects, such as motorbikes and cars, when they appear together, thus outperforming the baseline.

FIGURE 5

Figure 5. Segmentation results of our model and baseline.

COCO-20ⁱ: COCO-20ⁱ is more challenging as it has a larger variety of objects and greater variation than PASCAL-5ⁱ. Our model performs well in 1-shot and 5-shot settings. Table 3 reports the mIoU of our model in these settings, showing that our model significantly outperforms the baseline. Our model outperforms the baseline by 8.1% in the 1-shot setting and by 7.2% in the 5-shot setting. It also outperforms MMNet, the best performing model on COCO-20ⁱ, by 1.2% in the 1-shot setting and RePRI, the best performing model, by 0.6% in the 5-shot setting. The experimental results demonstrate that our model can perform equally well in more difficult scenarios.

TABLE 3

Table 3. Comparison with state-of-the-arts on COCO-20ⁱ dataset under 1-shot and 5-shot settings.

4.4. Ablation study

To evaluate the effectiveness of our constructed DPMC and the usefulness of DAAConv in duplex mode, we conducted a series of ablation experiments, as shown in Table 4.

TABLE 4

Table 4. Ablation study of our DPMC and DAAConv on PASCAL-5ⁱ.

4.4.1. Superiority of DPMC

According to two separate experiments conducted by PMMs and DPMC, our designed DPMC effectively improved PMMs. The segmentation accuracy of DPMC has been improved by 2.3% relative to PMMs, thus providing additional evidence that our DPMC design effectively utilizes information from support and query features to enhance image segmentation.

4.4.2. Effectiveness of DAAConv

We evaluated the segmentation results of two experiments: DPMC running alone and DPMC and DAAConv running together. Our findings indicate that the addition of DAAConv can improve segmentation accuracy by 4% in the duplex mode. This experimental result effectively demonstrates the effectiveness of our constructed hybrid attention mechanism in improving the performance of duplex mode in small sample segmentation tasks.

4.4.3. Necessity of double-layer attention structure

We conducted two experiments using DPMC with DAA (DAAConv without Conv and SAM) and DPMC with DAAConv. Our findings indicate that the SAM and Conv layers in DAAConv play a crucial role in enhancing the model's final segmentation accuracy by 1.7%.

4.4.4. Generalization of DAAConv

DAAConv is effective in several prototype models, including CANet, FWB, and PANet. When inserted after the backbone network of these models, DAAConv has improved their performance to some extent, as shown in Table 5.

TABLE 5

Table 5. Generalization ability of the proposed DAAConv.

5. Conclusion and future work

We propose a DAAconv and a DPMC based on duplex mode to solve challenging few-shot segmentation tasks. DAAConv can effectively obtain contextual information and focus on important regions, and the double-layer structure achieves a balance between long-range and close-range information. DAAConv fits well with the idea of focus and suppression of duplex network, which can effectively improve the performance of duplex mode. Meanwhile, DPMC improves the duplex strategy by fully exploiting the information in support and query and fully realizing the support–query interaction. Moreover, DPMC retains the advantages of duplex mode, which can effectively solve complex segmentation scenarios, such as ambiguous boundaries, when combined with DAAConv. Extensive experiments have shown that the combination of DAAConv and DPMC performs well in few-shot segmentation tasks.

Future work will focus on two parts. First, we will continue to improve our model as we attempt to test it on a larger dataset and continuously test it in complex real-world scenarios. Second, we will combine the algorithm with the robotics algorithm to complete a complete set of work from recognition to operation.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found at: http://host.robots.ox.ac.uk/pascal/VOC.

Author contributions

SZ: software, writing-review and editing, and writing-original draft. JY: software, conceptualization, and methodology. WL: supervision. YR: software. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Group Building Scientific Innovation Project for Universities in Chongqing (CXQT21021), Joint Training Base Construction Project for Graduate Students in Chongqing (JDLHPYJD2021016), and College Students Innovative Entrepreneurial Training Plan Program (202210618005).

Acknowledgments

We would like to thank Prof. Lei Zhang and Mr. Songming Zhang, who worked at Chongqing Jiaotong University, for their thoughtful comments on the manuscript and language revision. We were grateful to all of the study participants for their time and effort.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ao, W., Zheng, S., and Meng, Y. (2022). Few-shot semantic segmentation via mask aggregation. arXiv:2202.07231. doi: 10.48550/arXiv.2202.07231

ORIGINAL RESEARCH article

Few-shot segmentation with duplex network and attention augmented module

1. Introduction

2. Related work

2.1. Semantic segmentation

2.2. Attention and self-attention mechanisms

2.3. Few-shot segmentation

3. Method

3.1. Overview

3.2. DAAConv module

3.3. DPMC networks

3.3.1. Prototype generation

3.3.2. Job in duplex mode

3.3.2.1. PMS

3.3.2.2. Dynamic convolution

3.3.3. Another job in duplex mode

4. Experiments

4.1. Experimental setting

4.1.1. Datasets

4.1.2. Evaluation indicators

4.1.3. Implementation details

4.2. Duplex mode analysis

4.3. Performance

4.4. Ablation study

4.4.1. Superiority of DPMC

4.4.2. Effectiveness of DAAConv

4.4.3. Necessity of double-layer attention structure

4.4.4. Generalization of DAAConv

5. Conclusion and future work

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher's note

References

This article is part of the Research Topic

People also looked at