
1 Introduction

Fast, accurate and robust dense stereo matching is required for many practical applications, e.g., scene surveillance and urban terrain 3-D reconstruction. Reconstruction challenges typically result from image sensor noise, specularity and reflection, illumination changes, and occlusion [2]. To tackle these challenges, numerous dense stereo matching algorithms have been published in the literature; they fall mainly into local and global correspondence methods.

Konolige implemented a block-matching method for stereo reconstruction [10]. He estimated disparity by matching small patches of one image to the other on the basis of cross-correlation similarity. Local gradient-based methods, particularly optical flow [3], calculate the disparity map in terms of inter-frame motion and image brightness. Hirschmuller proposed a semi-global matching algorithm to improve local correspondence methods [7]. Computational stereo also can be formulated as a multi-labeling problem to compute the disparity map on the basis of the pixel-wise correspondence between the left and the right camera images. To solve such a disparity labeling problem, many global optimization approaches, e.g., intrinsic curves [15], graph cuts [1, 9], dynamic programming [11, 17], and belief propagation [14] or their convex counterparts, are used for reducing sensitivity to stereo image variations. Unfortunately, global optimization is usually a time-consuming procedure that can potentially be hampered by premature convergence.

As one of many minimally invasive surgery approaches, robotic surgery, e.g., robotic laparoscopic prostatectomy, usually uses binocular or stereo endoscopes to provide the surgeon with direct vision in close visual proximity to the surface structures of the surgical site. However, the binocular endoscopic cameras are small, with a quite narrow field of view, which limits the surgeon's sense of accurate geometrical appreciation of the surgical site.

To expand the surgeon's view of the surgical site and enhance his ability to control the surgical instruments, stereoscopic endoscopic scene 3-D reconstruction, as one of several interactive and augmented visualization strategies, has been discussed recently by the medical robotics community. To achieve accurate scene visualization, dense stereo correspondence plays a critical role in binocular endoscopic video reconstruction. Various methods have been proposed in the recent literature to address such a stereo correspondence problem. Stoyanov et al. reported a quasi-dense correspondence method for laparoscopic stereo images without compensating for tissue deformation [13]; they first detected and matched reliable features between stereo pairs and then, using these matched features as landmark structures, propagated them with zero-mean normalized cross correlation (ZNCC) as the similarity metric to determine the disparity map. However, their method fails to reconstruct image regions with uniform or homogeneous texture and can only obtain a semi-dense reconstruction. Chang et al. [5] recently proposed a stereo matching approach that constructs a cost volume in terms of intensity information and aggregates such a volume in terms of ZNCC and convex optimization [4, 8]. Although their proposed method outperforms the state of the art, it does not use sufficient color information and also neglects gradient information in the stereo pairs. Totz et al. [16] reported a matching method similar to that of Stoyanov et al. [13] and likewise obtained a semi-dense reconstruction.

Even though the methods discussed above work well in binocular endoscopic video reconstruction, they still struggle in the presence of photometric variations, specular reflections, noise, uniform texture, and occlusion. This work aims to develop fast and accurate binocular endoscopic 3-D scene reconstruction for robotic surgery. Motivated by the fast cost-volume filtering method [8, 12], we introduce a color and gradient-driven stereo correspondence framework to boost stereoscopic endoscopic scene reconstruction. Such a local filtering method avoids some of the drawbacks of the aforementioned schemes and yields a high-quality disparity map that is robust to illumination changes, making it applicable in stereoscopic endoscopic scenarios.

The main contribution of this work lies in successfully constructing the correspondence cost aggregation on the basis of three-channel color and gradient image information for matching stereoscopic endoscopic pairs and achieving fully automatic dense 3-D reconstruction of surgical scenes. Recent robotic surgery systems, e.g., the clinically used da Vinci surgical system, provide high-definition (HD) binocular endoscopic images. These HD images provide sufficient color information to allow accurate scene reconstruction even if the scene lacks texture and is noisy. Image gradient information provides the means to overcome difficulties caused by illumination changes and noise.

The rest of this paper is organized as follows. Section 2 describes the various steps of our color and gradient-boosted aggregation stereo matching method. Experimental setups and results are shown in Sect. 3. Section 4 discusses different experimental results obtained from different stereo matching methods before concluding this work in Sect. 5.

2 Approaches

This section describes our color and gradient-boosted aggregation matching method, which compares color and gradient differences between the left and right images of a stereo pair. Currently available robotic surgical procedures usually provide HD endoscopic stereo videos. These HD stereo videos contain sufficient color information that characterizes the difference between stereo pairs more powerfully than the intensity information used directly for disparity computation in 3-D reconstruction. On the other hand, gradient information, as a simple image descriptor, is much more robust to illumination and exposure changes. Based on the work of Hosni et al. [8], our stereo matching framework first defines a cost volume function with respect to color and gradient information. We then aggregate the cost volume with color and gradient structures by the guided image filtering method [6]. A winner-takes-all (WTA) optimization strategy is then employed for disparity selection. Finally, two postprocessing steps, occlusion removal and smoothing, are performed to further improve the disparity accuracy. Each step is discussed below.

2.1 Cost Construction

Suppose \(\mathbf {I}_u\) and \(\mathbf {I}_v\) denote the left and right images of a stereo pair. Image \(\mathbf {I}_u\), of size \(W\times H\) pixels (similarly \(\mathbf {I}_v\)), has three color channels \(\mathbf {I}^r_u\), \(\mathbf {I}^g_u\), and \(\mathbf {I}^b_u\) in RGB color space. For each pixel \(\mathbf {p}\) on \(\mathbf {I}_u\) and each pixel \(\mathbf {q}\) on \(\mathbf {I}_v\), a color matching cost function \(\mathcal {F}_{\alpha }(\mathbf {p},\mathbf {q})\) measures the truncated average absolute color difference \(d_{C}(\mathbf {p},\mathbf {q})\) over the three color channels:

$$\begin{aligned} \mathcal {F}_{\alpha }(\mathbf {p},\mathbf {q}) = \min \left( d_C(\mathbf {p},\mathbf {q}), \alpha \right) , \end{aligned}$$
(1)

where \(\alpha \) is a threshold and \(d_C(\mathbf {p},\mathbf {q})\) is calculated by

$$\begin{aligned} d_C(\mathbf {p},\mathbf {q})=\frac{1}{3}\sum _{i\in \{r,g,b\}}\left| \mathbf {I}^i_u(\mathbf {p})-\mathbf {I}^i_v(\mathbf {q})\right| . \end{aligned}$$
(2)
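
To make the cost concrete, the following minimal C++ sketch implements Eqs. 1 and 2. The `Img` container and its `at()` accessor are hypothetical helpers introduced only for illustration (the paper's own Visual C++ implementation is not available here); pixel values are assumed to be normalized floats, and out-of-range correspondences are clamped to the image border.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal row-major float image with interleaved channels. This container
// and its at() accessor are illustrative helpers, not part of the paper.
struct Img {
    int w, h, ch;
    std::vector<float> data;
    Img(int w_, int h_, int c_) : w(w_), h(h_), ch(c_),
        data(static_cast<std::size_t>(w_) * h_ * c_, 0.f) {}
    float& at(int x, int y, int c = 0) {
        return data[(static_cast<std::size_t>(y) * w + x) * ch + c];
    }
    float at(int x, int y, int c = 0) const {
        return data[(static_cast<std::size_t>(y) * w + x) * ch + c];
    }
};

// Eqs. (1)-(2): truncated mean absolute color difference between pixel
// p = (x, y) in the left image and q = (x + lambda, y) in the right image.
float colorCost(const Img& left, const Img& right,
                int x, int y, int lambda, float alpha) {
    int xq = std::min(std::max(x + lambda, 0), right.w - 1);  // clamp border
    float d = 0.f;
    for (int c = 0; c < 3; ++c)  // r, g, b channels
        d += std::fabs(left.at(x, y, c) - right.at(xq, y, c));
    return std::min(d / 3.f, alpha);  // truncate at threshold alpha
}
```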
Fig. 1. Input stereo images and the reconstructed scene on the basis of the constructed cost-volume disparity map

Moreover, we define a gradient matching cost function \(\mathcal {F}_\beta (\mathbf {p},\mathbf {q})\) to measure absolute gradient difference \(d_{G}(\mathbf {p},\mathbf {q})\) between the left and right images of the stereo pair:

$$\begin{aligned} \mathcal {F}_{\beta }(\mathbf {p},\mathbf {q}) = \min \left( d_G(\mathbf {p},\mathbf {q}), \beta \right) , \end{aligned}$$
(3)

where \(\beta \) is also a predefined threshold and \(d_{G}(\mathbf {p},\mathbf {q})\) compares the derivatives of the stereo pair along the x-direction:

$$\begin{aligned} d_G(\mathbf {p},\mathbf {q})=|\nabla _x\mathbf {\hat{I}}_u(\mathbf {p})-\nabla _x\mathbf {\hat{I}}_v(\mathbf {q})|, \end{aligned}$$
(4)
$$\begin{aligned} \forall \; \mathbf {p} =(x,y), \nabla _x\mathbf {\hat{I}}_u = \frac{(\mathbf {\hat{I}}_u(x+1,y)-\mathbf {\hat{I}}_u(x-1,y))}{2}, \end{aligned}$$
(5)

where \(\mathbf {\hat{I}}_u\) and \(\mathbf {\hat{I}}_v\) are grayscale images converted from color images \(\mathbf {I}_u\) and \(\mathbf {I}_v\). \(\nabla _x\mathbf {\hat{I}}_v\) is similarly defined as \(\nabla _x\mathbf {\hat{I}}_u\).
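
Continuing the sketch above and reusing the hypothetical `Img` helper, the gradient cost of Eqs. 3–5 can be written as follows; the border handling by clamping is again an illustrative assumption.

```cpp
// Eq. (5): central difference along x on a grayscale image (the
// x-derivative used by the gradient cost); border pixels are clamped.
float gradX(const Img& gray, int x, int y) {
    int xp = std::min(x + 1, gray.w - 1), xm = std::max(x - 1, 0);
    return 0.5f * (gray.at(xp, y) - gray.at(xm, y));
}

// Eqs. (3)-(4): truncated absolute difference of x-gradients between the
// left pixel p = (x, y) and the right candidate q = (x + lambda, y).
float gradientCost(const Img& grayL, const Img& grayR,
                   int x, int y, int lambda, float beta) {
    int xq = std::min(std::max(x + lambda, 0), grayR.w - 1);
    float d = std::fabs(gradX(grayL, x, y) - gradX(grayR, xq, y));
    return std::min(d, beta);  // truncate at threshold beta
}
```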

Based on Eqs. 1 and 3, a stereo image color and gradient-boosted matching cost function \(\mathcal {F}_\mu \) can be defined as:

$$\begin{aligned} \mathcal {F}_\mu (\mathbf {p},\mathbf {q}) = (1-\mu )\mathcal {F}_\alpha (\mathbf {p},\mathbf {q}) + \mu \mathcal {F}_\beta (\mathbf {p},\mathbf {q}), \end{aligned}$$
(6)

where the predefined constant parameter \(\mu \) is introduced to balance the color and gradient costs.

Finally, for pixel \(\mathbf {p}=(x,y)\) on \(\mathbf {I}_u\) and its potential correspondence \(\mathbf {q} = (x+\lambda , y)\) with disparity \(\lambda \) on \(\mathbf {I}_v\), a cost volume \(\mathcal {V}(\mathbf {p}, \lambda )\) is constructed in a 3-D space \(\mathfrak {R}^3 =\{x\in [1,W],y\in [1,H],\lambda \in [\lambda _{min},\lambda _{max}]\}\) (\(\lambda _{min}\) and \(\lambda _{max}\) are predefined minimal and maximal disparity values):

$$\begin{aligned} \forall \;(x,y,\lambda )\in \mathfrak {R}^3, \mathcal {V}(\mathbf {p}, \lambda ) = \mathcal {F}_\mu (\mathbf {p}(x, y),\mathbf {q}(x+\lambda , y)). \end{aligned}$$
(7)
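
A straightforward sketch of this construction, storing the volume as one cost slice per candidate disparity and blending the two costs of the previous sketches with \(\mu \), is given below; the loop bounds and storage layout are illustrative choices, not the paper's.

```cpp
// Eqs. (6)-(7): build the W x H x (lmax - lmin + 1) cost volume by blending
// the color and gradient costs with the fixed weight mu.
std::vector<Img> buildCostVolume(const Img& left, const Img& right,
                                 const Img& grayL, const Img& grayR,
                                 int lmin, int lmax,
                                 float alpha, float beta, float mu) {
    std::vector<Img> volume;
    for (int lambda = lmin; lambda <= lmax; ++lambda) {
        Img slice(left.w, left.h, 1);  // one cost slice per disparity
        for (int y = 0; y < left.h; ++y)
            for (int x = 0; x < left.w; ++x)
                slice.at(x, y) =
                    (1.f - mu) * colorCost(left, right, x, y, lambda, alpha)
                    + mu * gradientCost(grayL, grayR, x, y, lambda, beta);
        volume.push_back(slice);
    }
    return volume;
}
```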

Figure 1 shows the reconstructed surface (Fig. 1(d)) after constructing the cost volume disparity map (Fig. 1(c)) in terms of input left and right images (Fig. 1(a) and (b)).

2.2 Cost Aggregation

The cost construction is a generally ambiguous process: correct matches can easily have a higher cost than incorrect ones because of image noise. To deal with mismatches in the stereo pair, we use guided filtering to weight the cost volume \(\mathcal {V}(\mathbf {p}, \lambda )\) built above and obtain a filtered cost volume \(\mathcal {\tilde{V}}(\mathbf {p}, \lambda )\) [6]:

$$\begin{aligned} \mathcal {\tilde{V}}(\mathbf {p}, \lambda ) = \sum _{\mathbf {k}} \varPsi _{\mathbf {p},\mathbf {k}}(\mathbf {I}_u)\mathcal {V}(\mathbf {k}, \lambda ), \end{aligned}$$
(8)

which employs the left color image \(\mathbf {I}_u\) as the guidance image.

Fig. 2. Weighted disparity map (a) from Fig. 1(a), optimized disparity map (c) from (a), and the corresponding reconstructed scenes (b) and (d)

Let \(\varOmega _\mathbf {o}\) be a square region, of size \(a\times a\), centered at pixel \(\mathbf {o}\) on image \(\mathbf {I}_u\). Weight \(\varPsi _{\mathbf {p},\mathbf {k}}(\mathbf {I}_u)\) is computed by

$$\begin{aligned} \varPsi _{\mathbf {p},\mathbf {k}} = \frac{1}{N_\mathbf {o}}\sum _{\mathbf {o}\in \varOmega _\mathbf {p}\cap \varOmega _\mathbf {k}}\left( 1+ (\mathbf {I}_u(\mathbf {p})-\delta _{\mathbf {o}})^T\varLambda (\mathbf {I}_u(\mathbf {k})-\delta _{\mathbf {o}})\right) , \end{aligned}$$
(9)

where \(N_\mathbf {o}=N_\mathbf {p}N_\mathbf {k}\), with \(N_\mathbf {p}\) and \(N_\mathbf {k}\) the numbers of pixels in the square regions \(\varOmega _\mathbf {p}\) and \(\varOmega _\mathbf {k}\), and the sum runs over all pixels \(\mathbf {o}\) whose region \(\varOmega _\mathbf {o}\) contains both \(\mathbf {p}\) and \(\mathbf {k}\), i.e., \(\mathbf {o}\in \varOmega _\mathbf {p}\cap \varOmega _\mathbf {k}\). Color image \(\mathbf {I}_u=(\mathbf {I}^r_u,\mathbf {I}^g_u,\mathbf {I}^b_u)^T\), the average color \(\delta _{\mathbf {o}}=(\delta ^r_{\mathbf {o}},\delta ^g_{\mathbf {o}},\delta ^b_{\mathbf {o}})^T\), and each average color channel of \(\mathbf {I}_u\) over \(\varOmega _\mathbf {o}\) is calculated by

$$\begin{aligned} \delta ^i_{\mathbf {o}}=\frac{1}{N_\mathbf {o}}\sum _{\mathbf {p}\in \varOmega _\mathbf {o}}\mathbf {I}^i_u(\mathbf {p}), \quad i\in \{r,g,b\}. \end{aligned}$$
(10)

Term \(\varLambda =(\varSigma _\mathbf {o} + \epsilon \mathbf {A})^{-1}\), where \(\mathbf {A}\) is the \(3\times 3\) identity matrix, \(\epsilon \) is a smoothness factor, and \(\varSigma _\mathbf {o}\) is the covariance matrix of square region \(\varOmega _\mathbf {o}\) from image \(\mathbf {I}_u\), composed of the channel covariances \(\sigma ^{ij}_\mathbf {o}\) on \(\varOmega _\mathbf {o}\):

$$\begin{aligned} \varSigma _\mathbf {o}= \left( \begin{array}{ccc} \sigma ^{rr}_\mathbf {o} &{} \sigma ^{rg}_\mathbf {o} &{} \sigma ^{rb}_\mathbf {o} \\ \sigma ^{rg}_\mathbf {o} &{} \sigma ^{gg}_\mathbf {o} &{} \sigma ^{gb}_\mathbf {o} \\ \sigma ^{rb}_\mathbf {o} &{} \sigma ^{gb}_\mathbf {o} &{} \sigma ^{bb}_\mathbf {o} \end{array} \right) _{3\times 3}, \; i,j\in \{r,g,b\}. \end{aligned}$$
(11)
$$\begin{aligned} \sigma ^{ij}_\mathbf {o}=\frac{1}{N_\mathbf {o}}\sum _{\mathbf {p}\in \varOmega _\mathbf {o}}\mathbf {I}^i_u(\mathbf {p})\mathbf {I}^j_u(\mathbf {p}) - \delta ^i_{\mathbf {o}}\delta ^j_{\mathbf {o}}, \; i,j\in \{r,g,b\}. \end{aligned}$$
(12)
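
To illustrate the aggregation, the sketch below implements a simplified single-channel (grayscale-guidance) guided filter in the linear-model form of He et al. [6], which corresponds to the kernel of Eq. 9 restricted to one guidance channel; the paper itself uses the full three-channel color guidance of Eqs. 9–12. The brute-force \(O(r^2)\) box mean is written for clarity rather than speed (an integral-image version would be \(O(1)\) per pixel).

```cpp
// Box mean over a (2r+1)^2 window with clamped borders; brute force for
// readability only.
Img boxMean(const Img& src, int r) {
    Img out(src.w, src.h, 1);
    for (int y = 0; y < src.h; ++y)
        for (int x = 0; x < src.w; ++x) {
            float sum = 0.f; int n = 0;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int xx = std::min(std::max(x + dx, 0), src.w - 1);
                    int yy = std::min(std::max(y + dy, 0), src.h - 1);
                    sum += src.at(xx, yy); ++n;
                }
            out.at(x, y) = sum / n;
        }
    return out;
}

// Guided filtering of one cost slice with a grayscale guidance image
// (He et al. [6]): fit a local linear model a*I + b per window, then
// average the coefficients. eps plays the role of the smoothness factor.
Img guidedFilterSlice(const Img& guide, const Img& cost, int r, float eps) {
    Img gc(guide.w, guide.h, 1), gg(guide.w, guide.h, 1);
    for (int y = 0; y < guide.h; ++y)
        for (int x = 0; x < guide.w; ++x) {
            gc.at(x, y) = guide.at(x, y) * cost.at(x, y);
            gg.at(x, y) = guide.at(x, y) * guide.at(x, y);
        }
    Img mG = boxMean(guide, r), mC = boxMean(cost, r);
    Img mGC = boxMean(gc, r), mGG = boxMean(gg, r);
    Img a(guide.w, guide.h, 1), b(guide.w, guide.h, 1);
    for (int y = 0; y < guide.h; ++y)
        for (int x = 0; x < guide.w; ++x) {
            float varG  = mGG.at(x, y) - mG.at(x, y) * mG.at(x, y);
            float covGC = mGC.at(x, y) - mG.at(x, y) * mC.at(x, y);
            a.at(x, y) = covGC / (varG + eps);
            b.at(x, y) = mC.at(x, y) - a.at(x, y) * mG.at(x, y);
        }
    Img mA = boxMean(a, r), mB = boxMean(b, r);
    Img out(guide.w, guide.h, 1);
    for (int y = 0; y < guide.h; ++y)
        for (int x = 0; x < guide.w; ++x)
            out.at(x, y) = mA.at(x, y) * guide.at(x, y) + mB.at(x, y);
    return out;
}
```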

After aggregating the cost volume, the disparity and reconstruction become more accurate (Fig. 2(a) and (b)).

Fig. 3. Accuracy-enhanced disparity obtained by occlusion rejection and smoothing from the optimized disparity in Fig. 2(c)

2.3 Disparity Determination

Based on the aggregated cost volume \(\mathcal {\tilde{V}}(\mathbf {p}, \lambda )\), we use WTA to select disparity \(\tilde{\lambda }(\mathbf {p})\) for minimizing the matching cost relative to pixel \(\mathbf {p}\) within range \(\mathcal {D}=[\lambda _{min}, \lambda _{max}]\):

$$\begin{aligned} \tilde{\lambda }(\mathbf {p})=\arg \min _{\lambda \in \mathcal {D}}\mathcal {\tilde{V}}(\mathbf {p}, \lambda ). \end{aligned}$$
(13)
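
In code, WTA selection is a per-pixel arg-min over the filtered cost slices (again reusing the hypothetical `Img` type and the slice-per-disparity volume layout of the earlier sketches):

```cpp
// Eq. (13): winner-takes-all disparity selection over the filtered volume.
Img selectDisparityWTA(const std::vector<Img>& volume, int lmin) {
    const Img& first = volume.front();
    Img disp(first.w, first.h, 1);
    for (int y = 0; y < first.h; ++y)
        for (int x = 0; x < first.w; ++x) {
            int best = 0;  // index of the slice with minimal cost at (x, y)
            for (int i = 1; i < (int)volume.size(); ++i)
                if (volume[i].at(x, y) < volume[best].at(x, y)) best = i;
            disp.at(x, y) = (float)(lmin + best);
        }
    return disp;
}
```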

Figure 2(c) displays the chosen disparity map and its corresponding reconstructed scene (Fig. 2(d)).

2.4 Occlusion Removal

Occlusion is often unavoidable in stereo matching procedures. Detecting occlusion involves checking whether the disparity calculated in the left image is consistent with that calculated in the right image: a pixel is marked as occluded, removed, and assigned the minimal disparity value if \(\tilde{\lambda }(\mathbf {p})\ne -\lambda (\mathbf {p} + \tilde{\lambda }(\mathbf {p}))\), where \(\lambda \) denotes the disparity map computed with the right image as the reference. To allow for true or subpixel disparity, we employ a more general constraint for removing occluded pixels:

$$\begin{aligned} |\tilde{\lambda }(\mathbf {p})+\lambda (\mathbf {p} + \tilde{\lambda }(\mathbf {p}))|>1. \end{aligned}$$
(14)

Figure 3(a) shows the detected occlusion using Eq. 14. Figure 3(b) gives the result after occlusion rejection.
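
A sketch of this left–right consistency check, assuming `dispL` and `dispR` hold the disparities computed with the left and right image as reference, respectively:

```cpp
// Eq. (14): left-right consistency check. Pixels whose two disparities do
// not cancel to within one pixel are marked occluded and reset to lmin.
void removeOcclusions(Img& dispL, const Img& dispR, int lmin) {
    for (int y = 0; y < dispL.h; ++y)
        for (int x = 0; x < dispL.w; ++x) {
            int xr = x + (int)dispL.at(x, y);  // corresponding right pixel
            xr = std::min(std::max(xr, 0), dispR.w - 1);
            if (std::fabs(dispL.at(x, y) + dispR.at(xr, y)) > 1.f)
                dispL.at(x, y) = (float)lmin;  // occluded: reset disparity
        }
}
```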

2.5 Disparity Smoothness

The occlusion removal step usually introduces edge blurring in the disparity map (Fig. 3(b)). To preserve edges in the disparity map, we use bilateral filtering to weight each disparity after the occlusion rejection step:

$$\begin{aligned} \varphi _{\mathbf {st}}=\exp \left( -\frac{\Vert \mathbf {s}-\mathbf {t}\Vert ^2}{\gamma ^2_1}-\frac{\Vert \mathbf {I}_u(\mathbf {s})-\mathbf {I}_u(\mathbf {t})\Vert ^2}{\gamma ^2_2}\right) , \end{aligned}$$
(15)

where pixel \(\mathbf {t}\) lies in the square region \(\varOmega _\mathbf {s}\) of size \(s\times s\) centered at pixel \(\mathbf {s}\), \(\gamma _1\) and \(\gamma _2\) control the pixel spatial and color similarities, and \(\Vert \cdot \Vert \) denotes the Euclidean distance. We then calculate the cumulative histogram \(\mathcal {H}(\tilde{\lambda })\) for every disparity value \(\tilde{\lambda }\):

$$\begin{aligned} \forall \;\tilde{\lambda }\in \mathcal {D}=[\lambda _{min},\lambda _{max}], \mathcal {H}(\tilde{\lambda })=\sum _{\mathbf {t}\in \varOmega _\mathbf {s}|\tilde{\lambda }(\mathbf {t})\le \tilde{\lambda }}\varphi _{\mathbf {st}}. \end{aligned}$$
(16)

Finally, the optimal disparity \(\tilde{\lambda }_*(\mathbf {p})\) at pixel \(\mathbf {p}\) is decided by

$$\begin{aligned} \tilde{\lambda }_*(\mathbf {p})=\{\tilde{\lambda }|\mathcal {H}(\tilde{\lambda })\ge 0.5\mathcal {H}(\lambda _{max})\}. \end{aligned}$$
(17)

Figure 3(b) shows the disparity map with edge blurring after the occlusion is removed. Figure 3(c) gives the final disparity map after the smoothing step using bilateral filtering.
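
The bilateral-weighted median of Eqs. 15–17 can be sketched as follows; the window radius and the \(\gamma \) parameters are placeholder values, and the brute-force per-pixel histogram loop favors clarity over speed:

```cpp
// Eqs. (15)-(17): bilateral-weighted median over a square window. The
// bilateral weights phi_st form a histogram over candidate disparities;
// the output is the disparity whose cumulative weight first reaches half
// the total weight (the weighted median).
Img weightedMedianSmooth(const Img& color, const Img& disp,
                         int r, float gamma1, float gamma2,
                         int lmin, int lmax) {
    Img out(disp.w, disp.h, 1);
    std::vector<float> hist(lmax - lmin + 1);
    for (int y = 0; y < disp.h; ++y)
        for (int x = 0; x < disp.w; ++x) {
            std::fill(hist.begin(), hist.end(), 0.f);
            float total = 0.f;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int xx = std::min(std::max(x + dx, 0), disp.w - 1);
                    int yy = std::min(std::max(y + dy, 0), disp.h - 1);
                    float ds = (float)(dx * dx + dy * dy);  // spatial term
                    float dc = 0.f;                         // color term
                    for (int c = 0; c < 3; ++c) {
                        float diff = color.at(x, y, c) - color.at(xx, yy, c);
                        dc += diff * diff;
                    }
                    float w = std::exp(-ds / (gamma1 * gamma1)
                                       - dc / (gamma2 * gamma2));  // Eq. (15)
                    int bin = std::min(std::max((int)disp.at(xx, yy) - lmin, 0),
                                       lmax - lmin);
                    hist[bin] += w; total += w;             // Eq. (16)
                }
            float cum = 0.f; int bin = 0;                   // Eq. (17)
            while (bin < (int)hist.size() - 1 && cum + hist[bin] < 0.5f * total)
                cum += hist[bin++];
            out.at(x, y) = (float)(lmin + bin);
        }
    return out;
}
```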

Algorithm 1. The color and gradient-driven matching method for 3-D reconstruction

In general, the color and gradient-driven matching method for 3-D reconstruction is summarized in Algorithm 1.
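
Under the assumptions of the earlier sketches, the overall pipeline might be tied together as below. The function name, the parameter values, and the negative disparity convention (\(\lambda \in [-64, 0]\) for the left reference view) are illustrative placeholders, not the paper's settings.

```cpp
// End-to-end sketch of Algorithm 1: build the cost volume (Sect. 2.1),
// aggregate each slice with guided filtering using the (grayscale) left
// guidance image (Sect. 2.2), pick disparities by WTA (Sect. 2.3), then
// run the two post-processing steps (Sects. 2.4-2.5).
Img reconstructDisparity(const Img& left, const Img& right,
                         const Img& grayL, const Img& grayR) {
    const int lmin = -64, lmax = 0;                 // illustrative range
    const float alpha = 0.03f, beta = 0.01f;        // truncation thresholds
    const float mu = 0.9f, eps = 1e-4f;             // blend, smoothness
    auto volL = buildCostVolume(left, right, grayL, grayR,
                                lmin, lmax, alpha, beta, mu);
    for (auto& slice : volL)
        slice = guidedFilterSlice(grayL, slice, 9, eps);
    Img dispL = selectDisparityWTA(volL, lmin);
    // Right-reference disparity for the consistency check of Eq. (14):
    // the same pipeline with the roles of the two images swapped.
    auto volR = buildCostVolume(right, left, grayR, grayL,
                                -lmax, -lmin, alpha, beta, mu);
    for (auto& slice : volR)
        slice = guidedFilterSlice(grayR, slice, 9, eps);
    Img dispR = selectDisparityWTA(volR, -lmax);
    removeOcclusions(dispL, dispR, lmin);
    return weightedMedianSmooth(left, dispL, 9, 9.f, 0.1f, lmin, lmax);
}
```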

Fig. 4. Visual comparison of disparity maps computed with the SBM [10], SGM [7], QSM [13], and CAM matching methods

Fig. 5. Visual comparison of reconstructed scenes corresponding to the disparity maps in Fig. 4

3 Experimental Results

Clinical stereoscopic endoscopic video sequences were recorded by the da Vinci Si surgical system (Intuitive Surgical Inc., Sunnyvale, CA, USA) during a robotic prostatectomy procedure. The HD (1920 \(\times \) 1080) video frames were downsampled to 320 \(\times \) 180 pixels. We compared the following methods: (1) stereo block matching (SBM) [10], (2) semi-global matching (SGM) [7], (3) quasi-dense stereo matching (QSM) [13], and (4) our color and gradient-boosted cost aggregation matching (CAM), as discussed in Sect. 2. All methods were implemented in Microsoft Visual C++ and tested on a laptop running the 64-bit Windows 8.1 Professional system with 16.0 GB of memory and an eight-core Intel(R) Core(TM) i7 CPU.

Figure 4 compares the disparity maps estimated by the different stereo matching methods, and Fig. 5 visualizes the scenes reconstructed from these disparity maps. Figure 6(a) compares the reconstruction density of the different correspondence methods: the percentage of reconstructed pixels was 81.7 %, 87.6 %, 60.8 %, and 99.5 % for SBM [10], SGM [7], QSM [13], and CAM, respectively; the proposed method thus reconstructed almost all the pixels of the stereo color images. Figure 6(b) reports the processing time of the disparity computation of each compared method: 143, 146, 532, and 428 milliseconds (ms) for SBM [10], SGM [7], QSM [13], and CAM, respectively.

Fig. 6. Comparison of the reconstruction density and the processing time of the disparity computation for the different methods

4 Discussion

The objective of this work is to explore fast and robust 3-D scene reconstruction of binocular endoscopic video sequences under image variations including noise, occlusion, specular reflection, and motion blurring. In general, we achieved fast and accurate binocular endoscopic stereo scene reconstruction for augmenting image-guided robotic surgery.

Fig. 7. An example of collapsed reconstruction regions (green circles) due to image sensor noise and edge blurring

The effectiveness of our proposed stereo matching approach lies in two aspects of the matching cost aggregation function. First, sufficient color information is propagated into the disparity map estimation. Compared to intensity-based local matching methods (e.g., SBM and SGM) [7, 10], color is a more powerful characteristic for tackling textureless, noisy stereo images. Furthermore, gradient-based cost aggregation preserves disparity edges and is robust to illumination variation, occlusion, and motion blurring. In contrast to feature-driven semi-dense matching (e.g., QSM [13]), the gradient cost does not suffer from the featureless problem. The color and gradient-boosted correspondence strategies balance each other in constructing the matching cost, enabling it to outperform the other currently available stereo matching methods evaluated here.

Although our proposed stereo matching method performs much better than the others, it still suffers from image sensor noise and edge blurring, as shown in Fig. 7. To address these problems, edge-preserving filtering methods, which are robust to image variations, can be used to improve the quality of the computed disparity map during binocular endoscopic scene reconstruction. On the other hand, we currently employ a simple optimization method, i.e., the WTA strategy, which is sensitive to image noise. Although global optimization methods for dense stereo matching are somewhat time-consuming, they are another option for further dealing with image noise. Additionally, we still need to improve the processing time of the disparity map calculation. Graphics processing unit (GPU) and multi-threading programming techniques can be used to accelerate our stereo matching and reconstruction procedures to meet real-time requirements in clinical applications.

5 Conclusions

This work proposed a color and gradient-boosted cost aggregation stereo matching framework to reconstruct binocular endoscopic video scenes. We integrate the color and gradient information of HD images to construct the matching cost function, making it robust to stereoscopic endoscopic image variations, and employ a guided image filtering procedure to aggregate the matching cost volume. Experimental results demonstrate that our proposed stereo correspondence framework significantly outperforms the other matching methods evaluated, improving the reconstruction density from 87.6 % to 99.5 %. Future work includes reducing the computational time of stereo matching, improving noise insensitivity and the edge-preserving filtering procedure, and incorporating global optimization methods.