Abstract
We present a novel method for recovering the 3D structure and scene flow from calibrated multi-view sequences. We propose a 3D point cloud parametrization of the 3D structure and scene flow that allows us to directly estimate the desired unknowns. A unified global energy functional is proposed to incorporate the information from the available sequences and simultaneously recover both depth and scene flow. The functional enforces multi-view geometric consistency and imposes brightness constancy and piecewise smoothness assumptions directly on the 3D unknowns. It inherently handles the challenges of discontinuities, occlusions, and large displacements. The main contribution of this work is the fusion of a 3D representation and an advanced variational framework that directly uses the available multi-view information. This formulation allows us to advantageously bind the 3D unknowns in time and space. Unlike optical flow and disparity, the proposed method results in a nonlinear mapping between image coordinates, giving rise to additional challenges in the optimization process. Our experiments on real and synthetic data demonstrate that the proposed method successfully recovers the 3D structure and scene flow despite the complicated nonconvex optimization problem.
Notes
The source code is publicly available.
References
Ayvaci, A., Raptis, M., & Soatto, S. (2010). Occlusion detection and motion estimation with convex optimization. NIPS (pp. 100–108).
Basha, T., Moses, Y., & Kiryati, N. (2010). Multi-view scene flow estimation: A view centered variational approach. In Proc. IEEE conf. comp. vision patt. recog. (pp. 1506–1513).
Ben-Ari, R., & Sochen, N. A. (2007). Variational stereo vision with sharp discontinuities and occlusion handling. In Proc. int. conf. comp. vision (pp. 1–7).
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In Proc. European conf. comp. vision (pp. 25–36).
Carceroni, R. L., & Kutulakos, K. N. (2002). Multi-view scene capture by surfel sampling: from video streams to non-rigid 3d motion, shape and reflectance. International Journal of Computer Vision, 49(2–3), 175–214.
Courchay, J., Pons, J. P., Monasse, P., & Keriven, R. (2009). Dense and accurate spatio-temporal multi-view stereovision. In Asian conf. on computer vision (pp. 11–22).
Felzenszwalb, P., & Huttenlocher, D. (2006). Efficient belief propagation for early vision. International Journal of Computer Vision, 70(1), 41–54.
Furukawa, Y., & Ponce, J. (2008). Dense 3d motion capture from synchronized video streams. In Proc. IEEE conf. comp. vision patt. recog.
Huguet, F., & Devernay, F. (2007). A variational method for scene flow estimation from stereo sequences. In Proc. int. conf. comp. vision (pp. 1–7).
Isard, M., & MacCormick, J. (2006). Dense motion and disparity estimation via loopy belief propagation. In Asian conf. on computer vision (Vol. 3852, p. 32).
Li, R., & Sclaroff, S. (2008). Multi-scale 3d scene flow from binocular stereo sequences. Computer Vision and Image Understanding, 110(1), 75–90.
Min, D. B., & Sohn, K. (2006). Edge-preserving simultaneous joint motion-disparity estimation. In Proc. international conf. patt. recog. (pp. 74–77).
Neumann, J., & Aloimonos, Y. (2002). Spatio-temporal stereo using multi-resolution subdivision surfaces. International Journal of Computer Vision, 47(1–3), 181–193.
Pock, T., Schoenemann, T., Graber, G., Bischof, H., & Cremers, D. (2008). A convex formulation of continuous multi-label problems. In Proc. European conf. comp. vision (pp. 792–805).
Pons, J., Keriven, R., & Faugeras, O. (2007). Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2), 179–193.
Robert, L., & Deriche, R. (1996). Dense depth map reconstruction: A minimization and regularization approach which preserves discontinuities. In Proc. European conf. comp. vision (pp. 439–451).
Scharstein, D., & Szeliski, R. Middlebury stereo vision research page. http://vision.middlebury.edu/stereo.
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3), 7–42.
Scharstein, D., & Szeliski, R. (2003). High-accuracy stereo depth maps using structured light. In Proc. IEEE conf. comp. vision patt. recog. (pp. 195–202).
Strecha, C., Tuytelaars, T., & Gool, L. J. V. (2003). Dense matching of multiple wide-baseline views. In Proc. int. conf. comp. vision (pp. 1194–1201).
Vedula, S., Baker, S., Rander, P., Collins, R. T., & Kanade, T. (1999). Three-dimensional scene flow. In Proc. int. conf. comp. vision (pp. 722–729).
Vedula, S., Baker, S., Rander, P., Collins, R. T., & Kanade, T. (2005). Three-dimensional scene flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), 475–480.
Vedula, S., Baker, S., Seitz, S., & Kanade, T. (2000). Shape and motion carving in 6D. In Proc. IEEE conf. comp. vision patt. recog. (Vol. 2).
Wedel, A., Brox, T., Vaudrey, T., Rabe, C., Franke, U., & Cremers, D. (2011). Stereoscopic scene flow computation for 3d motion understanding. International Journal of Computer Vision, 95(1), 29–51.
Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., & Cremers, D. (2008). Efficient dense scene flow from sparse or dense stereo data. In Proc. European conf. comp. vision (pp. 739–751).
Woodford, O. J., Torr, P. H. S., Reid, I. D., & Fitzgibbon, A. W. (2009). Global stereo reconstruction under second-order smoothness priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12), 2115–2128.
Young, D. (1954). Iterative methods for solving partial difference equations of elliptic type. Transactions of the American Mathematical Society, 76(1), 92–111.
Zhang, Y., & Kambhamettu, C. (2000). Integrated 3d scene flow and structure recovery from multiview image sequences. In Proc. IEEE conf. comp. vision patt. recog. (Vol. 2, pp. 674–681).
Zhang, Y., & Kambhamettu, C. (2001). On 3d scene flow and structure estimation. In Proc. IEEE conf. comp. vision patt. recog. (pp. 778–785).
Acknowledgements
The authors are grateful to the A.M.N. foundation for its generous financial support.
Additional information
An earlier version of part of this work appeared in CVPR 2010 (Basha et al. 2010).
Appendices
Appendix A: Mapping Between Images
Our 3D parametrization in the presented framework introduces a nonlinear transformation of the 3D unknowns, Z and V, to each image plane. A notable challenge in the minimization of the proposed functional arises from the nontrivial mapping between the image coordinates and the reference camera coordinate system.
Using our parametrization, each pixel in the reference camera, (x, y), and its corresponding depth, Z(x, y), specify a 3D point, P (see Eq. (5)). It follows that projecting P onto the ith camera maps (x, y, Z(x, y)) to the point \(\textbf{p}_{i} = (x_{i}, y_{i})^{T}\). That is,
where \(f^{i}\) is the mapping to the corresponding ith image. More precisely, \(f^{i}\) is given by substituting Eq. (5) into Eq. (1). For example, the component \(x_{i}\) is given by:
The coefficients a, b, c, and d depend on the reference camera coordinates, (x, y):
where \(M^{i}\) is the 3×4 projection matrix of the ith camera (subscripts denote the row and column indices). The expression for \(y_{i}\) is computed analogously.
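The mapping can be sketched with the standard pinhole projection; the exact forms of Eqs. (5) and (1) are not reproduced here, and the matrix and point values below are hypothetical. The perspective division mirrors the rational dependence of \(x_{i}\) on Z described above:

```python
import numpy as np

def project(M, P):
    """Project a 3D point P onto the image of a camera with 3x4 matrix M.

    Returns pixel coordinates (x_i, y_i). The homogeneous projection
    followed by perspective division is what makes x_i a ratio of two
    expressions that are affine in the depth Z.
    """
    p = M @ np.append(P, 1.0)   # homogeneous projection
    return p[:2] / p[2]         # perspective division

# Toy example with an identity-like camera (hypothetical values):
M_i = np.hstack([np.eye(3), np.zeros((3, 1))])
P = np.array([2.0, 1.0, 4.0])   # 3D point at depth Z = 4
x_i, y_i = project(M_i, P)      # -> (0.5, 0.25)
```

Projecting \(\textbf{P} + \textbf{V}\) with the same routine yields the corresponding point at time t + 1.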
Similarly, at time step t+1, projecting \(\widehat{\textbf {P}}= \textbf {P}+ \textbf {V}\) maps (x,y,Z(x,y),V(x,y)) to \(\widehat{\textbf {p}}_{i}\), denoted by a mapping, \(\widehat{f}^{i}\):
Analogously to Eq. (18), the component \(\widehat{x}_{i}\) is given by:
where the coefficients a, b, c, and d are defined in Eq. (19).
Appendix B: Image Derivatives with Respect to the 3D Unknowns
A first step toward the numerical solution of the resulting Euler-Lagrange equations (Eq. (12) or Eq. (13)) requires computing the derivatives of the intensity functions with respect to the 3D unknowns. To produce the final expressions for these derivatives, the nonlinear relation between the 3D unknowns and the image plane has to be carefully considered (see Appendix A). This appendix shows how these computations are performed. The mathematical analysis is performed in the continuous domain; thus, the frames as well as the 3D unknowns are regarded as continuous functions. Finally, the resulting equations are discretized using standard approximations for the derivatives.
For simplicity, given a time step, t, we use the intensity functions \(I_{i}^{t}\) and \(I_{i}^{t+1}\) to abbreviate I i (p i ,t) and \(I_{i}(\widehat{\textbf {p}}_{i}, t+1)\), respectively. We next elaborate on the computation of derivatives of \(I_{i}^{t}\) and \(I_{i}^{t+1}\) with respect to Z and u, denoted by \(\partial_{Z}I_{i}^{t}\), \(\partial_{Z}I_{i}^{t+1}\) and \(\partial_{u}I_{i}^{t+1}\) (the other derivatives with respect to v and w are similarly computed).
\(I_{i}^{t}\) can be regarded as a function of the reference image coordinates, (x,y), and the corresponding depth, Z(x,y), by considering a composition of two functions: the ith intensity function and the mapping transformation, defined in Appendix A. That is,
Similarly, \(I_{i}^{t+1}\) can be regarded as a function of (x,y,Z(x,y)) and V. That is,
Considering Eqs. (17)–(20), the chain rule is applied to compute the partial derivatives:
The derivatives \(\partial_{Z}\textbf {p}_{i}^{T} = ( \partial_{Z}x_{i}, \partial_{Z}y_{i})^{T}\) are directly computed from Eqs. (18)–(19).
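As a sketch of how these derivatives arise (assuming Eq. (18) has the rational form \(x_{i} = (aZ + b)/(cZ + d)\) with coefficients independent of Z), the quotient rule gives a closed form:

```latex
\partial_Z x_i
  = \frac{a(cZ + d) - c(aZ + b)}{(cZ + d)^2}
  = \frac{ad - bc}{(cZ + d)^2},
```

and \(\partial_{Z}y_{i}\) follows analogously from the corresponding expression for \(y_{i}\).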
To compute the derivative of \(I_{i}^{t}\) with respect to \(\textbf{p}_{i}\), \((\nabla I_{i}^{t})^{T}\), we use a warping approach. As discussed in Appendix A, a nonlinear mapping relates each image plane to the reference camera. By warping \(I_{i}^{t}\) toward the reference image using the estimated Z, the values of \(I_{i}^{t}\) can be directly related to the reference image values, \(I_{0}^{t}\). Specifically, the required derivatives, \(\nabla I_{i}^{t}\), are then computed using the warped image. Let \(I_{i,w}^{t}\) be the warped image of \(I_{i}^{t}\). That is,
The warped image gradient is related to the original image by:
where J is the Jacobian matrix of the change of coordinates, \((x_{i}, y_{i}) \rightarrow (x, y)\). Therefore, the original image derivatives are obtained by multiplying Eq. (28) by \(J^{-1}\), leading to:
The Jacobian matrix, J, is obtained by computing the derivatives of \(\textbf{p}_{i}\) with respect to x and y. In particular, J involves the derivatives of Z(x, y), namely \(\partial_{x}Z\) and \(\partial_{y}Z\). Following the explanation above, \(\nabla I_{i,w}^{t+1}\) is similarly computed. In this case, the Jacobian matrix, J, additionally involves the derivatives of u, v, and w with respect to the reference camera coordinates.
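A per-pixel sketch of the unwarping step (the values are hypothetical; the transpose convention for J depends on how Eq. (28) is written, so here J is defined so that the warped gradient is exactly J times the original one):

```python
import numpy as np

def unwarp_gradient(grad_w, J):
    """Recover the original image gradient from the warped-image gradient.

    grad_w : gradient of the warped image at a pixel, shape (2,)
    J      : 2x2 Jacobian of the change of coordinates at that pixel
    With the convention grad_w = J @ grad, the original gradient is
    grad = J^{-1} @ grad_w, computed here by solving the 2x2 system.
    """
    return np.linalg.solve(J, grad_w)

# Hypothetical values: a pure 2x coordinate scaling halves gradients,
# so unwarping must halve the warped gradient again.
J = np.diag([2.0, 2.0])
grad_w = np.array([0.5, 1.0])
grad = unwarp_gradient(grad_w, J)   # -> [0.25, 0.5]
```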
Appendix C: Linearization
This appendix describes the linearization of the resulting Euler-Lagrange equations and the numerical approximations used. At each pyramid level, a linear system of equations is obtained, and small increments in the 3D unknowns, dZ and dV, are estimated. The total solution, Z + dZ and V + dV, is then used to initialize the next finer level (see Sect. 2.3.2).
Considering Eqs. (12)–(13), there are two sources of nonlinearity:

1. the nonlinearized data term;

2. the nonquadratic cost function Ψ.
Following the numerical approach suggested by Brox et al. (2004), two nested fixed-point iterations are used at each pyramid level to remove the nonlinearity.
The outer iteration removes the nonlinearity resulting from the nonlinear data term, using a fixed-point iteration on Z and V. Let k be the outer iteration index. The solution at the (k+1)th iteration is composed of the previous solution and small, unknown increments. That is, \(Z^{k+1} = Z^{k} + dZ^{k}\) and \(\textbf{V}^{k+1} = \textbf{V}^{k} + d\textbf{V}^{k}\), where \(d\textbf{V}^{k} = (du^{k}, dv^{k}, dw^{k})^{T}\).
The first step toward linearization is approximating the nonlinear expression given in Eq. (11) using a first-order Taylor expansion. We use \(\Delta_{i}^{k},~\widehat{\Delta}_{i}^{k}\) and \(\Delta_{i}^{t, k}\) to denote the expressions given in Eq. (11) using the fixed values \(Z^{k}\) and \(\textbf{V}^{k}\). That is,
where \(\textbf{p}_{i}^{k} = \mathit{Proj}(\textbf{P}^{k}, M^{i})\) and \(\textbf{P}^{k}\) is given by placing \(Z^{k}\) in Eq. (5). The expressions for \(\widehat{\textbf{p}}_{i}^{k}\) and \(\widehat{\textbf{P}}^{k}\) are analogously given. Using these notations, the first-order Taylor expansions for these expressions are given by:
Equation (31) is computed by using the first order Taylor expansion for the following expressions:
where \(\textbf{P}^{k+1} = \textbf{P}^{k} + d\textbf{P}^{k}\) is given by placing \(Z^{k} + dZ^{k}\) in Eq. (5). Similarly, \(\widehat{\textbf{P}}^{k+1} = \widehat{\textbf{P}}^{k} + d\widehat{\textbf{P}}^{k}\), where \(d\widehat{\textbf{P}}^{k} = d\textbf{P}^{k} + d\textbf{V}^{k}\). The computation of the image derivatives with respect to the 3D unknowns is detailed in Appendix B.
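As a sketch of the expansion (assuming the data term compares intensities at the mapped points, and using the derivative notation of Appendix B):

```latex
I_i\bigl(\textbf{p}_i^{k+1}, t\bigr)
  \approx I_i\bigl(\textbf{p}_i^{k}, t\bigr) + \partial_Z I_i^{t}\, dZ^{k},
\qquad
I_i\bigl(\widehat{\textbf{p}}_i^{k+1}, t+1\bigr)
  \approx I_i\bigl(\widehat{\textbf{p}}_i^{k}, t+1\bigr)
  + \partial_Z I_i^{t+1}\, dZ^{k}
  + \partial_u I_i^{t+1}\, du^{k}
  + \partial_v I_i^{t+1}\, dv^{k}
  + \partial_w I_i^{t+1}\, dw^{k}.
```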
Therefore, deriving the associated Euler-Lagrange equations with respect to the unknown increments \(dZ^{k}\) and \(du^{k}\) results in:
The dependency of the above two equations on the increments, \(dZ^{k}\) and \(du^{k}\), is obtained by substituting Eq. (31) into \(\Delta_{i}^{k+1}\), \(\Delta_{i}^{t, k+1}\), and \(\widehat{\Delta}_{i}^{k+1}\). The equations for \(dv^{k}\) and \(dw^{k}\) are similar to Eq. (35).
Applying the above approximations (Eq. (31)), the resulting Euler-Lagrange equations form a nonlinear system of equations in the unknowns \(dZ^{k}\) and \(d\textbf{V}^{k}\). The remaining nonlinearity originates from Ψ′. Therefore, an additional fixed-point iteration loop over the Ψ′ expressions is performed. Finally, after standard discretization of the derivatives, a linear system of equations results; its solution is obtained by applying the successive over-relaxation (SOR) method.
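The SOR solver itself is standard; a minimal dense sketch (the coefficients below are hypothetical, and a practical implementation would exploit the sparsity of the discretized system):

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-10, max_iter=10_000):
    """Successive over-relaxation for A x = b (A with nonzero diagonal).

    Each sweep blends the Gauss-Seidel update with the current value via
    the relaxation factor omega; 1 < omega < 2 typically accelerates
    convergence for the diagonally dominant systems arising from
    discretized Euler-Lagrange equations.
    """
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # Sum over already-updated and not-yet-updated neighbors
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

# Small diagonally dominant system (hypothetical coefficients):
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = sor(A, b)   # converges to the exact solution (1/11, 7/11)
```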
Cite this article
Basha, T., Moses, Y. & Kiryati, N. Multi-view Scene Flow Estimation: A View Centered Variational Approach. Int J Comput Vis 101, 6–21 (2013). https://doi.org/10.1007/s11263-012-0542-7