1 Introduction

Image stitching is a widely studied problem in computer vision and graphics: it generates a single wide field-of-view image from a set of narrow field-of-view images. Several warping models, including homography-based warps [1, 2], spatially varying warping models [3, 4, 5], hybrid models [6, 7, 8, 9], and parallax-tolerant models [10, 11, 12], as well as image stitching software such as Adobe Photoshop and AutoStitch, fail to perform well on non-ideal input data. The main challenges for any stitching algorithm are parallax error, occlusions, motion blur, and the presence of moving objects. In particular, for stitching frames of an unconstrained video (e.g., shaky or jittery footage), state-of-the-art techniques fail to produce satisfactory results. The reason is that image stitching methods assume specific underlying motion models, which makes the task highly challenging in the presence of large parallax.

Common image stitching approaches follow the pipeline of estimating transformations between the images, aligning the images with a warping model, and stitching them using seam-based or blending techniques. We present a novel mesh-based warping model, termed “GreenWarps”, that aligns the images using Green coordinates [13] and a demons-based diffeomorphic warping model [14]. The GreenWarps model consists of two stages, namely, pre-warping and “DiffeoMeshes”. The first stage produces a global conformal mapping between the images to be stitched; conformal mappings induce no shear, thereby providing shape-preserving, distortion-free deformations. The second stage, termed “DiffeoMeshes”, performs a mesh deformation driven by semi-dense correspondences between the two images and refines the alignment obtained from the first stage. Both stages warp the deformed meshes using Green coordinates, instead of warping the images with computed transformations as in previous approaches. Since our method does not assume any motion model, it is robust to large parallax.

2 Proposed Framework

The steps of the proposed “GreenWarps” method are: (i) estimate SIFT correspondences, (ii) pre-warp based on Green coordinates, (iii) deform the mesh using DiffeoMeshes and warp based on Green coordinates, and (iv) blend the images to obtain the stitched result. Similar to spatially varying warps, GreenWarps performs a shape-preserving deformation of the mesh to align images to the reference image. Notably, our approach computes no transformation matrix at any point during alignment or warping, which ensures that it assumes no motion model. Warping in both stages (pre-warping and DiffeoMeshes) is performed using Green coordinates.
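For concreteness, step (i) can be realized with standard tools. The sketch below uses OpenCV's SIFT detector and Lowe's ratio test; the 0.75 ratio threshold and the function name sift_correspondences are illustrative choices, not values specified by the paper.

```python
import cv2
import numpy as np

def sift_correspondences(ref_gray, unaligned_gray, ratio=0.75):
    """Return matched point arrays (pts_u, pts_r) between U and R."""
    sift = cv2.SIFT_create()
    kp_r, des_r = sift.detectAndCompute(ref_gray, None)
    kp_u, des_u = sift.detectAndCompute(unaligned_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_u, des_r, k=2)     # match U against R
    pts_u, pts_r = [], []
    for m, n in knn:
        if m.distance < ratio * n.distance:       # Lowe's ratio test
            pts_u.append(kp_u[m.queryIdx].pt)
            pts_r.append(kp_r[m.trainIdx].pt)
    return np.float32(pts_u), np.float32(pts_r)
```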

Among the images to be stitched, we take one as the reference image (R) and the other as the unaligned image (U). The unaligned image is first divided into image grids, where each grid has 4 vertices. The pre-warping stage takes a \(2\times 2\) mesh grid of U. Every point \(X_k\) of the unaligned image is defined in terms of the Green coordinates [13] of its corresponding mesh grid as \(X_k = \phi _k(X_k)^TV_k+\psi _k(X_k)^TN_k\), where \(\phi _k(X_k), \psi _k(X_k)\) are the Green coordinate vectors associated with the 4 vertices and edges of the mesh grid containing the point \(X_k\), \(V_k\) is a vector of the 4 vertices, and \(N_k=[n(t_k^1)\ n(t_k^2)\ n(t_k^3)\ n(t_k^4)]\) is a vector of normals of the edges \(t_k^i\) of that grid. An as-similar-as-possible mesh deformation [3], based on the corresponding SIFT features, generates the deformed vertices \(\hat{V}\). The Green coordinates (for every pixel in the image) are first estimated from the initial mesh as derived in [15]. The image is then warped to the deformed vertices using these computed Green coordinates: the position of any point of the unaligned image in the pre-warped image, given the deformed vertices \(\hat{V}\) and updated normals \(\hat{N}\), is obtained as \(\hat{X}_k = \phi _k(X_k)^T\hat{V}_k+\psi _k(X_k)^Tm_k\hat{N}_k\), where \(m_k\) is the normalized edge length [13]. Warping based on Green coordinates, as in [13], yields a conformal mapping that preserves the shape of image structures. Thus, Green coordinates provide a natural transformation of the image for alignment without assuming any motion model. Perspective distortion, a problem in many previous approaches [10, 12, 16], is absent in our approach.
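A minimal numpy sketch of evaluating \(\hat{X}_k = \phi _k(X_k)^T\hat{V}_k+\psi _k(X_k)^Tm_k\hat{N}_k\) for one point is given below, assuming its Green coordinates \(\phi, \psi\) have already been computed from the initial mesh as in [15]. The counter-clockwise vertex ordering and the outward-normal convention are illustrative assumptions.

```python
import numpy as np

def edge_vectors(verts):
    """Edge vectors of a closed quad cage; verts: (4, 2), CCW order."""
    return np.roll(verts, -1, axis=0) - verts

def warp_point(phi, psi, v_orig, v_def):
    """phi, psi: (4,) Green coordinates of one point w.r.t. its quad;
    v_orig, v_def: (4, 2) original and deformed quad vertices."""
    e_o, e_d = edge_vectors(v_orig), edge_vectors(v_def)
    # m_k: ratio of deformed to original edge length (the "normalized
    # edge length" of [13]), which keeps the mapping conformal
    m = np.linalg.norm(e_d, axis=1) / np.linalg.norm(e_o, axis=1)
    # unit outward normals of the deformed edges (CCW convention assumed)
    n_hat = np.stack([e_d[:, 1], -e_d[:, 0]], axis=1)
    n_hat /= np.linalg.norm(n_hat, axis=1, keepdims=True)
    return phi @ v_def + psi @ (m[:, None] * n_hat)
```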

The second stage of our approach, termed DiffeoMeshes, refines the alignment by estimating a per-pixel displacement (spatial transformation) over the overlap region of the pre-warped and reference images. Let the overlap regions of the pre-warped and reference images be \(M_U\) and \(M_R\), respectively. A mesh deformation is then performed based on the estimated spatial transformation. The demons-based diffeomorphic transformation \(s\) is estimated by minimizing the following energy [14]:

$$\begin{aligned} E_{diff}(s) = Sim(M_U, M_R\circ s) + Reg(s) \end{aligned}$$
(1)

The similarity (correspondence) term is \(Sim(M_U, M_R\circ s) = \sum _{p=1}^L||M_U(p)-(M_R\circ s)(p)||_2^2\) and the regularization term is \(Reg(s) = \sum _{p=1}^L||\nabla s(p)||_2\), where \(\circ \) denotes the per-pixel spatial warping operation and \(L=|M_U|=|M_R|\), with \(|.|\) the cardinality function. Existing demons-based diffeomorphic registrations [14, 17, 18] use Gaussian smoothing for regularization. Our method instead uses TV-based regularization [19], which preserves edges while updating the transformation. The diffeomorphic transformation is obtained by iteratively alternating between minimizing the correspondence energy and the regularization energy.
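The alternating minimization of Eq. (1) can be sketched as a demons-style descent step on the similarity term followed by a TV smoothing of the displacement field. This is an additive-demons simplification (the diffeomorphic variant of [14] composes exponentiated velocity updates instead); denoise_tv_chambolle from scikit-image stands in for the TV regularizer of [19], and the iteration count and weight are illustrative, not values from the paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from skimage.restoration import denoise_tv_chambolle

def demons_tv(m_u, m_r, n_iter=50, tv_weight=0.1):
    """m_u, m_r: float overlap regions (H, W); returns displacement s."""
    h, w = m_u.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    s = np.zeros((2, h, w))                      # displacement field
    for _ in range(n_iter):
        # warp the reference by the current field: (M_R o s)(p)
        warped = map_coordinates(m_r, [ys + s[0], xs + s[1]], order=1)
        diff = m_u - warped                      # residual of Sim term
        gy, gx = np.gradient(warped)
        norm2 = gy**2 + gx**2 + diff**2 + 1e-8
        # demons force: descent step on the correspondence energy
        s[0] += diff * gy / norm2
        s[1] += diff * gx / norm2
        # TV regularization step (edge-preserving, unlike Gaussian)
        s[0] = denoise_tv_chambolle(s[0], weight=tv_weight)
        s[1] = denoise_tv_chambolle(s[1], weight=tv_weight)
    return s
```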

Let the mesh grid vertices in the second stage before and after deformation be \(\mathcal {V}\) and \(\hat{\mathcal {V}}\), respectively. DiffeoMeshes minimizes the energy \(E(\hat{\mathcal {V}})=E_d(\hat{\mathcal {V}})+w_sE_s(\hat{\mathcal {V}})\), where \(E_d\) is the data term and \(E_s\) is the smoothness term. The data term minimizes the distance between each measured point and its location interpolated in the mesh under the diffeomorphic transformation: \(E_d(\hat{\mathcal {V}}) =\sum _{p=1}^{N_d} ||s(p)||_2^2\), where \(N_d\) is the number of pixels selected from the overlap region of the pre-warped and reference images and \(s(p)\) is the diffeomorphic transformation at pixel p. Only edge pixels with exact matches are used for building these mesh constraints (semi-dense correspondences). \(E_s(\hat{\mathcal {V}})\) is the same smoothness term as in [3]; it penalizes the deviation of each deformed mesh grid from a similarity transformation of its input grid. The solution is obtained with a Jacobi-based linear solver. Once the deformed mesh vertices are obtained, the refined alignment is produced by warping with Green coordinates, as in the first stage. The aligned images are then blended using the multi-band blending method of [20].
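Because both energy terms are quadratic in the deformed vertices, minimizing \(E(\hat{\mathcal {V}})\) reduces to a sparse linear system \(A\hat{v}=b\). A minimal sketch of a Jacobi iteration for such a system is given below; assembling \(A\) and \(b\) from the data and smoothness terms is paper-specific and assumed done elsewhere.

```python
import numpy as np
import scipy.sparse as sp

def jacobi_solve(A, b, x0=None, n_iter=200):
    """A: sparse (n, n) with nonzero diagonal; returns approx. solution."""
    d = A.diagonal()                     # diagonal part D of A
    R = A - sp.diags(d)                  # off-diagonal remainder
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(n_iter):
        x = (b - R @ x) / d              # x_{k+1} = D^{-1} (b - R x_k)
    return x
```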

Fig. 1. Comparison of our method with (a) SPHP [8], (b) Parallax tolerant [10], (c) APAP [4], (d) SEAGULL [12]. In each example, the first column shows the input images, the second column the output of the competing method, and the third column the output of the proposed method. (Color figure online)

Table 1. Comparison of performance on two standard large-parallax datasets [10, 12] and one dataset consisting of frames from unconstrained videos.

3 Experimental Results

Experiments were performed on two parallax-tolerant image stitching datasets [10, 12] and a new dataset consisting of 2-3 frames from unconstrained videos. Parallax error and the presence of moving objects are the main challenges in these datasets. Our method is evaluated against the state-of-the-art methods [3, 4, 8, 21]. Alignment quality is measured by the mean geometric error (\(E_{mg}\)) and the correlation error (\(E_{corr}\)): \(E_{mg}\) is the average distance between corresponding feature points after alignment, and \(E_{corr}\) is the average of one minus the normalized cross-correlation (NCC) over local neighborhoods in the overlap region. Lower values indicate better performance. Table 1 reports the alignment errors averaged over each of the 3 datasets in comparison to the state-of-the-art methods; as seen in the table, our method outperforms them on every dataset. Qualitative results are shown in Fig. 1, comparing against the methods [4, 8, 10, 12]. The red boxes indicate regions with alignment errors, whereas the blue boxes show the corresponding regions accurately aligned. The superiority of the method is evident from both the qualitative and quantitative results.
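For reference, both error measures can be computed as sketched below; the 16-pixel window size for \(E_{corr}\) is an illustrative assumption, as the section does not specify the neighborhood size.

```python
import numpy as np

def mean_geometric_error(pts_a, pts_b):
    """E_mg: average distance between corresponding points, (N, 2) each."""
    return float(np.mean(np.linalg.norm(pts_a - pts_b, axis=1)))

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation of two equal-shape patches."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps)

def corr_error(img_a, img_b, win=16):
    """E_corr: mean of (1 - NCC) over local windows of the overlap."""
    h, w = img_a.shape
    errs = [1.0 - ncc(img_a[y:y + win, x:x + win],
                      img_b[y:y + win, x:x + win])
            for y in range(0, h - win + 1, win)
            for x in range(0, w - win + 1, win)]
    return float(np.mean(errs))
```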