Stereoscopic 3D from 2D video with super-resolution capability

https://doi.org/10.1016/j.image.2008.07.004

Abstract

This paper presents a new approach for the generation of super-resolution stereoscopic and multi-view video from monocular video. Such multi-view video is used, for instance, with multi-user 3D displays or auto-stereoscopic displays with head-tracking to create a depth impression of the observed scenery. Our approach is an extension of the realistic stereo-view synthesis (RSVS) approach, which is based on structure-from-motion techniques and image-based rendering to generate the desired stereoscopic views for each point in time. Subjective quality measurements with 25 real and 3 synthetic sequences were carried out to test the performance of RSVS against a simple time-shift and depth-image-based rendering (DIBR). Our approach significantly enhances the stereoscopic depth perception and gives a more realistic impression of the observed scenery. Simulation results applying super-resolution show that the image quality can be further improved by reducing motion blur and compression artifacts.

Introduction

Extending visual communication to the third dimension by providing the user with a realistic depth perception of the observed scenery, instead of flat 2D images, has been investigated for decades. Recent progress in related research areas may enable various 3D applications and systems in the near future [39]. In particular, 3D display technology is maturing and entering professional and consumer markets. Often the content is created directly in a suitable 3D format. On the other hand, the conversion of existing 2D content into super-resolution 3D is important for content owners; movies, for instance, may be reissued in 3D in the future.

Many fundamental algorithms have been developed to reconstruct 3D scenes from monocular video sequences [1], [4], [5], [6], [9], [10], [17], [18], [19], [20], [21], [22], [26], [29], [30], [36], [37], [38], [41], [42], [43], [45], [49]. These algorithms can roughly be divided into two categories: methods that tend to create a complete 3D model of the captured scene [1], [17], [20], [30], [36], [41], [42], [43], [45] and methods that just render stereoscopic views [4], [5], [6], [9], [10], [18], [19], [21], [22], [26], [29], [37], [38], [49].

Available structure-from-motion (SfM) techniques from the first category estimate the camera parameters and sparse 3D structure quite well, but they fail to provide the dense and accurate 3D modeling that is necessary to render high-quality views.

For the second category, depth-image-based rendering (DIBR) [4], [5], [6], [10], [18], [29], [49] seems to be the most promising technique, both for stereoscopic view synthesis and for transmission in 3D-TV broadcast systems [7], [31]. DIBR approaches render new virtual views via dense depth maps for each frame of the sequence by shifting image pixels according to their assigned depth. On the other hand, dense depth estimation is still an error-prone task and computationally very expensive. In [10], a semi-automatic approach for dense depth estimation was introduced, using a machine learning algorithm (MLA) for assigned keyframes and depth tweening between these frames.
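To illustrate the pixel-shifting step, here is a minimal NumPy sketch of DIBR for a single-channel image (not the exact formulation of the cited works; the focal length and baseline are hypothetical values, and disoccluded pixels are simply marked as holes):

```python
import numpy as np

def dibr_shift(image, depth, baseline=0.05, focal=500.0):
    """Render a virtual view by shifting each pixel horizontally by a
    disparity derived from its depth (parallel stereo geometry assumed).
    Disoccluded pixels remain holes (-1) and must be filled afterwards."""
    h, w = image.shape
    virtual = np.full((h, w), -1.0)
    # Closer points (small depth) get a larger disparity.
    disparity = np.round(focal * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xs = x - disparity[y, x]
            if 0 <= xs < w:
                virtual[y, xs] = image[y, x]
    return virtual
```

The holes left at disocclusions are exactly the regions that DIBR has to inter- or extrapolate, which the approach presented in this paper avoids by drawing on neighboring original views.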

Other approaches, e.g. [9], [26], use motion parallax or spatio-temporal interpolation to generate the desired stereoscopic views. Ross [37] introduced a very simple but (for some video sequences) effective technique for stereoscopic depth impression using binocular delay. Finally, in [38] planar transformations on temporally neighboring views are utilized to virtually imitate a parallel stereo camera rig. In any case, however, time consistency along the sequence depends heavily on the 3D scene, since a stereo rig is not correctly modeled.

In this paper, we present a new approach for the generation of super-resolution stereo- and multi-view video from monocular video based on realistic stereo-view synthesis (RSVS) [19]. It combines both the powerful algorithms of SfM [17] and the idea of image-based rendering (IBR) [25] to achieve photo-consistency without relying on dense depth estimation.

Most available 3D display systems rely on two views (stereo video) to create a depth impression. However, more advanced systems use multiple views (e.g. eight views showing the same scene from different viewpoints). The presented algorithm is applicable to generating stereo video in its basic mode [19], but it is also capable of generating multi-view video [21]. We will show that the approach is quite suitable for converting existing 2D video material into multi-view video with higher resolution [22]. To our knowledge, this is the first time that an approach for the generation of super-resolution multi-view video from monocular video has been presented.

The proposed technique is performed in several stages. First, sparse 3D structure and camera parameters are estimated with SfM for the monocular video sequence (dark grey cameras in Fig. 1). Then, for each original camera position (white in Fig. 1) a corresponding multi-view set is generated (light grey in Fig. 1). This is done by estimating planar homographies (perspective transformations) to temporally neighboring views of the original camera path. Surrounding original views are used to generate the multiple virtual views with IBR. Hence, the computationally expensive calculation of dense depth maps is avoided. Moreover, the occlusion problem is almost nonexistent: whereas DIBR techniques always have to inter- or extrapolate disoccluded parts of the images when shifting pixels according to their depth values, our approach utilizes the information from close views of the original camera path, i.e. occluded regions become visible within the sequence.
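The homography-based warping step can be sketched as follows, in a simplified single-channel NumPy version with bilinear interpolation; in the actual pipeline the homography H (mapping virtual to source coordinates) would be derived from the SfM estimates:

```python
import numpy as np

def warp_homography(src, H, out_shape):
    """Backward-warp a source view into a virtual view through the planar
    homography H (virtual -> source), using bilinear interpolation.
    Pixels that map outside the source view stay zero."""
    h, w = out_shape
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            u, v, s = H @ np.array([x, y, 1.0])
            u, v = u / s, v / s
            u0, v0 = int(np.floor(u)), int(np.floor(v))
            # Keep one pixel of margin for the bilinear neighborhood.
            if 0 <= u0 < src.shape[1] - 1 and 0 <= v0 < src.shape[0] - 1:
                a, b = u - u0, v - v0
                out[y, x] = ((1 - a) * (1 - b) * src[v0, u0]
                             + a * (1 - b) * src[v0, u0 + 1]
                             + (1 - a) * b * src[v0 + 1, u0]
                             + a * b * src[v0 + 1, u0 + 1])
    return out
```

A color image would simply repeat this per channel, and a practical implementation would vectorize the double loop.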

In the extended mode, the so-called super-resolution mode, the temporally neighboring views are utilized to reconstruct a virtual stereo frame with a desired resolution higher than the original one. For pixel warping, each pixel in the super-resolution stereo frame should be located as close as possible to the pixel raster of one of the neighboring views, i.e. the low-pass filtering effect caused by bilinear warping is reduced. Another benefit of this approach, as will be shown in Section 5, is the possible reduction of blur and coding artifacts. A complete overview of the proposed conversion system is illustrated in Fig. 2.
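The selection rule can be illustrated with a small hypothetical helper (not the paper's exact implementation): among the neighboring views into which a super-resolution pixel projects, choose the one whose sub-pixel position lies closest to the integer raster, since warping from a near-integer position needs the least interpolation and so incurs the least bilinear low-pass filtering.

```python
import numpy as np

def pick_nearest_raster(candidates):
    """candidates: list of (view_index, u, v) giving the sub-pixel position
    at which the super-resolution pixel maps into each neighboring view.
    Returns the index of the view whose position lies closest to the
    integer pixel raster, i.e. the one needing the least interpolation."""
    def raster_distance(u, v):
        return np.hypot(u - np.round(u), v - np.round(v))
    return min(candidates, key=lambda c: raster_distance(c[1], c[2]))[0]
```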

The organization of this paper is as follows: the next section describes the fundamentals of SfM as an initial step to estimate the camera path and to define virtual stereo cameras with constant parallax over time. Sections 3 and 4 outline the RSVS approach for stereo- and multi-view synthesis and its super-resolution extension, which are the main contributions of our work. Simulation results are presented in Section 5. In Section 6, psycho-visual experiments are carried out to evaluate the performance of RSVS against standard conversion methods. The limitations of our approach are stated in Section 7. Finally, in Section 8, the paper concludes with a summary and a discussion.


SfM fundamentals

The general intention of SfM is the estimation of the external and internal camera parameters and the structure of a 3D scene relative to a reference coordinate system. SfM requires a relative movement between a static scene and the camera.

Finding relations between the views in the video sequence is the initial step of our reconstruction process. The geometric relationship, also known as epipolar geometry, can be estimated with a sufficient number of feature correspondences between the views.
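As a concrete sketch of this estimation step, the following implements the classic normalized eight-point algorithm for the fundamental matrix in NumPy (a textbook method, not necessarily the exact estimator of our pipeline; a robust implementation would wrap it in a RANSAC loop to reject outlier correspondences):

```python
import numpy as np

def fundamental_matrix(x1, x2):
    """Normalized eight-point estimate of the fundamental matrix F from
    N >= 8 point correspondences x1 <-> x2 (N x 2 arrays), so that
    [x2, 1]^T F [x1, 1] = 0 for every correspondence."""
    def normalize(pts):
        # Translate to the centroid and scale to mean distance sqrt(2).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
        T = np.array([[scale, 0, -scale * mean[0]],
                      [0, scale, -scale * mean[1]],
                      [0, 0, 1.0]])
        ph = np.column_stack([pts, np.ones(len(pts))]) @ T.T
        return ph, T
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence contributes one row of the system A f = 0.
    A = np.column_stack([p2[:, :1] * p1, p2[:, 1:2] * p1, p1])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint (a valid F has a null space: the epipole).
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization and fix the (arbitrary) overall scale.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

Once F is known, the camera matrices and sparse 3D structure can be recovered up to the usual projective/metric ambiguities.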

Realistic stereo- and multi-view synthesis

Once 3D structure and camera path are determined, multiple virtual cameras can be defined for each frame of the original video sequence as depicted in Fig. 1. A white camera corresponds to an original image of a video sequence and the light grey cameras represent its corresponding multiple virtual views. With the principles of IBR pixel values from temporal neighboring views can be projected to their corresponding positions in the virtual views. Thus, each of the virtual images is just a

Super-resolution stereo- and multi-view synthesis

The previous section described our fundamental RSVS approach to convert a monocular video sequence into a stereo- or multi-view sequence for auto-stereoscopic displays or multi-user 3D displays. Fig. 4 demonstrates that in general more than one view is needed to set up a virtual stereo frame. Thus, the additional views can be used to increase the resolution of the stereo frame as well.

Spatial image super-resolution is a very intensively studied topic because it improves the inherent resolution

Simulation results

Two example figures show the performance of the super-resolution mode of our approach. In Fig. 7, a super-resolution virtual stereo frame (size 1080×864 pixels) of the Statue sequence is presented. Three close-ups highlight the difference between the super-resolution frame and an up-sampled frame using Lanczos filtering (the original size was 720×576 pixels). Fig. 7d shows some typical artifacts when dealing with interlaced PAL video and up-sampling: sawtooth patterns can be noticed along edges

Psycho-visual experiments

Two subjective quality tests were carried out to assess the performance of RSVS against standard 2D/3D conversion approaches like DIBR and time-shift. The test conditions are standard and conform to the ITU recommendations [16]. In the first session of the experiment, 15 subjects were asked to rate the quality impression of 25 real test sequences. Since no ground-truth data were available (the test sequences were originally captured with a single camera), a single stimulus (SS) method was

Limitations

The presented RSVS approach has some limitations. The most important one is that the scene has to be static, i.e. moving objects within the scene would disturb the depth perception. Furthermore, there are restrictions on camera motion: if the camera moves only in a forward or backward direction, this approach for virtual view synthesis fails. The case of camera movement in the up and down direction can be handled by rotating the frames by 90°. A final limitation is that a larger screen

Summary and conclusions

This paper presented a new approach for the generation of super-resolution stereo- and multi-view video from monocular video, i.e. we extended our previous work on RSVS with a super-resolution mode. To our knowledge, this is the first time that the generation of super-resolution multi-view video from monocular video has been addressed. Thus, the algorithm is suitable for offline content creation for conventional and advanced 3D display systems with minimal user assistance.

The main advantage of this approach

Acknowledgment

The authors would like to thank Aljoscha Smolic, Peter Kauff, Ingo Feldmann and Christoph Fehn from FhG-HHI for the fruitful discussions on 2D/3D conversion issues.

References (49)

  • R. Hartley et al., Triangulation, Comput. Vision Image Understanding (1997)
  • M. Pollefeys et al., Automated reconstruction of 3D scenes from sequences of images, ISPRS J. Photogrammetry Remote Sensing (2000)
  • R. Szeliski et al., Recovering 3D shape and motion from image streams using non-linear least-squares, J. Visual Commun. Image Representation (1994)
  • P. Beardsley, P.H.S. Torr, A. Zisserman, 3D model acquisition from extended image sequences, in: Proceedings of...
  • S. Borman et al., Super-resolution from image sequences—a review, Midwest Symp. Syst. Circuits (1998)
  • J.-Y. Bouguet, Pyramidal implementation of the Lucas Kanade feature tracker, Intel Corporation, Microprocessor Research...
  • S. Curti, D. Sirtori, F. Vella, 3D effect generation from monocular view, in: Proceedings of the International...
  • J.-F. Evers-Senne, A. Niemann, R. Koch, Visual reconstruction using geometry guided photo consistency, in:...
  • C. Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, in: Proceedings...
  • C. Fehn et al., An evolutionary and optimised approach on 3D-TV, Int. Broadcast. Conv. (IBC) (2002)
  • M. Fischler et al., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
  • B.J. Garcia, Approaches to stereoscopic video based on spatio-temporal interpolation, in: Proceedings of SPIE:...
  • P. Harman, J. Flack, S. Fox, M. Dowley, Rapid 2D to 3D conversion, in: Proceedings of SPIE: Stereoscopic Displays and...
  • C. Harris, M. Stephens, A combined corner and edge detector, in: Fourth Alvey Vision Conference, 1988, pp....
  • R. Hartley et al., Multiple View Geometry (2003)
  • E. Imre, S. Knorr, A.A. Alatan, T. Sikora, Prioritized sequential 3D reconstruction in video sequences of dynamic...
  • M. Irani, S. Peleg, Super resolution from image sequences, in: Proceedings of International Conference on Pattern...
  • ITU, Methodology for the subjective assessment of the quality of television pictures, Recommendation BT.500-10,...
  • T. Jebara et al., 3D structure from 2D motion, IEEE Signal Process. Mag. (1999)
  • K.T. Kim, M. Siegel, J.Y. Son, Synthesis of a high-resolution 3D stereoscopic image pair from a high-resolution...
  • S. Knorr, T. Sikora, An image-based rendering (IBR) approach for realistic stereo view synthesis of TV broadcast based...
  • S. Knorr, E. Imre, B. Özkalayci, A.A. Alatan, T. Sikora, A modular scheme for 2D/3D conversion of TV broadcast, in: 3rd...
  • S. Knorr et al., From 2D- to stereo- to multi-view video, 3DTV-CON (2007)
  • S. Knorr et al., Super-resolution stereo- and multi-view synthesis from monocular video sequences, 3-D Digital Imaging and Modeling (3DIM) (2007)

    This work was developed within 3DTV (FP6-PLT-511568-3DTV), a European Network of Excellence funded under the European Commission IST FP6 programme.
