Copyright © 2006 Elsevier B.V. All rights reserved.
Geometry-assisted image-based rendering for facial analysis and synthesis
Peter Eisert and Jürgen Rurainsky
Received 24 February 2006;
Abstract
In this paper, we present an image-based method for the tracking and rendering of faces. We use the algorithm in an immersive video conferencing system where multiple participants are placed in a common virtual room. This requires viewpoint modification of dynamic objects. Since hair and uncovered areas are difficult to model by pure 3-D geometry-based warping, we add image-based rendering techniques to the system. By interpolating novel views from a 3-D image volume, natural looking results can be achieved. The image-based component is embedded into a geometry-based approach in order to limit the number of images that have to be stored initially for interpolation. Also temporally changing facial features are warped using the approximate geometry information. Both geometry and image cube data are jointly exploited in facial expression analysis and synthesis.
Keywords: Facial animation; Image-based rendering; Model-based coding; Face tracking; 3-D image analysis and synthesis
Article Outline
- 1. Facial analysis and synthesis for model-based coding and virtual conferencing
- 2. 3-D model-based facial expression analysis and synthesis
- 3. Image-based tracking and rendering
- 3.1. Initialization of the image cube
- 3.2. Rendering of new frames
- 3.3. Representation of eye and mouth movements
- 3.4. Image-based motion estimation
- 3.5. Experimental results
- 4. Conclusions
- References
Image-based rendering is a technique that has received considerable interest in computer graphics for the realistic rendering of complex scenes. Instead of modeling shape, material, reflection of objects as well as light sources and light exchange with high accuracy and sophisticated physical models, image-based rendering synthesizes new views of a scene by interpolating among multiple images taken with one or multiple cameras. Examples of such approaches are lightfields [21] or concentric mosaics [25] and [30]. The use of real pictures leads to natural-looking scenes and allows the reproduction of fine structures (e.g., hair, fur, leaves) that are difficult to model with polygonal representations. Also, the rendering complexity is independent of the scene content since interpolation is performed on pixels instead of polygons. As a result, sophisticated scenes can be rendered naturally with limited computational complexity.
One drawback of image-based rendering, however, is the high demand on storage and memory capacity. In order to allow free navigation and to avoid rendering artifacts, a very large number of images has to be captured, stored, and used for interpolation. Datasets of hundreds of gigabytes overstrain even today's computers. However, in image-based rendering, it is possible to trade the number of required images against the accuracy of an additionally used approximate geometry model [31]. The more geometry information is used, the fewer images are needed for a given quality. Without geometry information, many images are required, whereas a highly accurate model is needed if only a few images are exploited. One extreme is a textured polygonal model with an accurate geometry and a single image as texture map. Other approaches that combine geometry and image information are, e.g., the lumigraph [15] and surface light fields [35].
Image-based rendering techniques are most often used for synthesizing new views from a static scene. In order to reduce the amount of data, the temporal dimension of the seven-dimensional plenoptic function [1] and other parameters in the scene are usually neglected. However, image-based rendering is not restricted to movements of a virtual camera but any degree of freedom can be represented by sampling from stored data.
In this paper, we use image-based techniques in order to realistically animate faces. Face animation, however, has traditionally been addressed by deforming and moving 3-D geometry models. A triangle mesh defines the person's shape [28] and [23] and texture mapping ensures visual quality [33] and [20]. With extensive use of computer graphics techniques, highly realistic head models can be realized [26]. Advantages of this approach are a compact description, which can be exploited in model-based video coding [14], [34] and [24], and simple facial animation capabilities by locally moving vertices. However, the realistic rendering of hair and the dynamic representation of facial features are not easy to achieve with pure geometry modeling.
Researchers have therefore tried for a long time to use previously captured images in order to increase the quality of the synthesized views. Especially the facial features like eyes and mouth have been modeled by images. An early approach in model-based coding is the clip-and-paste method [3], [34] and [7] where facial expressions are synthesized by copying templates of facial features from a codebook onto a 3-D model representing global head motion. All variations must be stored in a database which can grow significantly if visual artifacts and jitter shall be avoided. The number of templates can be reduced by adding model information to account for motion compensation. Active appearance models [8] and [16], e.g., describe facial feature changes by a combination of image movements and dynamic textures and are successfully employed for realistic speech driven visual synthesis [32], [22], [36] and [6]. Even more geometry information is exploited in morphable head models [4] which interpolate new views from a database of 3-D head scans specifying both geometry and texture information.
In contrast to the above mentioned approaches, we combine geometry warping with image-based rendering in order to describe global head motion and to render a correct outline even in presence of hair. In order to reduce the memory requirements, only head turning with the most dominant image changes is interpolated from a set of initially captured views, whereas other global head motions are represented with a geometry model. Similarly, the jaw movement which affects the silhouette of the person viewed from the side is also represented by geometry deformations. In contrast to [13], local expressions and motion of the mouth and eyes are directly extracted from the video, warped to the correct position using the 3-D head model, and smoothly blended into the global head texture. The additional use of geometry in image-based rendering severely restricts the number of images required but enables head rotation of the person as a postprocessing step in applications like virtual conferencing.
The remainder of the paper is structured as follows. First, we describe possible applications for the presented approach, which are later used as reference for experimental results. We then show the method for pure model-based facial analysis and synthesis, which is used for tracking of the face and initialization of the image-based dataset. In Section 3 we then present all extensions for image-based rendering and the modifications to the tracking algorithm that ensure robust estimation in the long run. Experimental results finally show the applicability and accuracy of the approach.
1. Facial analysis and synthesis for model-based coding and virtual conferencing
Although the algorithms for rendering and tracking can be used for any application related to facial animation like text-driven animation [27], man-machine interfaces, and avatar control [29], we focus in this context on the application of virtual conferencing which has some implications on the settings and the experiments made.
In virtual conferencing, multiple distant participants can meet in a virtual room as shown in Fig. 1. The use of a synthetic 3-D computer graphics scene allows more than two partners to join the discussions even if they are far apart from each other. Each partner is recorded by a single camera and the video objects are inserted into the artificial scene. In order to place multiple participants into a common room, viewpoint modification is necessary which requires information about 3-D structure of the person. This geometry information can be estimated from multiple frames or a priori knowledge is utilized by means of a rough generic head model, as it is done in this work.
A common method for representing head-and-shoulder video sequences three-dimensionally is model-based coding. For this technique, computer models of all objects and people in the scene are created. The models are then animated by motion and deformation parameters. Since temporal changes are described by a small parameter set, very efficient coding and transmission can be achieved. Head-and-shoulder scenes typical for video conferencing applications can for example be streamed at only a few kbit/s [12].
For head-and-shoulder video sequences, a textured 3-D head model describes the appearance of the individual. Facial motion and expressions are modeled by the superposition of elementary action units each controlled by a corresponding parameter. In MPEG-4, there are 66 different facial animation parameters (FAPs) [18] that specify the temporal changes of facial mimics. These FAPs are estimated from the video sequence, encoded and transmitted over a network. At the decoder, the parameters are used to animate the 3-D head model and synthesize the video sequence by rendering the deformed computer model. Fig. 2 shows the structure of a model-based codec [12].
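The superposition of elementary action units can be illustrated with a minimal sketch. It assumes, purely for illustration, that each action unit is a per-vertex displacement field scaled linearly by its parameter; the function name `apply_faps` and the array layout are hypothetical and not taken from the codec in [12].

```python
import numpy as np

def apply_faps(neutral_vertices, action_units, fap_values):
    """Deform a mesh by a weighted sum of action-unit displacement fields.

    neutral_vertices: (N, 3) vertex positions of the neutral head model
    action_units:     (K, N, 3) displacement of every vertex per action unit
                      (hypothetical linear model, assumed for illustration)
    fap_values:       (K,) one animation parameter per action unit
    """
    vertices = np.asarray(neutral_vertices, dtype=float)
    units = np.asarray(action_units, dtype=float)
    params = np.asarray(fap_values, dtype=float)
    # Sum over the K action units: result has shape (N, 3).
    return vertices + np.tensordot(params, units, axes=1)
```

With all parameters set to zero the neutral mesh is returned unchanged, which mirrors the role of the neutral expression in the text.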
The initialization of a model-based codec usually starts with the fitting of the 3-D head model to the first frame of the video sequence and the extraction of a texture map from the image. Often, a facial mask is adapted to the video content [19], [2], [9] and [17]. This mask represents the facial area without hair, ears, neck, or the back of the head and models local deformations caused by facial expressions. Areas outside the mask cannot be synthesized, which might lead to artificial-looking images, especially at the silhouette of the person. In our previous work [12], we have therefore used a complete 3-D head model and represented the fine structures of hair with billboarding techniques. These partly transparent planes, however, are only visually correct from near frontal views. Although more than the frontal facial area is modeled, the maximum range of head rotation is limited.
This limitation restricts applications in which full control of the head motion is desired. For the virtual conferencing scenario with participants meeting in a virtual room as shown in Fig. 1, the viewing direction must be changed afterwards in order to place the people correctly at a shared table. Moreover, head rotations can additionally be emphasized on small displays to make it easier to follow the communication between distant partners. Even head pose changes driven by audio signals for visualizing speaker attention are possible. With all these modifications, new facial areas become visible that are not captured in the current frame.
In order to render the people in the virtual room correctly, texture map updates have to be performed during the video sequence. The stitching of different image parts, however, requires an accurate head geometry. The more the head is turned from its original position, the more accurate the geometry has to be. Especially at the silhouette, errors become visible early, and hair with its intricate structure makes accurate modeling even more difficult.
These problems can be avoided with the combined image- and geometry-based tracking and animation technique described in Section 3. Fine structures like hair or uncovered areas are simply interpolated from previously recorded frames. In principle, all variations can be described by image-based rendering. Without loss of generality, we restrict the image-based representation in this scenario to head turning, since viewpoint modifications in the virtual room are mainly conducted around a vertical rotation axis. Also head turning shows most occlusions and uncovered areas in comparison to nodding or head rolling. This simplifies the stored image data to a one-dimensional array of images which can be easily handled by today's computers.
In the next section, the model-based tracking is described, which is used for initialization of this image-cube. It also serves as basis for the combined approach and needs only slight modifications in order to incorporate the image-based techniques into the facial analysis component.
2. 3-D model-based facial expression analysis and synthesis
In this section, we will briefly describe our original 3-D model-based coding system [12] and [11]. Although purely geometry-based, it is used in this work to represent global head motion (except for head turns) and jaw movements, which severely affect the silhouette if the person is viewed from the side. This technique uses a 3-D head model with a single texture map extracted from the first frame of the video sequence. All temporal changes are modeled by motion and deformations of the underlying triangle mesh of the head model according to the given facial animation parameters [18]. These parameters are estimated from the camera image using a gradient-based approach embedded into a hierarchical analysis-by-synthesis framework.
2.1. Gradient-based facial mimic analysis
The estimation of facial animation parameters for global and local head motion makes use of an explicit, parameterized head model describing shape, color, and motion constraints of an individual person. This model information is jointly exploited with spatial and temporal intensity gradients of the images. Thus, the entire area of the image showing the person of interest is used instead of dealing with discrete feature points, resulting in a robust and highly accurate system.
The image information is incorporated through the optical flow constraint equation (1), which relates the spatial intensity gradients and the displacement (dx, dy) at each pixel to the intensity difference between two frames.
This problem is under-determined since each equation has two new unknowns for the displacement coordinates. For the determination of the optical flow or motion field, additional constraints are required. Instead of using heuristic smoothness constraints, explicit knowledge from the head model about the shape and motion characteristics is exploited. A 2-D motion model can be used as an additional motion constraint in order to reduce the number of unknowns to the number of motion parameters of the corresponding model. The projection from 3-D to 2-D space is determined by camera calibration [10]. Considering in a first step only global head motion, both dx and dy are functions of six degrees of freedom: three rotations (Rx, Ry, Rz) and three translations (tx, ty, tz).
Combining this motion constraint with the optical flow constraint (1) leads to a linear system of equations for the unknown FAPs. Solving this linear system in a least-squares sense results in a set of facial animation parameters that determines the current facial expression of the person in the image sequence.
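The least-squares step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the motion model has already been linearized, so that the displacement of each pixel is a linear function of the parameter changes, and it assumes the sign convention gradient times displacement equals I minus I′. The function name `estimate_motion_parameters` and the Jacobian arrays `jac_dx`, `jac_dy` are hypothetical.

```python
import numpy as np

def estimate_motion_parameters(grad_x, grad_y, diff, jac_dx, jac_dy):
    """Solve the over-determined linear system for the parameter changes.

    grad_x, grad_y: (N,) spatial intensity gradients at N pixels
    diff:           (N,) intensity difference between the two frames
                    (sign convention I - I' is an assumption)
    jac_dx, jac_dy: (N, P) derivatives of the pixel displacements dx, dy
                    with respect to the P motion parameters (assumed given
                    by the linearized head and camera model)
    """
    # One row per pixel: optical flow constraint with dx, dy substituted
    # by the linear motion model, leaving only P unknowns in total.
    system = grad_x[:, None] * jac_dx + grad_y[:, None] * jac_dy
    params, *_ = np.linalg.lstsq(system, diff, rcond=None)
    return params
```

Because every pixel of the face contributes one equation while only a handful of parameters are unknown, the system is heavily over-determined, which is what makes the estimate robust against noise at individual pixels.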
2.2. Hierarchical framework
Since the optical flow constraint (1) is derived assuming the image intensity to be linear, it is only valid for small motion displacements between two successive frames. To overcome this limitation, a hierarchical framework can be used [12]. First, a rough estimate of the facial motion and deformation parameters is determined from sub-sampled and low-pass filtered images, where the linear intensity assumption is valid over a wider range. The 3-D model is motion compensated and the remaining motion parameter errors are reduced on frames having higher resolutions.
The hierarchical estimation can be embedded into an analysis-synthesis loop as shown in Fig. 3. In the analysis part, the algorithm estimates the parameter changes between the previous synthetic frame and the current frame I′ from the video sequence. The synthetic frame is obtained by rendering the 3-D model (synthesis part) with the previously determined parameters. This approximate solution is used to compensate for the differences between the two frames by rendering the deformed 3-D model at the new position. The synthetic frame now approximates the camera frame much better. The remaining linearization errors are reduced by iterating through different levels of resolution. By estimating the parameter changes with a synthetic frame that corresponds to the 3-D model, an error accumulation over time is avoided.
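The coarse-to-fine loop can be illustrated with a deliberately simplified sketch. It replaces the 3-D head model by a pure 2-D translation of a periodic test image, so that "rendering the synthetic frame" reduces to shifting the reference image; all function names are hypothetical, and block averaging stands in for the low-pass filtering and sub-sampling described in the text.

```python
import numpy as np

def downsample(img, levels):
    """Low-pass filter and sub-sample by 2x2 block averaging, `levels` times."""
    for _ in range(levels):
        h, w = img.shape
        img = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return img

def shift_periodic(img, dx, dy):
    """Shift a periodic image by (dx, dy) pixels via the Fourier shift theorem."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    phase = np.exp(-2j * np.pi * (fx * dx + fy * dy))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * phase))

def estimate_translation(synthetic, current):
    """One gradient-based least-squares step for the remaining translation."""
    gx = (np.roll(synthetic, -1, axis=1) - np.roll(synthetic, 1, axis=1)) / 2.0
    gy = (np.roll(synthetic, -1, axis=0) - np.roll(synthetic, 1, axis=0)) / 2.0
    system = np.stack([gx.ravel(), gy.ravel()], axis=1)
    rhs = (synthetic - current).ravel()
    step, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    return step

def coarse_to_fine(reference, current, levels=3, iters=3):
    """Estimate the translation, starting at the coarsest resolution."""
    d = np.zeros(2)  # (dx, dy) in full-resolution pixels
    for lvl in reversed(range(levels)):
        scale = 2 ** lvl
        ref_l, cur_l = downsample(reference, lvl), downsample(current, lvl)
        for _ in range(iters):
            # Synthesis: render the model with the parameters found so far.
            synthetic = shift_periodic(ref_l, d[0] / scale, d[1] / scale)
            # Analysis: estimate the remaining change against the camera frame.
            d += scale * estimate_translation(synthetic, cur_l)
    return d

def demo_image(n=64):
    """Smooth periodic test pattern (an arbitrary choice for the demo)."""
    y, x = np.mgrid[0:n, 0:n]
    return (np.sin(2 * np.pi * x / n) + np.cos(2 * np.pi * y / n)
            + 0.5 * np.sin(4 * np.pi * (x + y) / n))
```

Each iteration compares the camera frame against a freshly synthesized frame rather than against the previous camera frame, which is the property the text credits with avoiding error accumulation over time.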
2.3. Experimental results
In this section, some results for the model-based facial expression analysis are presented. A generic head model is adapted to the first frame of a CIF video sequence by varying shape parameters. A texture map is also extracted from this image. For each new frame, a set of 19 facial animation parameters and four motion parameters for the body are estimated using the proposed method. These parameters are transmitted and deform a generic head model in order to model the facial motion. The upper left of Fig. 4 shows an original frame of this sequence; on the right-hand side the corresponding synthesized view from the head model is depicted. The lower left image illustrates the triangle mesh representing the geometry of this model. As long as the viewing direction is similar to the original camera orientation, synthesized images match the original ones quite accurately. However, if the head model is rotated afterwards in order to simulate viewpoint modifications, the silhouette of the model shows distortions due to the planar approximation of hair by billboards. This is depicted in the lower right of Fig. 4, where the head is rotated by 20° compared to the correct orientation.
Fig. 4. Upper left: One original frame of sequence Peter. Upper right: Textured 3-D head model with FAPs extracted from the original frame. Lower left: Wireframe representation. Lower right: Synthesized frame with the head rotated an additional 20 degrees compared to the original, showing silhouette artifacts.
3. Image-based tracking and rendering
In this section, we describe an extension of the pure geometry-based estimation and rendering of Section 2. By adding image-based interpolation techniques, the maximum range of head rotation can be broadened while preserving the correct outline, even in the presence of hair. In contrast to other image-based techniques in facial animation like active appearance models [8] and [16] that describe local features like mouth or eyes by a set of images, we use the captured set of video frames to realistically render the non-deformable parts of the head outside the face. In order to keep the number of images used for image-based interpolation low, we only capture the one degree of freedom related to head turning. Other global head movements like pitch, roll or head translation, which usually show less variation, are modeled by geometry-based warping as described in Section 2. However, for other applications, more degrees of freedom can be interpolated from images in exactly the same way. In that case, a multi-dimensional array of images has to be stored. The selection of which frames to keep for interpolation can be derived directly from the tracker information and the desired sampling density.
3.1. Initialization of the image cube
For the initialization of the algorithm, the user has to turn the head to the left and the right as shown in Fig. 5. This way, we capture the appearance of the head from all sides for later interpolation. For simplification, we assume that a neutral expression is kept during this initialization phase; at least no expression altering the silhouette like opening of the jaw is permitted. The person is then segmented from the background and all these images are collated in a 3-D image cube with two axes representing the X- and Y-coordinate of the images. The third axis of the image cube mainly represents the rotation angle Ry which need not be equidistantly sampled due to variations in the head motion.
Fig. 5. Initial sequence with head rotation exploited for image-based rendering of new views.
For each of these frames, the rotation angle needs to be determined approximately using the a priori knowledge of the end position of almost ±90°. For that purpose, the global motion is estimated using the approach described in Section 2. The result is a parameter set for each frame specifying the six degrees of freedom with the main component being head rotation around the y-axis. Fig. 6 shows a result for 110 frames of a video sequence, where the user turns the head around the vertical axis. A parameter of one corresponds to a rotation of 90°. From this estimate, it can easily be seen that head turning is the dominant motion but that other movements are additionally present and need to be considered for the final pose synthesis.
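Collating the segmented frames into an image cube indexed by rotation angle might look like the following sketch. The function name `build_image_cube` and the axis order (angle, Y, X) are assumptions made for illustration; the scaling of the rotation parameter, where a value of one corresponds to 90 degrees, follows the text.

```python
import numpy as np

def build_image_cube(frames, ry_params):
    """Stack segmented frames into a cube sorted by head rotation.

    frames:    sequence of equally sized images, one per initialization frame
    ry_params: estimated rotation parameter per frame, where a value of 1
               corresponds to a rotation of 90 degrees
    Returns (cube, angles_deg); cube[k] is the frame captured at angles_deg[k].
    The angles are kept explicitly because they need not be equidistant.
    """
    ry = np.asarray(ry_params, dtype=float)
    order = np.argsort(ry)                      # sort frames along the Ry axis
    cube = np.stack([np.asarray(frames[i]) for i in order], axis=0)
    angles_deg = 90.0 * ry[order]
    return cube, angles_deg
```

Keeping the angle array alongside the cube avoids resampling the third axis, at the cost of a lookup whenever a frame has to be selected.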
With this parameter set, the position and orientation of the triangle mesh in each frame are also known. For the shape adaptation, only the facial area responsible for modeling facial expressions needs to be reasonably accurate. The outline at the top and back of the head, which shows up if the head is turned, can be approximate since the image content recovers the details. It only needs to be ensured that the 3-D model covers the entire segmented person. Alpha blending is used to show a detailed outline even with a rough geometry model. The wireframe for this model is illustrated in Fig. 7.
3.2. Rendering of new frames
The rendering of new frames is performed by image-based interpolation combined with geometry-based warping. Given a set of facial animation parameters, the frame of the image cube having the closest value of head rotation is selected as reference frame for warping. Thus, the dominant motion changes are already represented by a real image without any synthetic warping. Deviations of the desired global motion parameters from the stored values of the initialization step are compensated using 3-D geometry. This combination of geometry warping with image-based interpolation allows a very flexible trade-off between accuracy and size of the image cube. In principle, a single texture is sufficient if the underlying generic head model is very precise. At the other end with a very large number of images, no geometry is required at all in order to interpolate natural pose modifications. In our case, we combine a limited number of frames (about 100) with a very rough geometry model used for geometry-based interpolation between those views. An analysis of the trade-off between number of images and depth accuracy can be found in [5].
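The selection of the reference frame can be sketched in a few lines. The function name `select_reference_frame` is hypothetical; the sketch only covers the lookup along the rotation axis and returns the remaining rotation offset, which in the described system would be compensated by geometry-based warping together with the other motion parameters.

```python
import numpy as np

def select_reference_frame(cube_angles_deg, target_ry_deg):
    """Pick the stored frame whose head rotation is closest to the target.

    cube_angles_deg: rotation angle of every frame in the image cube
    target_ry_deg:   desired head rotation for the frame to be rendered
    Returns (index, residual_deg), where residual_deg is the deviation left
    over for geometry-based warping.
    """
    angles = np.asarray(cube_angles_deg, dtype=float)
    index = int(np.argmin(np.abs(angles - target_ry_deg)))
    residual_deg = float(target_ry_deg - angles[index])
    return index, residual_deg
```

The denser the cube is sampled, the smaller the residual becomes, which is the trade-off between cube size and geometry accuracy discussed above.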
Head translation and head roll can be addressed by pure 2-D motion; only head pitch needs some depth-dependent warping. As long as the rotation angles are small, which is true in most practical situations, the quality of the geometry can be rather poor. Local deformations due to jaw movements are also represented here by head model deformations, as in the original model-based approach of Section 2. In order to combine both sources, alpha blending is used to smoothly blend between the warped image and the 3-D model.
3.3. Representation of eye and mouth movements
Realistic rendering of moving eyes and mouth is difficult to achieve. In this paper, we therefore use the original image data from the camera to achieve realistic animation of face features. The area around the eyes and the mouth is cut out from the camera frames, warped to the correct position of the person in the virtual scene using the 3-D head model, and smoothly merged into the synthetic representation using alpha mapping. This process requires knowledge of the exact position of eyes and mouth in the original video to prevent jitter of facial features. We use the model-based motion estimation scheme described in Section 2 in order to accurately track the facial features over time. For the tracking, realistic hair is not required and the restricted motion of a person looking into a camera reduces the demands on a highly accurate 3-D model for that purpose. Once the features are localized, the corresponding image parts are cut out and combined with the previous steps.
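Merging a warped camera patch into the synthetic frame with an alpha map amounts to a per-pixel weighted average. The sketch below assumes the patch has already been warped into the coordinate frame of the synthetic image; the function name `alpha_blend` is hypothetical.

```python
import numpy as np

def alpha_blend(synthetic, patch, alpha):
    """Blend a warped camera patch into the synthetic frame.

    synthetic, patch: images of identical shape, (H, W) or (H, W, C)
    alpha:            (H, W) map in [0, 1]; 1 keeps the camera patch,
                      0 keeps the synthetic frame, values in between
                      give a smooth transition at the region border
    """
    synthetic = np.asarray(synthetic, dtype=float)
    patch = np.asarray(patch, dtype=float)
    alpha = np.clip(np.asarray(alpha, dtype=float), 0.0, 1.0)
    if synthetic.ndim == 3 and alpha.ndim == 2:
        alpha = alpha[..., None]   # broadcast one alpha over all color channels
    return alpha * patch + (1.0 - alpha) * synthetic
```

An alpha map that falls off gradually towards the border of the eye and mouth regions hides small registration errors between the two sources.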
Thus, three different techniques are used for different facial areas. The texture of the main head parts, except for the eye and mouth regions, is taken from the image cube representing the person for all possible head turns. 3-D model-based warping is then applied to model the other five global head movements (Rx, Rz, tx, ty, tz) as well as the opening of the jaw. Finally, local eye and mouth motion is represented by image information captured at the current time instant by a video camera. This way, natural-looking images can be synthesized showing facial expressions and a correct silhouette even for large modifications of the head rotation angles.
3.4. Image-based motion estimation
Since two different techniques (image-based and geometry-based interpolation) are used to render novel views, the estimation of facial animation parameters (head tracking) from camera images must be slightly modified in order to avoid inconsistent values for the two approaches and to obtain a smooth blending of all three sources. The optical flow constraint equation is therefore replaced by the modified constraint (4), with head rotation Ry being excluded. With the additional term in the optical flow constraint (4), all parameters can be estimated in the same way as described in Section 2.1. In the hierarchical framework, the image cube must also be downsampled in all three directions. All other components remain the same and allow the estimation of all FAPs consistently with the initially captured frames of the image cube.
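One plausible form of such a modified constraint is sketched below. It is an assumption inferred from the description (an additional term for the rotation axis of the image cube, and displacements that no longer depend on Ry), not the equation as published; the symbol I_ic for the image cube intensity, the increment ΔR_y, and the sign convention are all hypothetical.

```latex
% Sketch under stated assumptions: I_{ic}, \Delta R_y and the sign
% convention are illustrative, not taken from the text.
\frac{\partial I_{ic}}{\partial X}\, d_x(R_x, R_z, t_x, t_y, t_z)
+ \frac{\partial I_{ic}}{\partial Y}\, d_y(R_x, R_z, t_x, t_y, t_z)
+ \frac{\partial I_{ic}}{\partial R_y}\, \Delta R_y
= I_{ic} - I'
```

In this reading, the derivative along the third axis of the image cube plays the same role for head turning that the spatial gradients play for the remaining motion, which would also explain why the cube has to be downsampled in all three directions.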
3.5. Experimental results
In this section, we show some results obtained with the proposed tracking and rendering technique. A video sequence is recorded showing the head and upper body of a person. In the beginning, the person rotates the face to the left and right as shown in Fig. 5 and then starts talking. From the initial part with about 110 frames, the video cube is created from the segmented images and the global head motion is estimated for each of these frames.
For the rendering of new views in a virtual conferencing scenario, the current position and orientation of a person's head as well as jaw movements are tracked with the method described in Section 3.4. The pose of the person can simply be altered by changing the rigid body motion parameters obtained from the real data. The resulting head turn angle Ry determines which frame to use from the image cube for texture mapping. The remaining motion parameters are used for geometry-based warping using the selected texture map. The resulting image shows the person from a different direction and head orientation compared to the original camera image. This is illustrated in Fig. 8, where different views are rendered from a single camera image by changing the estimated global motion parameters. Please note that also occluded areas like the ears are correctly reproduced due to the usage of an image cube with previous frames. Also people with a lot of hair can be rendered naturally as shown in Fig. 9.
Fig. 8. Different head positions created from a single camera frame using the proposed method. The viewing direction in the virtual scene is not identical to the original camera position.









