FACIAL ANIMATION BY EXPRESSION CLONING

by

Junyong Noh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2002

Copyright 2002 Junyong Noh

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-4695

This dissertation, written by Junyong Noh under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: August 6, 2002

Dissertation Committee

Acknowledgements

Writing this thesis was a long journey of six years. The journey was challenging and adventurous. Sometimes I was as excited as if I had made the most precious finding in the world; at other times I went astray, lost direction, and got stuck without knowing how to escape. Ulrich Neumann, my thesis advisor, was always there as a guiding light and a source of encouragement. He was very quick to understand the problems I had, and his insight always led to a valuable solution. I was truly impressed by his ability to grasp the difficult concepts I was trying to explain and to give me immediate feedback with a better formulation. He never pushed me to work but constantly provided motivation to work. He never forced his ideas on me but instead discussed all the possibilities. He made me think rather than blindly implement. I deeply thank him for teaching me how to formulate ideas and how to do research.

I would also like to thank the other members of my thesis committee, Mathieu Desbrun and Shrikanth Narayanan. I am glad I could have them on my committee. Their suggestions and comments helped make this thesis complete and thorough. Through their scrutiny, I was able to evaluate and improve the thesis from totally different perspectives. Mathieu helped me understand fundamental mathematical concepts, which will be priceless for my future research. I also thank Gaurav Sukhatme and Stefan Schaal for serving on my qualifying committee. I met Gaurav when I worked at the robotics lab for my master's degree. He was an impressive Ph.D. student who made me think of becoming a Ph.D. myself. Stefan also helped me broaden my understanding of neural networks and Bayesian theory.

Special thanks go to J.P. Lewis. At every meeting we had, he constantly introduced new concepts that I wasn't aware of. His comments were also valuable in refining my papers and thesis.
I really enjoyed life as a member of the Computer Graphics and Immersive Technologies laboratory. It was my privilege to work with talented people from all over the world: Reyes Enciso, Jong-weon Lee, Tae-yong Kim, Doug Fidaleo, Clint Chua, and Bolan Jiang. They are great individuals as well as able researchers. Whatever idea I came up with, they were always ready to discuss and argue it, trying to find the merits and holes in it. It was a true learning process. Through endless conversations with them, my knowledge became enriched and my ideas became concrete. I also thank Anand Srinivasan for preparing all the lab equipment for me whenever necessary, and Albin Cheenath for carefully crafting various face models.

I deeply thank my family. Without the unconditional love and continuous support of my parents and sisters, none of this would have been possible. To them, I was the best in the world. They never doubted what I could do. No matter how bad things looked, their faith in me never wavered. I appreciate my sisters for truly sharing my feelings, whether in happiness or in sorrow. I cannot thank my parents enough for convincing me to pursue a Ph.D. degree and for being unconditional supporters. This thesis is not only my achievement but also my parents'.

Hearty thanks go to Frances Kim. The last six years have been so much fun with her, and I hope the fun lasts forever. All the worry and stress of completing the thesis were greatly alleviated by her love, caring, and understanding. It is my greatest luck to have had her around as my girlfriend during the Ph.D. years.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1
1. Introduction
1.1. Thesis Overview
1.2. Applications of facial animation systems
Chapter 2
2. Background
2.1. 2D Facial Animation
2.2. 3D Facial Animation
2.3. Performance Driven Facial Animation
Chapter 3
3. 2D Visual Speech Synthesis
3.1. Introduction
3.2. System Overview
3.3. Image Warping and Blending
3.4. Text to Speech Module Integration
3.5. Results and Discussion
3.6. Conclusion
Chapter 4
4. 3D Facial Animation Generation
4.1. Introduction
4.2. Geometry Deformation Element
4.3. Search Methods and Distance Metrics
4.4. Surface Deformation with Radial Basis Function
4.5. Generating Expressions
4.6. Tracking System Integration
4.7. Results and Discussion
4.8. Conclusion
Chapter 5
5. Facial Animation by Expression Cloning
5.1. Introduction
5.2. System Overview
5.3. Dense Surface Correspondences
5.4. Animation with Motion Vectors
5.5. Lip Contact Line
5.6. Automated Correspondence Selection
5.7. Results and Discussion
5.8. Extension 1: Motion Volume Control and Motion Equalizer
5.9. Extension 2: Direct Animation with Motion Capture Data
5.10. Conclusion
Chapter 6
6. Gesture Driven Facial Animation
6.1. Introduction
6.2. System Overview
6.3. Animation by Model Based Interpolation
6.4. Results
6.5. Discussion
6.6. Conclusion
Chapter 7
7. Summary and Future Work
Bibliography
Appendix A
A. A Survey of 3D Facial Modeling and Animation Techniques
A.1. Introduction
A.2. Interpolation
A.3. Parameterizations
A.4. 2D & 3D Morphing
A.5. Facial Action Coding System
A.6. Physics Based Muscle Modeling
A.7. Pseudo or Simulated Muscle
A.8. Wrinkles
A.9. Texture Manipulation
A.10. Fitting and Model Construction
A.11. Animation by Tracking
A.12. Mouth Animation
A.13. Conclusion
Appendix B
B. Radial Basis Functions Fundamentals
B.1. Cost Function Minimization
B.2. Approximation/Interpolation with Radial Basis Functions
B.3. System Solutions
B.4. Regularization Parameter
B.5. Feedforward Neural Network

List of Tables

5-1 Models used for the experiments
5-2 Average errors relative to the motion vector size
5-3 Average errors relative to the model size
A-1 Sample single facial action units
A-2 Example sets of action units for basic expressions

List of Figures

1-1 Immersive teleconferencing system
3-1 Selected features for RBF deformation
3-2 2D regular meshes used for image warping
3-3 Synthesized mouth shapes for 'phone'
4-1 RBF morphing based deformation to mimic muscle based deformation
4-2 RBF deformation driven by tracked 2D feature points
4-3 Geometry Deformation Element defined on the facial surface
4-4 Relationship between GDE and RBF
4-5 3D coordinate computation
4-6 Comparison between edge based and distance based search method
4-7 Expression as a collection of Geometry Deformation Elements
4-8 Deformation driven by feature points in the video stream
4-9 Sample expressions created by one or more GDEs
4-10 Transition from neutral to 'A' mouth shape
4-11 Transition from neutral to angry face
4-12 Transition between two expressions
4-13 Video driven facial animation
Sample expressions cloned onto Yoda
5-1 Expression cloning system
5-2 Surface correspondence by morphing and projection
5-3 Notations used in equations
5-4 Side view of two models after the projection
5-5 Direction and magnitude adjustment of the motion vector
5-6 Transformation matrix as a means to adjust a motion vector direction
5-7 Local bounding box
5-8 Lip contact line alignment
5-9 Automated search results
5-10 Motion capture data and its association with the source model
5-11 Deformed models produce dense surface correspondences
5-12 Adjusted direction and magnitude after the motion vector transfer
5-13 Visually depicted displacement errors
5-14 Cloned expressions onto various models
5-15 Exaggerated expressions cloned on a wide variety of texture mapped target models
5-16 Cloned expressions produced with different scaling values
5-17 Original source expressions at two different frames with all the gain values set to one
5-18 Various expressions generated by applying different gain values
5-19 Asymmetric mesh vs. symmetric mesh
5-20 Marker grouping
5-21 Lip contact line approximation using a Bezier curve
5-22 Side view of the meshes generated with and without a constraint
5-23 Delaunay triangulation performed in 2D space
5-24 Open mouth after lip contact line split
5-25 Expression cloning using the mesh directly generated from the motion capture data
6-1 Comparison between a typical conventional PDFA and our GDFA approach
6-2 GDFA system architecture
6-3 Simplified emotion space diagram
6-4 Sample expressions used to train the system
6-5 Sample phonemes used to train the system
6-6 Five expression scores for an 800 frame video sequence
6-7 Facial animation driven by expression states
6-8 Speech animation driven by viseme states
A-1 Classification of facial modeling and animation methods
A-2 Linear interpolation performed on muscle contraction values
A-3 Zone of influence of Waters' linear muscle model
A-4 Waters' linear muscles
A-5 Triangular skin tissue prism element
A-6 Free form deformation
A-7 Generation of wrinkled surface using bump mapping technique
A-8 Example construction of a person specific model for animation from a generic model
A-9 Some of the anthropometric landmarks on the face
A-10 Animation by face tracking
A-11 Muscle placements around the mouth
B-1 Radial Basis Function network

Abstract

The face plays a vital role in human interactions. It provides a rich set of aural/visual cues to the listener/viewer. Speech facilitates a direct and succinct way to express a person's intent or knowledge aurally, while the face presents additional information visually. Producing human mouth movements for speech or facial expressions involves complicated muscle dynamics. Although a person makes little effort to orchestrate all the muscles that create facial expressions and mouth movements, it is a challenging task to simulate the skin deformations and mechanical movements with a computer. A spectrum of approaches has been used to synthesize facial motions. At one extreme lie the completely physics-based approaches. At the other extreme lies an artistic skill that only manipulates the face surface. Due to the extensive computation, manual processes, or talent involved, the production cost can be very high. This thesis presents an alternative that minimizes the computational and manual costs while maintaining the accurate dynamics and visual quality of the facial motions produced by any available method. Instead of creating new facial animations from scratch for each new model, existing animation data in the form of vertex motion vectors are utilized. This approach allows animations created by any tool or method to be easily retargeted to new models. This process is called expression cloning. Expression cloning makes it meaningful to compile a high quality facial animation library, since the animations can be reused for new models. Expression cloning transfers each vertex motion vector from a source face model to a target model having possibly different geometric proportions and mesh structure (vertex number and connectivity). Since expression cloning works on existing animations, this thesis also describes ways to build a source animation: 2D visual speech synthesis, 3D direct surface deformation, and gesture driven facial animation. Radial basis functions, a network with one hidden layer, are employed as an underlying mechanism for all these tasks.

Chapter 1

1. Introduction

The face plays a vital role in human interactions. It provides a rich set of aural/visual cues to the listener/viewer. Speech facilitates a direct and succinct way to express a person's intent or knowledge aurally, while the face presents additional information visually. Various facial expressions or complexion changes reflect a person's emotions and feelings. Lip reading can compensate for degraded voice quality. Small head movements and eye contact add subtle nuances to communications. Producing human mouth movements for speech or facial expressions involves complicated muscle dynamics.
Unlike other parts of the human body, many different types of muscles are concentrated in the face region. Some muscles act linearly while others act circularly. Some muscles are large, occupying a wide region, while others are small and interwoven with one another. Velocity and acceleration are also important aspects of muscle motions. Although a person makes little effort to orchestrate all these muscles to create facial expressions and mouth movements, it is a challenging task to simulate the skin deformations and mechanical movements (i.e., jaw motions) with a computer.

A spectrum of approaches has been used to synthesize facial motions. At one extreme lie the completely physics-based approaches. These simulate the anatomy of the face, including the characteristics of the various skin layers, underlying muscle movements, and bone structures. Mathematically represented skin, muscles, and bones then synthesize the computer model's facial movements. The success of this approach depends on how well the facial anatomy is understood, how accurately the interactions of the skin, muscles, and bones are mathematically formulated, and how feasible it is for the formulation to be realized on a computer. Currently, the anatomy is often simplified for formulation and implementation on a computer. Although only a subset of muscles or an approximated skin structure is usually modeled, the computational cost is still expensive, executing animations an order of magnitude slower than real time even on a high-end computer.

At the other extreme lies an artistic skill that only manipulates the face surface. Typically, artists sculpt key frames interactively on a computer. The key frames are placed at arbitrary positions in a time line and in-between frames are interpolated automatically. For the reproduction of precise facial motions, arbitrarily many key frames need to be sculpted at the artist's discretion, a task which requires artistic talent rather than an understanding of any profound anatomical mechanisms. This approach typically produces the highest quality facial animation and is widely used in productions. Due to the extensive manual processes and talent involved, the production cost can be very high.

The method presented in this thesis seeks an alternative that minimizes the computational and manual costs while maintaining the accurate dynamics and visual quality of the facial motions produced by any available method. This research introduces the issue of transferability: how easily existing animation sequences can be reproduced on different models. Whereas conventional techniques are satisfied with showing the best facial animation they can achieve with one specific model[1], this research addresses a new area, that of easily duplicating any previously created animation sequences onto other models. This concept has not been addressed in any previous approach. The method adds new utility and value to existing facial animation mechanisms/libraries. Any facial animation can serve as a source for creating similar animations of completely different models.

[1] A model deformed in various ways is still one model in this context unless it has a different vertex number and connectivity.
The effort and trouble of initial sculpting or physical simulation are incurred only once, and easy duplication follows.

Four different techniques constitute this thesis. The first method for facial animation was tested in 2D image space. Image manipulations are attractive in that images contain all the visual qualities needed for realism, and their manipulation requires neither extensive simulation nor artistic skill. A drawback of the image-based approach is that the viewpoint is restricted to that of the captured images. This led to the second technique, which animates a 3D model. The model is deformed to create various facial expressions either manually or driven by tracking data. The manipulation is based purely on surface geometry changes and no underlying muscle mechanisms are considered. This approach quickly produces facial animation, but high quality animations require painstaking manual operations or high accuracy tracking data due to the lack of constraints imposed by any underlying structures.

Like other approaches, the first two techniques do not reuse existing data. Any effort used to create animations must be repeated to create similar animations for other models. The third technique aims to create new animation from existing animation data. This approach focuses only on surface geometry. There is no need to consider complicated muscle dynamics or artistic decisions because source animations presumably have gone through the careful calculation or artistic processes already. The first and second techniques presented here, as well as any other approaches, can be considered as processes for building source animation sequences. Source animation duplication onto new models is an efficient process with minimal manual or computational cost.

Due to the difficulty of simulating accurate dynamics on a 3D face model, tracking data are utilized to drive facial animation, as mentioned in the second technique. However, conventional approaches as well as ours do not explicitly address the problem of converting the actor's observed 2D/3D facial motions into an arbitrary model's facial animation parameters. The fourth technique presents a general solution to this problem by explicitly introducing a high-level gesture layer interface between sensing and animation. By this abstraction, a non-intuitive signal conversion problem is recast as a simpler parameter mapping, regardless of the sensing and animation techniques employed. Animations produced by this technique can also serve as a source for the third technique. We call the first technique 2D Visual Speech Synthesis (VSS), the second 3D Direct Surface Deformation (DSD), the third Expression Cloning (EC), and the fourth Gesture Driven Facial Animation (GDFA).

1.1. Thesis Overview

This thesis consistently makes use of a neural network with one hidden layer, a family of Radial Basis Functions (RBF), to produce facial animation. Abstractly, network learning establishes a smooth mapping between a sparse and typically scattered set of input and output data. In the context of facial animation, the input and output data are the original and desired locations of feature points on the face. As new data values are predicted based on the relationship established by network learning, the face mesh nodes are interpolated to new positions, producing the desired deformations.
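Since all four techniques rest on this RBF mapping, a minimal sketch of the pattern may help fix ideas: a handful of feature point correspondences train the network, and every mesh node is then pushed through the learned mapping. The following Python/NumPy sketch uses assumed function names and width/regularization values and is not the thesis implementation; chapter 3 gives the exact form used there (equations 3-2 to 3-4).

```python
import numpy as np

def fit_rbf(src, dst, width, reg=0.01):
    """Learn a Gaussian-RBF mapping that carries src feature points to dst."""
    d = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    H = np.exp(-(d / width) ** 2)                     # basis matrix
    # Regularized linear solve; one weight column per output dimension.
    return np.linalg.solve(H + reg * np.eye(len(src)), dst)

def eval_rbf(queries, src, w, width):
    """Evaluate the learned mapping at arbitrary points (e.g. mesh nodes)."""
    d = np.linalg.norm(queries[:, None, :] - src[None, :, :], axis=-1)
    return np.exp(-(d / width) ** 2) @ w

# Toy example: four fixed corners plus one feature point that moves by (1, 2).
src = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.], [5., 5.]])
dst = src.copy()
dst[4] += [1., 2.]
w = fit_rbf(src, dst, width=8.0)
print(eval_rbf(np.array([[4., 6.]]), src, w, width=8.0))  # a nearby node follows the motion smoothly
```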
Specifically, in 2D space, a set of base images are warped, blended, and morphed together to synthesize visual speech animation (chapter 3). In 3D space, a model is locally deformed around control points on the surface to create a gamut of facial expressions (chapter 4). Once an animation sequence is created for any model, it can be easily duplicated onto other models: the source model is morphed to the target model to identify dense surface correspondences, and motion vector transfer at each correspondence produces the facial animation (chapter 5). The analysis of high-level facial gestures can drive facial animation by interpolating prepared base 3D expression and speech models (chapter 6).

All these approaches uniformly employ an RBF for various tasks such as 2D warping, 3D volume morphing, or hyperspace interpolation. An RBF is a nice alternative to more complicated neural networks for its simplicity in structure and fast computation with comparable performance. Knowing the capability of RBFs in face modeling, my aim was to utilize the RBF as a simple but useful instrument for facial animation. Its usefulness is demonstrated by the four techniques illustrated in this thesis.

2D Visual Speech Synthesis (VSS): An arbitrary sequence of speech animation is synthesized from base viseme images. Utilizing coarse regular meshes, images are warped to align the features between two key frames. Blending the aligned image textures differently at each time step produces the transition frames. Snapshot images alone do not contain information about speech dynamics, so a Text to Speech (TTS) module provides the duration of each phoneme transition. Spline interpolation approximates co-articulation effects.

3D Direct Surface Deformation (DSD): Direct manipulation of a polygonal model surface creates localized facial deformations. Animations are produced by controlling an arbitrary sparse set of control points defined on the surface of the model. The ability to directly manipulate a face surface with a small number of point motions facilitates an intuitive method for creating quick facial expression animations. A prototype performance driven facial animation system is also built by integrating a vision tracking technique.

Expression Cloning (EC)[2]: Animation is duplicated from one face model to another by transferring vertex motion vectors. A model morphing followed by a cylindrical projection establishes dense surface correspondences between the two models. Directions and magnitudes of motion vectors are adjusted for local surface variations. The models can be of different geometric proportions and mesh structures (vertex number and connectivity). Heuristic correspondence search rules bootstrap the whole process. Cloned expression animations preserve the relative motions and dynamics of the original facial animations.

[2] This thesis shows Expression Cloning in 3D space only, as the main focus lies in 3D facial animation.

Gesture Driven Facial Animation (GDFA): Sensing and analysis provide automatic animation control. Unlike conventional performance driven facial animation (PDFA), the explicit use of abstracted high-level gesture values assures independence between the actor and the face model.
The actor and model do not have to conform in shape, and the correspondence specification between the two is no longer necessary. Animation can be as expressive as the underlying animation mechanism allows, with no direct constraints arising from what the sensing system can measure. GDFA also ensures the modularity of the system, allowing the adoption of any known sensing or animation mechanism.

This thesis emphasizes expression cloning for its innovative ideas in producing facial animation. It enables the compilation of a high quality facial animation library and its reuse. Reuse is not possible with currently available techniques; the concept of building a facial animation library only becomes meaningful with the introduction of expression cloning. The 2D visual speech synthesis, 3D direct surface deformation, or gesture driven facial animation described herein can be considered as methods for building a facial animation library. Similarly, a carefully tuned model can drive facial animation for arbitrary models with expression cloning. For example, gesture driven facial animation prepares initial base models with painstaking manual intervention, but the repetition can be avoided by expression cloning.

Chapter 2 summarizes previous work on 2D/3D facial animation, and a comprehensive survey on facial modeling and animation is found in appendix A. Chapters 3, 4, 5, and 6 delve into the four major areas of 2D visual speech synthesis, 3D direct surface deformation, expression cloning, and gesture driven facial animation, respectively. In chapter 7, future work beyond the scope of this thesis is presented and my long-term vision in facial animation is addressed. Appendix B describes the
A challenge is then how to produce various facial animations on various models efficiently. One choice is to repeat the process requiring the same computational costs and artistic talents for each model’s facial animation. Although sounds very inefficient, it has been common practice in game and movie production. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Figure 1-1 Immersive teleconferencing system (Courtesy Doug Fidaleo) The use of expression cloning, however, can change the paradigm for character facial animation. For example, the artist prepares a number of facial animation snippets with a variety of nuances on the generic template models such as man, woman, baby, alien, cow, dragon and so forth. Once such a large animation pool is constructed, facial animation for new model becomes easy by transferring the ready made sequence from a similar model in the animation pool. With a little bit of alteration during the transfer, animations can be customized to best fit to new model. With this paradigm, the cost for producing facial animations for arbitrarily many models could be as cheap as the cost for one. Avatar originally means the incarnation of the Hindu deity but it is also commonly used now to refer to a person’s virtual representation. A typical application utilizing an avatar would be an immersive environment 3D teleconferencing system. In a shared virtual space, eye contact and gaze gestures between conference participants are possible. Virtual objects can be manipulated and inspected at arbitrary angles. High-level behavior encoding of the scene including face models also leads to high compression ratios. Figure 1-1 illustrates the concept. A typical setting would track the face with a camera and reproduce the 3D animation at a remote site. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 9 Gesture driven facial animation can be applied to build such a system. The user prepares a finite set of person specific facial gesture models unique to the conference participant. Sensing and analysis then drive facial animation to reproduce the actual dynamics and unique expressions of the person. In principle, any type of high-level facial gestures such as eye twitches or expressions of emotion can be captured. The same scheme is also applicable to speech animation. With the increasing number of initially prepared gesture models, the avatar’s animation becomes closer to the actual person’s behavior. Conversely, with a small number of initial models, the avatar’s gestures might be limited but the fidelity of the animation would still remain intact. If person specific gestures are not important, expression cloning can also be utilized to bypass the initial model preparation for different models. In this case, model preparation is done once with a generic model and sensing drives animation for various models with expression cloning. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 10 Chapter 2 2. Background The facial animation synthesis has been tackled both in 2D and 3D space. In the constrained case of animating only the frontal face with small head rotations and translations, image based 2D approach can provide a quick and easy solution avoiding the need for 3D face and head modeling. 
If the focus is further restricted to mouth animation, 2D animation becomes even more attractive because the 3D mouth shapes created during speech are difficult to model and generalize. 3D tongue modeling and behavior also require an elaborate mechanism [Cohen 1993][Stone 1991]. In contrast, image-based approaches simplify these issues because the appearance and animation model can be readily captured in the form of images.

Facial animation in 3D space provides full flexibility. Once a complete person specific head model is created, it can be inspected at arbitrary angles, lit with virtual lighting, and displayed at any image resolution. Integration with a full body model produces a virtual actor in games or movies. The flexibility demands, however, complicated modeling and animation mechanisms. Typically, a special device to capture the 3D geometry with texture is required, and accurate face dynamics need to be simulated.

The incorporation of computer vision techniques leads to the concept of performance driven facial animation, where tracked human actors direct the model to produce animations. It involves issues of animation control as well as animation creation. Accurate tracking of feature points or edges is important and often necessitates markers on the face. The tracked 2D or 3D feature motions are filtered or transformed to generate the motion data needed for a specific animation system.

Previous work in the above three categories is briefly reviewed in the following sections. Refer to appendix A for a detailed treatment of various facial modeling and animation techniques. In appendix A, facial modeling techniques using 3D laser scanners as well as anthropometry are explained. Facial animation techniques that are not discussed in this chapter, such as wrinkle generation, vascular expressions, spline based pseudo muscles, the facial action coding system (FACS), and so on, are also covered.

2.1. 2D Facial Animation

Facial animation in 2D space can be classified into three categories: dense pixel based image warping/morphing techniques, coarse mesh based image morphing and texture blending, and approaches that build a database of image segments. A text to audiovisual speech animation is synthesized using dense pixel based image warping and morphing between two key frame viseme images in [Ezzat 1998]. Despite the fact that co-articulation effects are ignored, the synthesized talking animation appears very natural. Pixel based image warping and morphing are, however, a pricey operation. For example, the 65536 pixels in a small 256 x 256 window are a huge amount of data to process per frame. Otherwise, memory-devouring preprocessing is indispensable. Consequently, animation is restricted to the small mouth region. Extending this approach to incorporate wrinkles and expressions over the whole face would be extremely difficult. Instead of performing the dense pixel operation, warping/morphing can be performed on a sparse mesh grid defined on the image [Perng 1998]. For instance, a new mouth shape is synthesized by a linear combination of several base images [Gao 1998]. Textures from the base images are blended together with weights determined by the tracked locations of feature points on a triangular patch.
Similarly, slightly rotated views of the face are produced by a linear combination of three basis images captured from different views [Koufakis 1999]. Mesh based texture mapping can be a fast process exploiting the hardware texture mapping capability common in most graphics cards today. Mesh preparation for each individual is, however, a tedious process for most of these approaches. There are automated ways to determine optimal triangulations [Guibas 1992][Shewchuk 1996], but as pointed out in [Koufakis 1999], automated triangulations do not capture the facial anatomy, so manually prepared meshes are often preferred.

As an alternative to creating new images with warping/morphing and blending, a collection of existing images can be stored in a database and retrieved appropriately to synthesize talking faces. Video rewrite extracts tri-phone images from video footage [Bregler 1997]. New utterances are synthesized by concatenations of the tri-phone segments, reproducing co-articulation effects. Cosatto and Graf collect various image samples of a segmented face and parameterize them to synthesize a talking face [Cosatto 1998]. Emotional expressions including eye movements and forehead wrinkles are exhibited by displaying different parts of the face with different sample segments. Methods that exploit a collection of existing sample images must search their database for the most appropriate segments to produce a desired animation. The success of the synthesized sequence relies heavily on the pre-constructed database. The lack of any ability to generate new images degrades the synthesized animation quality when newly encountered expressions or phrases are not already in the database.

2.2. 3D Facial Animation

For 3D facial animation, a person-specific model is typically prepared by deforming a generic model in a preprocessing step. The generic model contains all the animation parameters necessary for the subsequent person specific animations. The model is animated by mesh node displacements according to the motion rules specified by deformation engines such as vector muscle models, layered skin mass spring systems, finite element methods, free form deformation, or simply interpolation. Texture mapping is also employed to improve realism that is hard to achieve by geometric deformations alone.

Vector-based muscle models are adopted widely for their compact representation [Waters 1987][Waters 1995]. A delineated deformation field models the action of muscles upon skin. The muscle definition includes the vector field direction, an origin, and an insertion point. The cone shaped field extent is defined by cosine functions and falloff factors. Facial animation is achieved by changing the contraction parameters of the embedded muscles under the face surface (a simplified sketch of such a deformation field is given below). This approach assumes that muscles are placed under the face model in correct locations. Placing muscles in 3D space, however, is not intuitive or consistent from model to model.

Layered skin models with an embedded mass spring system mimic the anatomical structure and dynamics of the human face [Lee 1995]. The mesh consists of three layers corresponding to skin, fatty tissue, and muscles tied to bones. Elastic spring elements connect each mesh node and each layer. Muscle forces propagate through the mesh to create deformations. The computational cost for such spring systems is very high.
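To make the vector muscle idea concrete, here is a minimal Python sketch of a linear muscle displacing skin vertices within a cone shaped zone of influence with a cosine falloff. It is a simplified illustration in the spirit of Waters' model, not his formulation: the parameter names (half_angle_deg, falloff_start, gain) and the exact falloff and displacement terms are assumptions made for the example.

```python
import numpy as np

def linear_muscle_displace(verts, origin, insertion, contraction,
                           half_angle_deg=30.0, falloff_start=0.6, gain=0.2):
    """Pull skin vertices toward the muscle origin inside a cone of influence.

    verts       : (N, 3) array of skin vertex positions
    origin      : (3,) muscle attachment near the bone (vertices are pulled toward it)
    insertion   : (3,) muscle attachment to the skin; sets the field direction and extent
    contraction : scalar contraction parameter, typically in [0, 1]
    """
    axis = insertion - origin
    length = np.linalg.norm(axis)
    axis = axis / length
    cos_half = np.cos(np.radians(half_angle_deg))
    out = verts.astype(float).copy()
    for i, v in enumerate(verts):
        d = v - origin
        r = np.linalg.norm(d)
        if r == 0.0 or r > length:
            continue                                   # outside the field extent
        cos_angle = float(np.dot(d / r, axis))
        if cos_angle < cos_half:
            continue                                   # outside the cone
        # Angular falloff: full strength on the muscle axis, zero at the cone boundary.
        angular = (cos_angle - cos_half) / (1.0 - cos_half)
        # Radial falloff: full strength near the origin, cosine fade toward the far end.
        if r < falloff_start * length:
            radial = 1.0
        else:
            t = (r - falloff_start * length) / ((1.0 - falloff_start) * length)
            radial = np.cos(t * np.pi / 2.0)
        out[i] = v - contraction * angular * radial * gain * d
    return out
```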
Facial animation using Finite Element Methods (FEM) [Basu 1998][Essa 1996][Essa 1994][Guenter 1992][Pieper 1992] faithfully reconstructs facial geometry. FEM implicitly defines interpolation functions between nodes based on a description of the physical properties of the material, typically a stress-strain relationship. When external forces are applied, the displacements of the nodes are computed to minimize the local stresses and strains imposed on the nodes. FEM alleviates the intricate control problem occurring in muscle-based approaches because the muscle contraction parameters are generally approximated automatically in a training stage. The difficulty of this approach lies in estimating the actual physical properties of the face skin.

A simpler way to create facial animation is to manipulate only the skin surface, ignoring underlying bone and muscle interactions. One such way is to use free-form deformation (FFD) [Kalra 1992]. FFD deforms volumetric objects by manipulating control points arranged in three-dimensional cubic lattices. Conceptually, a face model is embedded in an imaginary, clear, and flexible control box containing a 3D grid of control points. As the control box is squashed, bent, or twisted into arbitrary shapes, the embedded model deforms accordingly. Imprudent manipulation of the control points may lead to non-facial shape deformations due to the missing constraints from the bone and muscle structures.

2D morphing combined with 3D transformations of a geometric model produces facial animation [Pighin 1998]. The success of this approach depends on how realistically a collection of facial models with various expressions can be created. It requires the selection of a number of feature points and careful preparation of texture maps. The procedure is more like a good modeling technique, and animations are limited to interpolations between pre-made models. Various animations only become possible after the preparation of a sufficient number of models.

2.3. Performance Driven Facial Animation

Performance driven facial animation (PDFA) tracks the face while an actor is performing facial gestures. The recorded or on-line video stream is analyzed to extract the motion of facial features. The motions drive the deformation of the face model to produce the synthesized animation. PDFA was first introduced in 1990 [Williams 1990] and has been used to drive 2D animation [Azarbayejani 1993][Essa 1996][Koufakis 1999] and 3D animation [Basu 1998][Eisert 1998][Guenter 1998][Pighin 1999]. The robustness of a tracking system directly affects the quality of the resulting animation. Many tracking systems are based either on snakes or on optical flow.

Snakes, or deformable curves, are widely used to track intentionally marked-up facial features. The recognition of facial features with snakes is primarily based on color sample identification of the highlighted features and edge detection. Many systems use snakes coupled with underlying muscle mechanisms to drive facial animation [Kass 1987][Thalmann 1993][Terzopoulos 1993][Terzopoulos 1991].
Colored markers painted on the face or lips [Kishino 1994] [Moubaraki 1995][Ohya 1995] [Patterson 1991] [Williams 1990] are extensively used to ease the process of tracking features on the video sequences. However, marking on the face is intrusive. Also relying on the marks restricts the scope of the geometric information to be retrieved only to the marked points. Optical flow with spatio-temporal normalized correlation measurements has been utilized for the feature tracking without the markers on the face in [Essa 1996]. The pixel-by-pixel measurement of the surface motion is coupled to a physically based face model of a muscle control mechanism Optical flow provides the detailed spatio-temporal records of the displacement of each point on the face surface. This detailed geometric measurement serves to recover the muscle control parameters required to generate facial animation in [DeCarlo 2000]. Estimation errors are adjusted by a feedback loop. A similar idea of using optical flow for motion estimation and facial expression analysis is presented by Eisert and Girod [Eisert 1998], who employ a more sophisticated feedback loop. Motion parameter estimation is performed with the low resolution first and repeated with the higher resolutions. This hierarchical feedback scheme allows large displacement vectors to be estimated between two successive video frames. A drawback of optical flow systems is the high computational cost resulting from the huge amount of data to process. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 16 Chapter 3 3. 2D Visual Speech Synthesis3 3.1. Introduction The goal of visual speech synthesis in 2D space is to have a talking agent with real time interaction capability with the user. The agent can represent the newscaster or the intelligent desktop agent. It can also be applied to a low bit rate teleconferencing once the 2D representations of the participants are created. The approach in this chapter employs a coarse 2D mesh based image warping, morphing and blending. Two key frame images are manipulated for each phoneme transition. At each time step, the positions of the mesh nodes are determined by RBF coefficients associated with a set of sparse feature points based on the interpolation coefficients. Texture blending of the two displaced key frames produces a speech animation sequence. There are a couple of rationales behind our mesh-based approach. Dense pixel warping and morphing suffers from high computational cost [Ezzat 1998]. Constructing the database with sample image segments becomes impossible when the database is incomplete. On the contrary, the complete database would be unreasonably big [Bregler 1997][Cossato 1998]. Most of the mesh-based approaches, however, have their own drawback of having to define the mesh whose resolution is 3 IEEE International Conference on Multimedia and Expo [Noh 2000B] R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 17 Eight points on the lips, eight points on the face boundary, and one point on the nose j are shown on the viseme CM (left) and AA (right). j Figure 3-1 Selected features for RBF deformation determined by the number and the positions of control points [Gao 1998] [Koufakis 1999][Pemg 1998]. The use of RBF as the image-warping engine obviates the problem of typical mesh-based approaches. With RBF, the number and the locations of the control points do not constrain the mesh shape. 
Regardless of the control point distribution, the mesh can be a simple regular shape, which eliminates the need for manual mesh drawing.

The synthesis of facial expressions with RBF is not completely novel; there have been attempts to apply RBF to create facial expressions [Arad 1994]. These approaches, however, warp only a single image to deform the face. The quality of single-image warping degrades as more distortion is required, and single images lack information about the inner mouth region. Our approach makes use of multiple viseme images.

3.2. System Overview

A visual speech synthesis system is built from multiple base viseme images. Snapshots of 26 viseme images are taken. The number of viseme images may differ depending on how the phonemes are classified into viseme groups. We divide the 39 phonemes into 26 visemes, similar to [Bregler 1997]: (1) CH, JH, SH, ZH (2) K, G, N, L (3) T, D, S, Z (4) P, B, M (5) F, V (6) TH, DH (7) W, R (8) HH (9) Y (10) NG (11) EH (12) EY (13) ER (14) UH (15) AA (16) AO (17) AW (18) AY (19) UW (20) OW (21) OY (22) IY (23) IH (24) AE (25) AH and (26) CM for closed mouth.

Corresponding feature points are manually labeled on each image: 8 on the boundary of the face, 1 on the nose, and 8 on the lips (figure 3-1). Feature point selection could be automated with the help of feature tracking systems [Maurer 1996][You 1996]. To ensure the convexity of the RBF interpolation, the four corners of the mesh are also used as feature points. A 3 x 4 mesh grid for face movements and a 14 x 13 mesh grid for mouth movements are defined implicitly on the image, as shown in figure 3-2. One quad of the face mesh covers 60 x 60 pixels and one quad of the mouth mesh covers 6 x 6 pixels.

[Figure 3-1: Selected features for RBF deformation. Eight points on the lips, eight points on the face boundary, and one point on the nose are shown on the visemes CM (left) and AA (right).]

[Figure 3-2: 2D regular meshes used for image warping. A coarse mesh grid is placed on the face (left) and a finer mesh on the mouth (right).]

To produce an animation sequence, a user types in arbitrary text, which is decomposed into phonemes by the CMU dictionary [CMU]. Corresponding visemes are then identified by a phoneme-to-viseme table lookup. At each transition from one phoneme to the next, two key frame visemes are warped and blended together, following a co-articulation path approximated by spline curves. The interpolation coefficient determining the contribution of each key frame to warping and blending is computed from the time stamps output by the text-to-speech module.

$I_{new} = (1-c)\,I_1(c) + c\,I_2(1-c), \qquad 0 < c < 1$    (3-1)

$I_{new}$ is the synthesized image, $I_1$ and $I_2$ are the two key frames, and $c$ is the interpolation coefficient. In each term, the coefficient to the left of $I$ is the blending weight and the value in parentheses is the warping amount.
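As a concrete illustration of this pipeline, the following minimal sketch (Python with NumPy; the function names and data layout are illustrative assumptions, not the thesis implementation) fits the regularized Gaussian RBF described in section 3.3 (equations (3-2) to (3-4)) so that a key frame's feature points map to their spline-evaluated positions, warps the mesh nodes, and then blends two already-warped key frames according to equation (3-1). The pixel resampling itself (texturing each key frame onto its warped mesh) is omitted.

```python
import numpy as np

def rbf_warp_nodes(src_feats, dst_feats, nodes, lam=0.01):
    """Warp 2D mesh nodes with a Gaussian RBF fitted so that the key frame
    feature points (src_feats) map to the spline-evaluated targets (dst_feats)."""
    d = np.linalg.norm(src_feats[None, :, :] - src_feats[:, None, :], axis=2)
    widths = d.max(axis=0)                                  # s_j as in equation (3-3)
    H = np.exp(-(d / widths[None, :]) ** 2)                 # Gaussian basis matrix
    n = len(src_feats)
    W = np.linalg.solve(H + lam * np.eye(n), dst_feats)     # regularized solve, eq. (3-4)
    dn = np.linalg.norm(nodes[:, None, :] - src_feats[None, :, :], axis=2)
    return np.exp(-(dn / widths[None, :]) ** 2) @ W         # evaluate eq. (3-2) at the nodes

def blend_warped_frames(I1_warped, I2_warped, c):
    """Blend two already-warped key frames with interpolation coefficient c, as in eq. (3-1)."""
    return (1.0 - c) * I1_warped + c * I2_warped
```

In a full system, rbf_warp_nodes would be called once per key frame at every time step, with c supplied by the timing computation of section 3.4.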
3.3. Image Warping and Blending

Instead of interpolating key frames linearly, spline curves (an X-spline is implemented to approximate a B-spline with the end-point interpolation property [Blanc 1995]) are applied to smooth the visual speech path, simulating the co-articulation effect. Each feature point has a temporally associated cubic spline curve with 4 control points extracted from the two current key visemes, one preceding viseme, and one following viseme. At the beginning and the end of the animation, the CM (closed mouth) viseme is inserted as the preceding and following viseme.

At each time step, the temporal splines are evaluated and smoothed feature point locations are determined. Each key viseme image is warped toward the determined feature point locations with the Gaussian RBF $h(r) = e^{-(r/s)^2}$. Plugging the Gaussian RBF into equation (B-3) yields

$x^{target}_i = F(x^{source}_i) = \sum_{j=1}^{N} w_j\, e^{-\left( \|x^{source}_i - x_j\| / s_j \right)^2}$    (3-2)

where $s_j$ denotes the width of the Gaussian basis function, $N$ the number of feature points, $x^{source}$ the key frame feature point locations, and $x^{target}$ the feature point locations evaluated by the splines. The dimension of $x$ is 2 (i.e., image-space x, y coordinates). The width $s_j$ is simply determined by

$s_j = \max_{i \ne j} \| x^{source}_i - x_j \|$    (3-3)

Substituting the $N$ feature points into equation (3-2) with the addition of a regularization parameter $\lambda$ results in a linear system whose solution is given in equation (B-9),

$w = (H + \lambda I)^{-1} x^{target}$    (3-4)

where $H$ is the basis matrix constructed with the Gaussian function and $\lambda$ is set to 0.01 after a few manual trials. The equation is easily solved by matrix inversion to obtain the coefficient set, and the mesh nodes are warped accordingly. The two warped key frames are blended together, followed by a merge between the mouth region image and the whole face region image. The mouth image alpha values change linearly from 0 to 1 over 10 pixels to eliminate visual artifacts that may occur at the boundary of the mouth region.

3.4. Text to Speech Module Integration

The Festival Text to Speech (TTS) system is integrated to provide the visual speech dynamics as well as a synchronized voice [Black 1997]. The timing information from TTS determines the two key frame visemes and an appropriate interpolation value at each time step. Depending on the duration of the phoneme transition, the number of frames to be synthesized is also determined, given the measured frame rate. The interpolation coefficient $c$ of equation (3-1) is computed as follows [Ezzat 1998].

$c = \dfrac{T_{cur} - T_{ps}}{T_{td}}$    (3-5)

$T_{cur}$ is the current time, i.e., the current frame number divided by the measured frame rate $f$. $T_{ps}$ denotes the current phoneme transition start time; it is equal to the accumulated transition durations of the preceding phonemes. $T_{td}$ is the phoneme transition duration, i.e., the average of the two phoneme durations. Equation (3-5) states that the interpolation coefficient is the time elapsed past the phoneme transition start divided by the phoneme transition duration. If the interpolation coefficient $c$ exceeds one, it signals that the key frames should shift to the next viseme pair.

[Figure 3-3: Synthesized mouth shapes for 'phone'. Mouth transitions from CM (top left), to F (top right), and to OW (bottom right).]

3.5. Results and Discussion

We synthesize speech animations with and without moderate head movements. To create an animation without head movements, all the viseme images are normalized by offline warping, reducing the subject's translation and rotation discrepancies between images. Only the mouth region is synthesized and merged with an arbitrary background image.
For an animation with head movements, the image normalization is skipped, allowing the subject's pose variations. Both the mouth region and the whole face are synthesized and merged together. This approach has the drawback of coupling a specific viseme with a specific head position, but moderate head movements during speech are common and livelier than a stationary frontal pose. Figure 3-3 shows examples of the synthesized images for the transitions from the viseme CM, to F, and to OW when pronouncing 'phone'. Base images are not shown. Various lip shapes are well synthesized while the head pose changes smoothly, and no artifacts are visible around the boundary of the mouth region. The computation is performed on a modest PC in real time.

A possible extension of this work is to synthesize facial expressions. It is straightforward to apply the same technique to generate wrinkles on the face. Synchronized wrinkle emergence and proper eye blinking would make the speech animation more convincing. Another extension would be more systematic head movement, decoupling a viseme from a specific head pose.

3.6. Conclusion

A way to synthesize 2D visual speech is described. A set of base viseme images is taken and corresponding feature points are defined on each image. The synthesis is based on RBF deformation and texture blending. The speech dynamics are extracted from a text-to-speech synthesis system and approximated by spline curves. The method produces a voice-synchronized animation in real time.

Chapter 4

4. 3D Facial Animation Generation (ACM Virtual Reality and Software Technology [Noh 2000A])

4.1. Introduction

Generating facial animation in 3D space offers many advantages. A virtual camera can be placed at any location, giving the viewer freedom to inspect the model. A virtual light source can be added to arbitrarily change the shadows on the face. With the addition of a body model, face animation can be useful for games or movies. When augmented with tracking systems, immersive-environment teleconferencing becomes possible, providing eye contact and gaze gestures between conference participants.

Creating facial animation in 3D space is a laborious task. Painstakingly sculpting key frames guarantees the highest quality facial animation, but it takes a long time even for highly skilled artists to generate animation by sculpting. There have been notable attempts to reduce the manual work and automate the facial animation synthesis process. Facial bone and muscle structure models faithfully reconstruct the face dynamics [Basu 1998][Essa 1996][Lee 1995]. Surface deformation tools allow the user to manipulate control points to achieve a variety of expressions [Kalra 1992]. Parameterization of the face model controls a group of vertices with high-level descriptions, e.g., move the eyebrows up [Parke 1982].

[Figure 4-1: RBF-morphing-based deformation mimicking muscle-based deformation. The top row shows deformation generated by muscle actuation; the bottom row shows deformation by RBF morphing with eight control points. Note how closely RBF-based morphing synthesizes the top-row shapes with a small number of 3D control points.]

Existing facial animation techniques often require tedious preprocessing, method-specific tuning, or high computational cost.
Furthermore, the way these techniques operate is generally unintuitive to naive animators. Due to the complexity of the facial structure and of facial expressions, complete off-the-shelf facial animation systems are still seldom available to the public. Given the limitations of current methods, our design goals for a new facial animation system are summarized as follows.

• The system should be general enough to work with any facial mesh without special prior preparation. It is inconvenient to go through preprocessing every time a new mesh is used.
• The system should be intuitive and easy to use for naive animators with no understanding of the complicated underlying mechanisms. The more unintuitive low-level parameters are exposed to users, the harder it is to produce animations.
• The system should be fast enough for interactive performance. Immediate visual feedback guides animators through animation production.
• The system should be flexible enough to create a variety of expressions. It is desirable for the gamut of produced expressions to be limited by the animator's imagination rather than by method-specific parameters.

The approach in this chapter creates a variety of facial expressions while attaining the above design goals. Geometry deformation is performed simply by clicking and dragging any point on a polygonal face mesh. Nearby vertices in the influenced region are displaced smoothly to generate expressions. This work is most similar to the free-form deformation (FFD) approach [Kalra 1992], where control point manipulations lead to facial skin surface deformations. Unlike FFD, however, the control points lie on the surface of the face mesh, allowing direct manipulation. A Radial Basis Function (RBF) network is employed as the surface deformation mechanism.

We performed a preliminary test to see how well RBF-based deformation can approximate a target face model. In figure 4-1, 8 correspondences are specified between the models in the top row and the bottom row: 2 on the eyebrows, 2 near the nose, and 4 on the lips. The top model is deformed by the Waters muscle system and the new locations of the 8 feature points are handed over to the second model, in which an RBF network is embedded. As can be seen from figure 4-1, RBF-based morphing smoothly deforms the model to closely synthesize the desired shapes with a small number of feature points. The same technique is applied to synthesize animations from 2D feature points obtained by analyzing video sequences in figure 4-2. The tracked 2D points are converted into 3D for RBF deformation as before. Note that the wrinkles on the forehead and around the mouth are synthesized with dynamic texture mapping [Fidaleo 2000].

[Figure 4-2: RBF deformation driven by tracked 2D feature points.]

This preliminary test verified that an RBF network can be useful for facial animation. The test, however, requires feature points to be predefined before any deformations are produced. We eliminate this preprocessing and effectively localize the deformation area by using a number of small RBF deformation elements instead of a single global RBF network. The grab-and-drag paradigm is realized given a plain mesh with no predefined animation parameters.
4.2. Geometry Deformation Element

Face mesh geometry is locally deformed by a geometry deformation element (GDE). A GDE is the smallest deformation unit defined on the surface of the face. A GDE consists of a control point, the region of influence around the control point, anchor points that lie on the boundary of the influence region, and an underlying RBF system (figures 4-3, 4-4). The movable control point and the stationary anchor points determine the displacement of the vertices in the influence region. Specifying any point on the face creates a GDE. A control point may be derived from a 2D image by projecting it onto the 3D mesh surface. The region of influence is bounded by a distance metric that determines the stationary anchor points. The number of mesh vertices in the influence region can be large or small; an influence region of one vertex reduces deformation to vertex manipulation.

[Figure 4-3: Geometry deformation element defined on the facial surface. The red point represents an initial control point position and the blue point a new position. The green points are anchor points, and the area inside the green points is the region of influence. The shape of the region of influence is affected by the mesh regularity.]

To create a geometry deformation element:
1. Specify a control point on the mesh and an influence extent to control the deformation around the point. The selected point does not have to coincide with any of the vertices in the mesh.
2. For points selected in 2D images, convert them to 3D points on the model surface by ray casting.
3. Find the vertex on the mesh nearest to the selected 3D point. This vertex becomes the root of the search tree of mesh edges.
4. Search down the tree of mesh edges with a breadth-first search, determining all vertices within a specified distance metric.
5. Leaf nodes of the search tree become the anchor points and, together with the specified control point, initialize the RBF system associated with the GDE (figure 4-4).

[Figure 4-4: Relationships between GDE and RBF. The user supplies the control point; the GDE supplies the control point, anchor points, and feature points as input to the RBF, which outputs the mapping coefficients and the displaced movable points.]

To actuate a geometry deformation element:
1. Specify a new position of the control point, either by mouse dragging or by tracking a facial feature in a video sequence.
2. Convert the new 2D control point to 3D by ray casting.
3. The RBF system computes the new locations of all vertices in the influence region based on the new control point position and the stationary anchor points.

Note that the first step in creating and actuating deformation elements can be accomplished by manual input or by automation. In either case a 2D point is identified (by mouse click or by feature detection and tracking). Unless the selected point coincides exactly with a mesh vertex, its 3D location is unknown. However, we have a 3D face model, so we can approximate the 3D point position by ray casting.
[Figure 4-5: 3D coordinate computation. A ray from the camera center (eye) through the specified 2D control point on the screen intersects a polygon of the face mesh; the recovered 3D point lies at the intersection.]

The intersection of the model with the ray emitted from the camera center through the specified 2D image point gives the 3D location on the face model (figure 4-5). Direct specification of 3D control point positions is also possible, to handle cases where the control points move off the mesh or along silhouettes.

4.3. Search Methods and Distance Metrics

Once a 3D control point is specified, the region of influence and anchor points can be determined. We consider the edges of the face mesh to form an arbitrary tree whose root is the vertex nearest to the specified control point. The GDE influence region and anchor points are determined by searching down the tree using a breadth-first search [Moore 1959]. During traversal, vertices are tested against a distance metric to see whether they fall inside or outside the influence region.

We experimented with two distance metrics, one based on edge depth and the other on Euclidean distance. The edge depth metric marks all vertices within some integer number N of mesh edges as inside the influence region. The Euclidean metric computes the Euclidean distance between each traversed vertex and the control point, and when that distance is below a threshold, the vertex is marked as within the influence region. The threshold is a real number scaled to the mesh coordinate units.

[Figure 4-6: Comparison between edge-based and distance-based search methods. Starting from the same location, the edge-based method selects the lower mouth region while the distance-based method selects the whole mouth region.]

The two metrics serve different purposes. For example, when opening the mouth, the influence regions should be separate in the upper lip area and the lower lip area. In this case, the edge-based metric finds the lower part of the mouth mesh for any control point on the lower lip without affecting the upper mouth region (figure 4-6). In cases where the mesh density is very irregular, for example in the eye regions, the edge metric produces very irregularly shaped influence regions; the distance metric produces regularly shaped influence regions regardless of mesh density variations. In many cases, we find that both metrics produce similar deformations. For large influence regions or very dense meshes, the number of boundary points can become large. In these cases we limit the number of perimeter points to 20, sampled evenly along the boundary vertex set.

4.4. Surface Deformation with Radial Basis Functions

In this section, we refer to a specified GDE control point and its anchor points simply as "feature" points, since the distinction between them is meaningless to the RBF system. As depicted in figure 4-4, each GDE has one RBF system for displacement computation. An RBF system deforms the facial mesh based upon the motions of the feature points. The mapping between the initial and new positions of the feature points is described in terms of the vector coefficients; we compute this mapping at each frame. The remaining vertices in the influence region are then transformed based upon the computed coefficients. The radial basis function approximation equation (B-3) becomes
$x^{target}_i = F(x^{source}_i) = \sum_{j=1}^{N} w_j \sqrt{\|x^{source}_i - x_j\|^2 + s_j^2}$    (4-1)

when the Hardy multi-quadric $h(r) = \sqrt{r^2 + s^2}$ is used as the RBF. When computing the mapping coefficients $w$, the input points are the feature points themselves; when evaluating the new positions, the input points are the points in the influence region. The dimension of $x$ is three (i.e., the x, y, z coordinates of each feature point). $x^{source}$ denotes the initial positions of the feature points and $x^{target}$ the new positions of the feature points. The coefficient $s_j$ is simply determined as suggested by [Eck 1991]:

$s_j = \min_{i \ne j} \| x^{source}_i - x_j \|$    (4-2)

As in equation (3-4), the system solution is given in equation (B-9),

$w = (H + \lambda I)^{-1} x^{target}$    (4-3)

with $\lambda$ set to 0.01. The linear system of equation (4-3) is easily solved by LU decomposition [Press 1992] to obtain the coefficient set $w$. The LU decomposition of the matrix happens only once, at the initialization of the RBF system for each GDE; only a back-substitution is computed for each deformation frame with the new feature point positions $x^{target}$. Thus the deformation computation is fast. Once the system is solved, the deformed positions of the vertices in the mesh influence region are obtained from the computed coefficients using equation (4-1).

[Figure 4-7: Expressions as a collection of geometry deformation elements (GDEs). Two GDEs make a smile, two GDEs raise the eyes, and four GDEs make sadness.]

4.5. Generating Expressions

A geometry deformation element is the smallest unit of surface deformation. One or more deformation elements constitute an expression. For example, two deformation elements make one smile expression (figure 4-7). In this way, a variety of expressions are possible using various combinations of deformation elements. We can control a set of deformation elements with a single parameter d, where d = 0 corresponds to the neutral expression and d = 1 corresponds to the maximum displacements of all the control points of the member deformation elements. The expression can then be animated simply by changing the control parameter over the range [0, 1]. Mouth shapes used for speech synthesis are created and controlled similarly. Mixtures of multiple control parameters are also possible.
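As a concrete sketch of the per-GDE machinery of sections 4.2 and 4.4 (Python with NumPy and SciPy; the class and its members are hypothetical names, not the thesis code), the multi-quadric basis matrix over the feature points is factored once when the GDE is created, and each actuation only back-substitutes for new weights and re-evaluates the influence-region vertices, mirroring equations (4-1) to (4-3).

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

class GDE:
    """One geometry deformation element (a sketch): feat_pts must be ordered as
    [control point, anchor points...]; region_pts are the influenced vertices."""
    def __init__(self, feat_pts, region_pts, lam=0.01):
        self.feats = np.asarray(feat_pts, dtype=float)      # (N, 3)
        self.region = np.asarray(region_pts, dtype=float)   # (M, 3)
        d = np.linalg.norm(self.feats[:, None] - self.feats[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        self.s = d.min(axis=0)                               # s_j, equation (4-2)
        np.fill_diagonal(d, 0.0)
        H = np.sqrt(d ** 2 + self.s[None, :] ** 2)           # multi-quadric basis
        # Factor (H + lam*I) once; only back-substitution is needed per frame.
        self.lu = lu_factor(H + lam * np.eye(len(self.feats)))
        dr = np.linalg.norm(self.region[:, None] - self.feats[None, :], axis=2)
        self.basis = np.sqrt(dr ** 2 + self.s[None, :] ** 2)

    def actuate(self, new_control, anchors):
        """New control point position plus the (stationary) anchor positions
        give the new positions of all vertices in the influence region."""
        targets = np.vstack([new_control, anchors])          # x^target, (N, 3)
        w = lu_solve(self.lu, targets)                       # equation (4-3)
        return self.basis @ w                                # evaluate equation (4-1)
```

A set of such elements could then be driven by a single expression parameter d in [0, 1] by interpolating each control point between its rest and maximum positions, as described in section 4.5.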
[Figure 4-8: Deformation driven by feature points in the video stream. Left: input video frame. Middle: the 3D model with eyebrows up. Right: the 3D model overlaid transparently on the video frame to show the correct alignment of the eyebrows.]

4.6. Tracking System Integration

It is easy to create and control geometry deformation elements manually to produce various expressions: simply grabbing and dragging the facial surface deforms the face. Once a gallery of expressions is constructed, animation across existing expressions can be achieved by key frame interpolation of the expression parameter values. The grabbing and dragging operation can also be performed by a tracking system instead of manually. It would be desirable to automate, or at least semi-automate, the construction of the initial expression database (figure 4-8). The concept of performance-driven facial animation (PDFA) is applied to construct an expression database. In PDFA, a human actor is tracked with a camera while generating facial expressions and mouth shapes. The recorded or on-line video stream is analyzed to extract the motion of salient facial features. These motions then drive the deformation of the face model to produce similar expression animations.

A major difficulty of using PDFA to automatically generate 3D facial animations lies in the ambiguous relationship between the tracked features and the animation mechanism. Given a sparse set of tracked displaced points on the face, estimating the animation parameters that produce the displacements is an inverse problem not easily solved with many existing approaches. In contrast, GDEs can be controlled directly by the feature motion vectors measured in the images. This direct relationship between the tracked feature points and the GDE control points is a major advantage in simplifying a PDFA system.

The steps to automate the generation of facial expressions can be summarized as follows:
1. Video streams are captured containing the subject making various expressions and mouth shapes.
2. Salient control points are identified (manually or automatically) on the subject face(s) and tracked over the expressive sequences.
3. The 2D tracked points from the video streams are treated as GDE control points and converted into 3D points using ray casting.
4. The 3D control point motions are input to the GDEs, where deformations are produced as described in section 4.2.

For the tracking of the feature points and the pose estimation of the head in the video streams, existing work is adapted to suit our purposes [Zhenyun 1997]. Feature tracking and pose estimation methods are likely to produce erroneous results due to analysis errors and non-ideal imaging conditions. Our animation application provides an interactive editing interface that allows the animator to manually correct or override the tracking and pose results to achieve the desired animations.

[Figure 4-9: Sample expressions created by one or more GDEs.]

[Figure 4-10: Transition from neutral to the 'A' mouth shape.]

4.7. Results and Discussion

We create a variety of expressions and mouth shapes by choosing different tracking/control points and different influence regions and directly manipulating these points on the face. Figure 4-9 shows sample expressions. A small modification of the lip region conveys a feeling of dissatisfaction (a) or decisiveness (b). Moving one eyebrow upward suggests cleverness (c). Pulling the lip and eyebrow corners down creates sadness (d). For a sly look, only a small number of vertices are displaced around the lip corner and eyebrow using a small influence zone (e, f). The same impression may also be created by quite different expressions; for instance, facial expressions for anger vary from person to person, or even within the same person (g, h). Sample animation sequences from a neutral state to a full expression (figures 4-10, 4-11) and between two different expressions (figure 4-12) are shown. With the 3D model, inspection from an arbitrary viewpoint is possible, as shown in figures 4-10 and 4-11.

[Figure 4-11: Transition from neutral to an angry face.]

Figure 4-13 shows models generated with the automated process. Without any prior preparation, the same GDE techniques are applied to the subject's 3D model.
Three red points on the eye sockets and the tip of the nose are used for pose estimation, while the motions of the yellow points are used for deformations of the face. With a model of 1954 polygons, real-time (30 Hz) animation is achieved on a moderately configured 500 MHz PC.

[Figure 4-12: Transitions between two expressions.]

[Figure 4-13: Video-driven facial animation.]

Radial basis functions generate smooth surface deformations. However, if a control point is moved too far from its original position, say outside the influence region, large discontinuities occur around the anchor points. Because anchor points are stationary at the boundary of the influence region, no influence of the control point can propagate through the anchor points. Such large control point motions do not occur in practice, and most deformation engines would produce unnatural effects under similar conditions. We eliminate the need for preprocessing by limiting the deformation regions. Currently, we assume that a specified control point can only be moved within its region of influence. As a possible adaptation to allow the control point to move beyond the influence region, one can consider a dynamic tree search to regenerate a larger region of influence with new anchor points.

Our method estimates the required 3D control points using ray casting. However, erroneous results can occur. We allow manual editing of the 3D control points as needed to compensate for such errors. An ultimate solution to this problem may be to impose constraints on the facial surface, such as bone structures. Without such constraints, tracking errors or 3D estimation errors can easily lead the deformation to non-human shapes, imposing a burden on the animators. However, imposing constraints introduces limitations on the use of simple meshes.

4.8. Conclusion

A simple way to deform a 3D face model to create facial animation is described. The GDE method is a novel approach to deforming face models by directly manipulating feature points defined on the surface. The RBF-based computation produces localized, smooth deformations. The approach is applicable to any mesh without special initialization. The process requires minimal and intuitive human intervention and can be automated by the use of a feature tracking system with video streams.

Chapter 5

5. Facial Animation by Expression Cloning (ACM SIGGRAPH [Noh 2001])

5.1. Introduction

[Chapter teaser figure: sample expressions cloned onto Yoda from a model with different geometric proportions and mesh structure. The top row of figure 5-14 shows the source model.]

Facial animation aims at producing expressive and plausible animations of a 3D face model. As mentioned in the previous chapters, some approaches model the anatomy of the face, deriving facial animation from the physical behavior of the bone and muscle structures [Lee 1995][Platt 1981][Waters 1995][Waters 1987]. Others focus only on the surface of the face, using smooth surface deformation mechanisms to create the dominant facial expressions [Guenter 1998][Kalra 1992][Pighin 1998]. These approaches, including our previous work described in chapters 3 and 4, make little use of existing data for the animation of a new model.
Each time a new model is created for animation, method-specific tuning is inevitable or the animation is produced from scratch. Animation parameters do not simply transfer between models. If manual tuning or computational costs are high in creating animations for one model, creating similar animations for new models will take similar effort.

A parametric approach associates the motion of a group of vertices with a specific parameter [Parke 1982]; this manual association must be repeated for models with different mesh structures. Vector-based muscle models place heuristic muscles under the surface of the face [Waters 1995][Waters 1987]. This process is repeated for each new model, and no automatic placement strategy has been reported except for the case where the new model has the same mesh structure. Muscle contraction values are transferable between models only when the models involved are equipped with properly positioned muscles; even then, problems still arise when the muscle structures are inherently different between the two models, e.g., a human and a cat face. A three-layer mass-spring-muscle system requires extensive computation [Lee 1995], and the final computed parameters are only useful for one model. Free-form deformation manipulates control points to create key facial expressions [Kalra 1992], but there is no automatic method for mapping the control points from one model to another. Expression synthesis from photographs can capture accurate geometry as well as textures, with a painstaking model fitting process for each key frame [Pighin 1998]. In practice, animators often sculpt key-frame facial expressions every three to five frames to achieve the best quality animations [Lewis 2000]. Obviously, these fitting or sculpting processes must be repeated for a new model even if the desired expression sequences are similar.

Our goal is to produce facial animations by reusing motion data. Once high-quality facial animations are created for any model by any available mechanism, expression cloning (EC) reuses the dense 3D motion vectors of the vertices of the source model to create similar animations on a new target model. Animations of completely new characters can be based on existing libraries of high-quality animations created for many different models. If the animations of the source are smooth and expressive, the animations of the target model will have the same qualities. Another advantage of EC is the speed of the algorithm: source animations created by computationally intensive physical simulations can be quickly cloned to new target models. After some preprocessing, target model animations are produced in real time, making EC also useful for interactive control of varied target models driven from one generic model, e.g., for text-to-speech applications [Ostermann 1998].

Similar to EC, performance-driven facial animation (PDFA) and MPEG-4 both use measured motion data [Basu 1998][Escher 1998][Guenter 1998][Ostermann 1998][Williams 1990]. In PDFA, 2D or 3D motion vectors are recovered by tracking a live actor in front of a camera to drive the facial animation. With this approach, the quality of the animation depends on the quality of feature tracking and on the correspondences between the observed face and the target model.
MPEG-4 specifies eighty-four feature points. Accurately identifying corresponding feature points is difficult and a daunting manual task, and degraded animation is expected if only a subset of the feature points is identified or tracked. In contrast, EC reuses animations that already contain precise, dense 3D motion data. A sophisticated mechanism identifies dense surface correspondences from a small set of correspondences. For models with typical human facial structure, a completely automated correspondence search is described in section 5.6.

Expression cloning also relates to 3D metamorphosis research, where establishing correspondences between two different shapes is an important issue [Kanai 2000]. Harmonic mapping is a popular approach for recovering dense surface correspondences [Eck 1995]. Difficulty arises, however, when specific points need to be matched between models. For instance, a naive harmonic mapping could easily flip polygons if a user wanted to match the tips of the noses or the lip corners between the source and target models. Proposed methods to overcome this issue include partitioning the models into smaller regions [Kanai 2000] or simplifying the models [Lee 1999] before applying harmonic mapping. A spherical mapping followed by image warping is used in the case of star-shaped models [Kent 1992].
Without the automated search, experiments showed that fifteen to thirty five manually selected vertices were required, depending on the shape and the complexity of the model. Automatic correspondence search bootstraps the whole cloning process and detailed heuristic rales are given in section 5.6. The second step transfers motion vectors from source model vertices to target model vertices, labeled as Motion Transfer in figure 5-1. The magnitude and direction of transferred motion vectors are properly adjusted to account for the local shape of the model. Using the dense correspondences computed in the first step, motion transfers are well defined with linear interpolation using barycentric coordinates. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 5.3. Dense Surface Correspondences Assuming N sparse correspondences are available, dense surface correspondences are computed by volume morphing with Radial Basis Functions (RBF) followed by a cylindrical projection. Volume morphing roughly aligns features of the two models such as eye sockets, nose ridge, lip comers, and chin points. As shown in figure 5-2 a, volume morphing with a small set of initial correspondences does not produce a perfect surface match. A cylindrical projection of the morphed source model onto the target model ensures that all the source model vertices are truly embedded in the target model surface, as shown in figure 5-2 b. See figure 5-11 for more examples. When multi-quadrics is used for an RBF, h(r) = ^ r 2 + s2 , equation (B-3) turns into x m *e‘i = F ( x sourcei) = f iwJ^ \\x sourcei ~ x j I I 2 + s / (5-1) 2=1 This network is trained three times with the 3D coordinates of source correspondences as x so u rc e ,•, and the x, y, or z values of target correspondences as x targe‘i ( i = 1,2,....N ). The distance Sj is measured between Cj and the nearest x t , leading to smaller deformations for widely scattered feature points and larger deformations for closely located points [Eck 1991]. Sj = min || x, - c ■ || (5-2) i*j Given k , the weights w to be computed is given in equation (B-8), w = (H TH + k i y i H Tx ‘^ e,i (5-3) The regularization parameter k is computed by equation (B-10)-(B-14). The iteration is stopped when equation (B-15) converges, i.e. the difference between the previous GCV value (see appendix B) and the current value becomes less than 0.000001. Once all the unknowns are computed, the RBF R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 45 x P’y P’z P. n(nv n 2,n3) Xg the projection center Xp a source model point Xj the intersection point X1 a target model triangle vertex n a target model triangle normal x0,y 0,z0 Figure 5-3 Notations used in equations 5-4,5,6 network smoothly interpolates the non-corresponding points, mapping the source model onto the target model’s shape. After the RBF deformation, each vertex in the source model is projected onto the target model’s surface to ensure a complete surface match. A cylindrical projection centerline is established as a vertical line through the centroid of the head. A ray perpendicular to the projection centerline is passed through each vertex in the source model and intersected with triangles in the target model. The first intersection found is used in cases of multiple valid intersections. Although this could cause a potential problem, visual artifacts are not observed with various models in practice. 
A reason may be that motions are similar for any of the valid intersections due to their regional proximity. Referring to the notations in figure 5-3, the line equation passing through the center of the projection x0 and a point in the source model xp is x = (xp - x Q )t + x0 (5-4) R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 46 Target model ♦ Source model | Points in the source are embedded in the surface I of the target. The opposite is not necessarily true. Figure 5-4 Side view of two models after the projection The plane equation that contains the triangle in the target model is n » { x - x i) = 0 Plugging equation (5-4) into (5-5) and solving for t yield « i ( * i - ^ o ) + » 2 ( T i - T o H ^ O i ~ z o ) (5-5) t = - (5-6) «i(*P - xo) + n2(yP -y o ) + n3(zp - z 0) Then the intersectionx{ is computed with equation (5-4) with t from (5-6) plugged in. To test for intersections within a triangle, compute the barycentric coordinates of the intersection point with respect to the vertices of the target triangle. Computing barycentric coordinates is equivalent to solving a 3 x 3 linear system. (5-7) By a property of barycentric coordinate systems, if 0 < < 1, then the intersection lies inside the triangle. In reality, because of numerical precision limits, we subtract and add 0.005 from zero and one, respectively. '*1 *2 x3 V V Ti T2 T 3 b2 = Ti 3 z2 Z3. p i . _ z l_ R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 47 u Source Model Target Model Direction needs to be adjusted to preserve the motion angle with respect to the local surface. Source Model Target Model Magnitude needs to be adjusted according to the local size variations. Source Motion Target Motion Figure 5-5 Direction and magnitude adjustment of the motion vector 5.4. Animation with Motion Vectors A cloned expression animation displaces each target vertex to match the motion of a corresponding source-model surface point. Since we have dense source motion vectors, linear interpolation with barycentric coordinates is sufficient to determine the motion vectors of the target vertices from the enclosing source triangle vertices. Note that although the source model vertices are embedded in the surface of the target model by the RBF morphing followed by the cylindrical projection, the opposite is not necessarily true (figure 5-4). To obtain the barycentric coordinates needed for motion interpolation, we also project the target model vertices onto the source model triangles. In other words, we do the same operation described in section 5.3, but this time reversing the source and target models. The barycentric coordinates of each target vertex determine both the enclosing source model triangle and the motion interpolation coefficients. Since facial geometry and proportions can vary greatly between models, source motions cannot simply be transferred without adjusting the direction and magnitude of each motion vector. As R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 48 World coordinate system Local coordinate system for (a) an original vertex (b) the same vertex in the deformed model Figure 5-6 Transformation matrix as a means to adjust a motion vector direction shown in figure 5-5, the direction of a source motion vector must be altered to maintain its angle with the local surface when applied to the target model. 
Similarly, the magnitude of a motion vector must be scaled by the local size variations. Examples are shown in figure 5-12. To facilitate motion vector transfer while preserving the relationship with the local surface, a local coordinate system is attached to each vertex in both the original and deformed source model7 . The transformation between these local coordinate systems defines the motion vector direction adjustment (figure 5-6). The local coordinate system is constructed as follows. First, the X-axis is determined by the average of the surface normals of all the polygons sharing a vertex. To ensure continuous normal (X-axis) variations across the surface, a noise filter [Pratt 1991] is applied by averaging neighbor vertex normals. Second, the Y-axis is defined by the projection of any edge connected to the vertex onto the tangent plane whose normal is the just-determined X-axis. Lastly, the Z-axis is the cross product of the X and Y-axes. To obtain the deformed motion vector in' for a given source 7 A deformed source model is the source model after the morphing and projection described in section 5.3 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 49 vector m (figure 5-6), the transformation matrices are computed between the two local coordinate systems and the world coordinate system. Prior to the dot product computation in equation (5-8) and (5-9), each component denoting the direction of X, Y, and Z-axes is normalized. Finally, the transformation matrix is This mapping at each vertex determines the directions of the deformed source model motion vectors given the source model motion vectors. If the source and target face models have similar proportions, the motion vectors may simply be scaled in proportion to the model sizes. However, to preserve the character of animations for models with large geometry differences (e.g. the unusually big ears of Yoda), the magnitude of each motion vector is adjusted by a local scale factor constrained within a global threshold. The local scale at a vertex is determined by a bounding box (BB) around the polygons sharing the vertex. In deforming a source model to fit a target model, the local geometry around a vertex is often scaled and rotated. Rotations are eliminated to facilitate a fair comparison of local scale. The source BB is transformed by the rotation matrix of equation (5-10). For each source model vertex in a BB, we compute its rotated position due to model deformation. x w * X o y w * x o z w * x o w R = xw •J'o y w* y 0 zw my 0 _ X W • z o P w ' Z o z w 9 z o (5-8) xd * xw y<i • xw zd • W n —• — — — — d R= Xd » y w y d » y w zd » y w x d • z w y d z d » z w (5-9) The matrix %R denotes the rotation from a local source vertex coordinates axes to the world coordinate axes, and is the rotation matrix from world axes to the local deformed model axes. (5-10) R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 
If the source and target face models have similar proportions, the motion vectors may simply be scaled in proportion to the model sizes. However, to preserve the character of the animation for models with large geometry differences (e.g., the unusually big ears of Yoda), the magnitude of each motion vector is adjusted by a local scale factor constrained within a global threshold. The local scale at a vertex is determined by a bounding box (BB) around the polygons sharing the vertex. In deforming a source model to fit a target model, the local geometry around a vertex is often scaled and rotated. Rotations are eliminated to allow a fair comparison of local scale: the source BB is transformed by the rotation matrix of equation (5-10), i.e., for each source model vertex in a BB we compute its rotated position due to the model deformation,

$v' = {}^{d}_{o}R\, v$    (5-11)

[Figure 5-7: Local bounding box: (a) original, (b) transformed, (c) deformed.]

The local scale change due to the deformation is the ratio of the rotated source BB and the deformed BB (between b and c in figure 5-7):

$s_{x,y,z} = \dfrac{size_{x,y,z}(\text{DeformedSourceModelLocalBoundingBox})}{size_{x,y,z}(\text{SourceModelLocalBoundingBox})}$    (5-12)

A protrusion or noise in the local geometry (e.g., a bump on the face in either model) can exaggerate the motion vector scaling, making it unnecessarily large or small. One solution is to limit the scale factors by a global threshold such as the standard deviation of all scale factors. Scale factors greater than the standard deviation are discarded and replaced by the results of a noise filter [Pratt 1991] that averages neighboring values. The filter is then applied over the whole face to ensure smooth, continuous scale factors. The transformation matrix that accounts for both the direction and magnitude adjustments of a motion vector is given by

$T = S \; {}^{d}_{o}R$    (5-13)

where $S = \mathrm{diag}(s_x, s_y, s_z)$ from equation (5-12). During animation, the motion vector for each deformed model vertex is obtained by

$m' = T\, m$    (5-14)

where $m$ is the vertex motion of the source model and $m'$ is the vertex motion of the deformed model. Finally, a vertex $v_t$ in the target model is displaced by the following equation,

$m_t = b_1 m'_1 + b_2 m'_2 + b_3 m'_3$    (5-15)

where $b_{1,2,3}$ denote the barycentric coordinates, $m_t$ the target vertex motion vector, and $m'_{1,2,3}$ the motion vectors of the enclosing source triangle.

5.5. Lip Contact Line

Our models have lips that touch at a contact line. This contact line between the upper and lower lips requires special attention: although the vertices are closely positioned, the motion directions are usually opposite for upper and lower lip vertices. Severe visual artifacts occur when a vertex belonging to the lower lip happens to be controlled by an upper lip triangle, or vice versa. Careful alignment of the lip contact lines between the two models is therefore very important; misalignment results in misidentification of the enclosing triangles and subsequent lip vertex motions in the wrong direction.

Specific steps are followed to produce artifact-free mouth animations. First, include all the source-model lip contact line vertices in the initial correspondence set for the RBF morphing step. Since source vertices do not usually coincide with target vertices (figure 5-8 a), it is necessary to compute corresponding points in the target model. Compute the sum of the piecewise distances between the left and right corners of the lip contact line and normalize each length to the range [0, 1] for both models. Corresponding locations on the target lip line are found at the normalized parameters matching those of the source lip-line vertices. Label the vertex parameters of the lip contact line as $s_{1,2,3,\dots}$ and $t_{1,2,3,\dots}$ for the source and target model, respectively (figure 5-8).

[Figure 5-8: Lip contact line alignment. Source and target lip models with upper and lower lips: (a) after morphing, (b) the two lip contact lines aligned.]
If a parameter $s_m$ falls between $t_n$ and $t_{n+1}$, the corresponding 3D coordinate $c$ on the target lip is interpolated by

$c = 3D(t_{n+1}) \cdot \dfrac{s_m - t_n}{t_{n+1} - t_n} + 3D(t_n) \cdot \dfrac{t_{n+1} - s_m}{t_{n+1} - t_n}$    (5-16)

With the above correspondences, the RBF morphing of section 5.3 brings the source lip vertices onto the target model's surface, as shown in figure 5-8 a. Note that there are duplicate vertices at each contact point, one for the upper lip and one for the lower lip. If we perform the cylindrical projection of section 5.3, the duplicate points represented by $t_2$, $t_3$, or $t_4$ in figure 5-8 a will be controlled by upper-lip source-model triangles, since these points are located above the source-model lip contact line. Therefore another step is necessary to completely align the lip contact lines of the two models. Temporarily move the vertices of the target-model lip contact line onto the corresponding source-model lip contact points. These corresponding positions are computed with normalized parameters and equation (5-16), as before, but this time the target vertices are moved onto the source lip contact line as opposed to the source vertices moving onto the target lip contact line. Figure 5-8 b shows the final aligned lip lines.

Two issues are noteworthy. First, there is no actual degradation of the fidelity of the target model from aligning its lip-line vertices with the source model. The lip-line alignment is only temporary, to facilitate determining the enclosing source-model triangles; the original target-model lip-vertex coordinates are used for animation. Second, by manipulating the contact line vertices for alignment, there may be cases where triangles flip if only the vertices on the lip contact line move. We recursively propagate the same displacements in the contact line neighborhood until no more triangle flipping is detected.

The next step determines which vertex at each lip contact point belongs to the upper lip and which to the lower lip, so that each can be assigned to the appropriate enclosing triangle. A naive barycentric coordinate test may indicate both the upper-lip and lower-lip triangles as the enclosing triangles for both points at a lip contact point. We check the neighborhood of each vertex to see whether the neighboring vertices are located above or below the vertex.

Motion-vector transformations also require special attention at the lip contact line. The matrices could easily be different for each of the duplicate vertices at a lip contact point due to their different local neighborhoods; this would cause the two vertices to move to different positions when driven with the same source motion vector. To ensure the same transformation matrices for both vertices at a lip contact point, we consider the upper and lower lips connected. Specifically, the normal computations and local BB comparisons include neighbors from both the upper and lower lips.
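The lip line matching can be sketched as follows (Python with NumPy; hypothetical names). Each lip contact line is parameterized by cumulative arc length normalized to [0, 1], and a source parameter is mapped to a target position by the linear interpolation of equation (5-16).

```python
import numpy as np

def normalized_params(lip_pts):
    """Cumulative arc length along a lip contact line, scaled to [0, 1]."""
    seg = np.linalg.norm(np.diff(lip_pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    return t / t[-1]

def corresponding_point(s_m, t_params, target_pts):
    """Equation (5-16): interpolate the target lip line at source parameter s_m."""
    n = np.searchsorted(t_params, s_m, side='right') - 1
    n = int(np.clip(n, 0, len(t_params) - 2))
    t_n, t_n1 = t_params[n], t_params[n + 1]
    a = (s_m - t_n) / (t_n1 - t_n)
    return target_pts[n + 1] * a + target_pts[n] * (1.0 - a)
```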
5.6. Automated Correspondence Selection

A small set of correspondences is needed for the RBF morphing. Since all other EC steps are fully automated, automatic initial correspondence selection would completely automate expression cloning. Automatic correspondences not only reduce tedious manual selection, but also remove the errors and variations produced by mouse clicking and human judgment.

We present fifteen heuristic rules that identify more than twenty correspondences when applied to most human faces. In some cases, we find that up to ten additional manual correspondences may be added to improve the animation quality. In all cases, an animator can simply edit erroneous automatic correspondences, substituting or adding their own selections.

Orient the face model to look in the positive z-direction, with the y-axis pointing through the top of the head and the x-axis pointing through the right ear. The model is assumed to have a neutral expression initially, with the lips together and the contact line defined by duplicate vertices. For robust behavior during the heuristic correspondence searches, we skip (ignore) degenerate triangles that have one very short edge compared to the other two edges.

Heuristic rules
1. Tip of the nose: find the vertex with the highest z-value.
2. Top of the head: find the vertex with the highest y-value.
3. Right side of the face (right ear): find the vertex with the highest x-value.
4. Left side of the face (left ear): find the vertex with the lowest x-value.
5. Top of the nose (between the two eyes): from the tip of the nose, search upward along the ridge of the nose for the vertex with the local minimum z-value.
6. Left eye socket (near the nose): from the top of the nose, search down the left side of the nose for the vertex with the local minimum z-value.
7. Right eye socket (near the nose): from the top of the nose, search down the right side of the nose for the vertex with the local minimum z-value.
8. Bottom of the nose (top of the furrow): from the tip of the nose, search downward toward the center of the lips until reaching the vertex with the local minimum z-value. The vertex with the biggest angle formed by its two neighbors is the bottom of the nose.
9. Bottom left of the nose: from the tip of the nose, search downward along the left side of the nose until reaching the vertex with the local minimum z-value. The vertex with the biggest angle formed by its two neighbors is the bottom left of the nose.
10. Bottom right of the nose: from the tip of the nose, search downward along the right side of the nose until reaching the vertex with the local minimum z-value. The vertex with the biggest angle formed by its two neighbors is the bottom right of the nose.
11. Lip contact line: find the set of duplicated vertices.
12. Top of the lip: from the center of the upper lip contact line, search upward along the centerline for the vertex with the local maximum z-value.
13. Bottom of the lip: from the center of the lower lip, search downward along the centerline for the vertex with the local minimum z-value after passing the vertex with the local maximum z-value.
14. Chin: from the bottom of the lip, search downward along the centerline for the vertex with the local maximum z-value.
15. Throat: from the chin, search downward along the centerline until reaching the vertex with the local minimum z-value. Along the search, find the two vertices with the two maximum angles; the one with the smaller z-value is the throat (the other should be near the chin point).
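Several of these rules reduce to simple scans over the vertex array. A minimal sketch of rules 1 to 4 and rule 11 (Python with NumPy; illustrative only, and the rules that walk along the mesh surface are omitted):

```python
import numpy as np

def extremal_correspondences(verts):
    """Rules 1-4: extremal vertices of a face mesh oriented with +z forward,
    +y up, and +x toward the right ear (verts is an (N, 3) array)."""
    return {
        'nose_tip':   int(np.argmax(verts[:, 2])),   # rule 1: highest z
        'head_top':   int(np.argmax(verts[:, 1])),   # rule 2: highest y
        'right_side': int(np.argmax(verts[:, 0])),   # rule 3: highest x
        'left_side':  int(np.argmin(verts[:, 0])),   # rule 4: lowest x
    }

def lip_contact_vertices(verts, tol=1e-9):
    """Rule 11: lip contact line vertices are the duplicated vertex positions."""
    rounded = np.round(verts / tol) * tol
    _, inv, counts = np.unique(rounded, axis=0, return_inverse=True, return_counts=True)
    return np.nonzero(counts[inv] > 1)[0]
```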
23 automatically found feature points including 9 lip contact points Figure 5-9 Automated search results 46 motion-capture data points Motion capture data embedded into the source man model Figure 5-10 Motion capture data and its association with the source model The labels given to these points may not be precise and they are not important. We only seek to locate corresponding geometric points in both models. Figure 5-9 shows the correspondences automatically found with the above rules. 5.7. Results and Discussion The specifications of the test models are summarized in table 5-1. The “source man” model is used as the animation source for all the expressions that are cloned onto the other models. Source animations are created by a) an interactive design system for creating facial animations and b) motion capture data embedded into the source man model (figure 5-10). An algorithm similar to [Guenter 1998] is implemented to animate the source model with the motion capture data. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 57 Model Polygons Vertices Source Man 1954 988 Woman 5416 2859 Man 4314 2227 Rick 927 476 Yoda 3740 1945 Cat 5405 2801 Monkey 2334 1227 Dog 927 476 Baby 1253 2300 Table 5-1 Models used for the experiments For expression cloning onto the woman and man models, only the twenty-three correspondences from the automated search are used. This means that the whole EC process is fully automated for these models. The Yoda model has large eyes and ears. We manually add three additional points on each eye socket and two points on each side of the face. The monkey model is handled similarly. The dog and cat model do not have anything close to human face geometry. Twelve and eighteen points are manually selected for the dog and cat, respectively, to replace erroneous automatic search results. Figure 5-11 shows the deformed source models produced to determine dense surface correspondences from these initial sets of points. The deformations closely approximate each target model. For example, the bumps on the Yoda eyebrows are faithfully reproduced on the deformed source model. The source model cheek is also smoothly bulged for the monkey model. The eyes are properly positioned for the man and woman model. Motion vector adjustments are depicted in figure 5-12. The monkey model has different local geometry from the source model. Motions are widely distributed (column 5) and more horizontal (column 2) in the mouth region. Finer geometry of the forehead produces denser but smaller motions (column 3). Figure8 5-14 and 5-15 show sample expressions from cloned animation sequences. Although the models have different geometric proportions and mesh structures, the expressions are well scaled to fit 8 Figure 14 and 1 5 can be found at the end of this chapter. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 58 l I | First row: The source model after the RBF morphing followed by the cylindrical projection. Second i row: Target models. The source model is shown in figure 5-12. Note that although all the source model | vertices are embedded in the target model, different tessellation makes the deformed cat model I wireframe appear different from the source. In general, deformed source models closely reproduce the \ target model features. For example, look at Yoda’s eyebrows and mouth (column 4). Figure 5-11 Deformed models produce dense surface correspondences. each model. 
For instance, the smile and nervous expressions are effectively transferred to the woman model (columns 3 and 4 in figure 5-14). Frown and surprise expressions are shown on the cat model (columns 5 and 6). Moderate intensity expressions cause mostly small motions, and these are sometimes hardly distinguishable from neutral expressions in static images. Exaggerated expressions are tested in figure 5-15. A big round open-mouth source expression creates a rectangular mouth shape for the monkey due to its much longer lip line. An asymmetric mouth shape is reproduced on the target models, and variations arise from differences in the initial target mesh expressions (column 4). The use of human source animations creates many human-like mouth shapes for the dog model rather than expressions more typical of a real dog (last row).

First row: Source model motions. Second row: Monkey model motions. The left four expressions in figure 5-14 are used. The monkey's wide and bulged mouth has more horizontal motions compared to the source model (solid orange circle). Finer geometry of the monkey forehead leads to denser, smaller motions (dotted red circle). Figure 5-12 Adjusted direction and magnitude after the motion vector transfer

Assessing the emotional quality of the expressions produced by EC is clearly subjective, but we can validate the quantitative accuracy of the algorithm by using the "source man" model as both the source and target model. The EC algorithm is applied to find the surface correspondences and adjust the motion vectors to any local geometry variation. Ideally, the target vertex displacement should be identical to that of the source model. Table 5-2 and figure 5-13 show error measures for sample expressions. Starting with the automatically found twenty-three points, an additional ten points are included for this test, three on each eye socket and two on each side of the face. These points produce a more accurate surface match that reduces quantitative errors. The error measure is defined as the size ratio between the position error and the size of the motion vector.

%Error_m = 100 * size(PositionError) / size(MotionVector)    (5-17)

Panels: Angry, Talking, Smiling, Nervous, Surprised. Legend: no displacement error (yellow) through 10% displacement error (red); areas with no motion are shown in blue. The % error is determined by equation (5-17); colors between yellow and red represent values between 0 and 10%. Figure 5-13 Visually depicted displacement errors

Angry    Talking    Smiling    Nervous    Surprised
5.28%    8.56%      4.77%      4.07%      4.56%
Table 5-2 Average errors relative to the motion vector size

Figure 5-13 visually depicts displacement errors such that a vertex with zero error is yellow and a vertex position error one-tenth of its motion vector length (10%) is red. Errors between 0 and 10% are colored by interpolation. Vertices with no motion are colored blue. Figure 5-13 shows that central face areas where most expression motions occur have small errors, and boundary regions generally have higher errors. The larger boundary-area error percentage occurs because motions are relatively small at the boundary, making the denominator in equation (5-17) small. With very small motions, even numerical errors can adversely affect this error measure. Table 5-2 shows the average errors of all the vertices with motions.
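As a concrete illustration of equation (5-17), the following Python fragment computes the per-vertex error ratio and its average over the moving vertices. It is only a sketch under assumed array layouts (cloned, truth, and motion are hypothetical (n, 3) arrays), not the thesis code; the same pattern extends directly to the per-axis, bounding-box-relative measure introduced next.

import numpy as np

def motion_relative_error(cloned, truth, motion):
    """Equation (5-17): position error as a percentage of the motion-vector size.
    cloned, truth : (n, 3) cloned and ground-truth vertex positions for one frame
    motion        : (n, 3) ground-truth motion vectors (displacement from neutral)
    Only vertices that actually move are averaged, as in table 5-2."""
    pos_err = np.linalg.norm(cloned - truth, axis=1)
    motion_len = np.linalg.norm(motion, axis=1)
    moving = motion_len > 1e-8                      # ignore vertices with no motion
    pct = 100.0 * pos_err[moving] / motion_len[moving]
    return pct, pct.mean()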
To better quantify the visual significance of the errors, the position error is also measured relative to an absolute reference, in this case the size of the model.

%Error_x = 100 * size(PositionError) / size_x(FaceRegionBoundingBox)    (5-18)

Note that in this case the error is computed separately along the x, y, and z directions. Table 5-3 indicates that the average errors relative to the size of the model are negligible. Since the motion vectors are dense over the whole face, and their errors are small, visual artifacts are very difficult to perceive, even at high resolutions.

     Angry    Talking    Smiling    Nervous    Surprised
x    0.22%    0.14%      0.13%      0.14%      0.16%
y    0.18%    0.26%      0.16%      0.11%      0.12%
z    0.09%    0.23%      0.06%      0.05%      0.05%
Table 5-3 Average errors relative to the model size

The experiments are performed on a 550 MHz Pentium-III PC. Except for the actual animations, all other processes are performed offline. The automated search takes O(n) to find the tip of the nose, the top of the head, and other extreme points. Once those initial points are found, the search for other points (i.e. the chin) only requires a local search of neighborhood vertices. Therefore, the feature search is fast, taking only a few seconds in our experience. RBF morphing involves solving the eigensystem needed for the regularization parameter and the matrix inversion needed for the weight vectors. The size of the matrix is typically less than 30 x 30, so the morphing is also fast. A naive cylindrical projection to find the correspondence between n source vertices and m target triangles takes O(nm). Even with this brute-force approach, projection takes less than a minute for our models. This time could be reduced by using a smarter search exploiting, for instance, spatial coherence. Unnecessary tests in the back of the head could be prevented by limiting the search to the frontal face. The transformation matrix to adjust the motion vector magnitude and direction is constructed per vertex, O(n). Finally, the actual animation using already-computed barycentric coordinates is performed in real time (>30 Hz) including rendering time.

The manual intervention required for expression cloning is minimal, involving at most the selection of a small set of correspondences. We show that correspondence search can be at least partially automated by a heuristic analysis of the geometry. There are some regions, however, for which geometric descriptions are not practical. For example, locating the boundary of the face and finding detailed eye features appear difficult using only geometry. As an extension, automatic search may be expanded to use textures. Additional rules or methods would help identify a greater set of correspondences [Maurer 1996][Shinagawa 1998]. This could further automate facial animation cloning and reduce quantitative errors. The EC method currently transfers only motion vectors, but it seems possible to include color or texture changes as well [Fidaleo 2000]. Our goal is to easily create quality animations and we assume that dense surface motion vectors are available. However, we also observe that stick figures and cartoons can convey rich expressions from a sparse representation.
Future research could explore how sparse the source data can become without loss of expressive animation quality. The issue may be addressed by locating the points with the most salient information for conveying the animation while the dense data field is algorithmically decimated. This knowledge may be useful for collecting motion capture data, and at that point EC may also be suitable for applications in compression.

Currently, our efforts are focused on transferring exactly the same expressions from a source to targets. It would be useful to add control knobs that amplify or reduce a certain expression on all or part of a face. The control knobs would directly modulate the sizes of the motion vectors. The expression motions could also be transformed to Fourier space where their coefficients could be manipulated [Bruderlin 1995]. It may also be possible to mix the motions of a set of expressions to produce a variety of speech and emotion combinations for any target model. Clearly, the flexibility provided by control knobs could provide varied target animations from just a few source animations. The idea is actually implemented and discussed in more detail in section 5.8.

Tongue and teeth model manipulations are not handled by EC at this point. If the source model includes tongue animation, we believe that the EC technique can generate animations for target tongue models [Cohen 1993][Stone 1991]. Similarly, teeth models can be rotated from source animations providing jaw rotation angles or just motion vectors for the teeth. Finally, assuming an eyeball as a separate model, an eyelid could be treated similarly to the lip contact line, or eyelids could be rotated if the rotation angle is provided.

5.8. Extension 1: Motion Volume Control and Motion Equalizer

Section 5.4 shows how to adjust motion vector sizes while they are transferred from the source model to the target. As suggested, one way is to use local bounding boxes coupled with a global threshold. By considering the model shape variation locally, the mechanism reduces the adverse effects on motion vector scaling caused by global shape variation between the two models. Applying a global threshold enforces a smooth scaling change across the whole face. Although the mechanism produces well-proportioned expression animation on various target models, it may be desirable to provide animators with a means to change the resulting animations for their end animation goal. For example, an EC system equipped with control knobs would allow an animation sequence to be manipulated when cloned onto the target model. This way, diverse target animations become possible from a single source animation. This section delves into the issue of animator-controlled motion vector size manipulation.

A quick and intuitive way of varying motion vector size is to directly influence the vertex displacements determined by the EC system. This direct manipulation could be operated on each vertex, a group of vertices, or the whole face. Figure 5-16 shows various effects on the resulting animation when each vertex displacement is multiplied by constant scaling values. Varied scaling values amplify or reduce the expressiveness of the original expressions to various degrees. This simple operation can be a powerful editing tool, especially when applied to a group of vertices locally instead of to the whole face.
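Since this operation is so simple, it can be stated in a few lines. The following Python fragment is a minimal sketch, assuming a hypothetical (n, 3) array of EC-determined displacements and an optional index array selecting the edited region; it is illustrative rather than the system's actual code.

import numpy as np

def scaled_displacements(displacements, gain, region=None):
    """Motion volume control: scale the EC vertex displacements by a constant
    gain, optionally only inside a selected vertex region."""
    out = displacements.copy()
    idx = slice(None) if region is None else region   # region: array of vertex ids
    out[idx] *= gain                                  # gain > 1 exaggerates, < 1 damps
    return out

# Example: exaggerate only a (hypothetical) mouth region by 1.6x for one frame.
# new_positions = neutral_positions + scaled_displacements(d, 1.6, mouth_ids)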
Panels: source model, x 0.4, x 1.0, x 1.6. Each vertex displacement determined by EC is multiplied by a constant scaling value; a scaling value of 1.0 is equivalent to the default output from the EC system. Figure 5-16 Cloned expressions produced with different scaling values

An interesting way to look at the direct multiplication of the vertex displacements by scaling values is to consider it as a volume control for music. A scaling value is then analogous to a volume gain. Exaggerated expressions correspond to high volumes while reduced expressions correspond to low volumes. This volume control influences the vertex displacements at each frame or spatially. In contrast, an audio equalizer modulates music in the frequency domain. Depending on the various settings of the low frequency (bass) and high frequency (treble) components, the overall feeling that the music conveys varies. Analogously, we attempt to modulate the frequency content of the vertex displacements in an animation sequence just like the audio counterpart. Motion signal processing [Bruderlin 1995], where signals from an articulated body are decomposed into frequency bands and manipulated, motivated this work. Our basic algorithm is the same as that of [Bruderlin 1995]; only the applied signals are different. Instead of joint angles, vertex positions in the face mesh are treated as input signals. Here is the reproduced motion signal filtering algorithm.

The number of frames m determines how many frequency bands fb are used. Let 2^n <= m < 2^(n+1); then fb = n. The B-spline filter kernel of width 5 is w_1 = [c b a b c], where a = 3/8, b = 1/4, c = 1/16. The filter kernel is expanded by inserting zeros, w_2 = [c 0 b 0 a 0 b 0 c], w_3 = [c 0 0 0 b 0 0 0 a 0 0 0 b 0 0 0 c], and so on. Now, steps 1 to 4 are performed simultaneously for each vertex motion signal.

1. Convolve the signal with the kernels to calculate the lowpass sequences of all fb signals. G_0 is the original motion signal and G_fb is the DC or average intensity.

G_{k+1} = w_{k+1} * G_k, or equivalently, G_{k+1}(i) = sum_{m=-2}^{2} w_1(m) G_k(i + 2^k m)    (5-19)

2. Compute the bandpass filter bands,

L_k = G_k - G_{k+1}    (5-20)

3. Multiply the L_k's by their gain values.

4. Reconstruct the motion signal,

G_0 = G_fb + sum_{k=0}^{fb-1} L_k    (5-21)

Our sample animation consists of 1201 frames, yielding fb = 10. The sample animation decomposes into 11 lowpass sequences, G_0 - G_10, and 10 bandpass sequences, L_0 - L_9. Multiplying any of the L_k by an arbitrary gain value before reconstructing the motion signal back to G_0 alters the original animation. Suppose a gain value g is applied to band L_0. Equation 5-21 becomes G_0^new = G_fb + g L_0 + L_1 + L_2 + ... + L_{fb-1}. Using 5-20, expanding and rearranging the equation yields G_0^new = G_0 + (g - 1)(G_0 - G_1). More generally, for a gain value g applied to band L_k,

G_0^new = G_0 + (g - 1)(G_k - G_{k+1})    (5-22)

This equation indicates that the new signal is the sum of the original signal and the difference of the lowpass sequences at consecutive levels multiplied by the gain factor minus one. The more the lowpass signals G differ between the two consecutive levels G_k and G_{k+1}, the more G_0^new is affected. Next, suppose a gain value g is applied to bands L_0, L_1, and L_2. Equation 5-21 then becomes G_0^new = G_fb + g L_0 + g L_1 + g L_2 + L_3 + ... + L_{fb-1}.
Using 5-20, expanding and rearranging the equation yields G_0^new = G_0 + (g - 1)(G_0 - G_3). More generally, for a gain value g applied to m consecutive bands starting from L_k,

G_0^new = G_0 + (g - 1)(G_k - G_{k+m})    (5-23)

The interpretation of this equation is similar to the above, but this time the more the lowpass signals G differ between the levels G_k and G_{k+m}, the more G_0^new is affected. From equations 5-22 and 5-23, it can be seen that G_0^new = G_0 when the gain value equals 1. When the gain value is greater than 1, the difference between the lowpass bands is added to the original signal G_0. This operation somewhat amplifies the motion vector in its original direction. When the gain value is less than 1, the difference between the lowpass bands is subtracted from the original signal G_0. This operation also amplifies the motion vector to some degree, but this time in the opposite direction.

Figure 5-17 shows the source model at two different frames when the gain value is 1, meaning that no motion signal processing happened. Since 10 bandpass sequences are used, L_0 - L_9, bands 10 through 13 are disabled as shown in the picture. In contrast, figure 5-18 shows various effects on the original facial animation depending on the different gain values applied to different frequency bands.

Figure 5-17 Original source expressions at two different frames with all the gain values set to one

The first row shows that the gain value 3 is applied to the bandpass frequency bands 0, 1, and 2. From equation 5-23, it can be seen that the difference between G_0 and G_3 is multiplied by 3 - 1 = 2 and added to the original expression G_0. G_3, which is the three-times-smoothed version of G_0 by the B-spline filter along the temporal domain, is similar to G_0, so the effect is small and not much difference is observed from the original expressions in figure 5-17. The fourth row shows that the gain value -2 is applied to the frequency bands 0, 1, and 2. This time, the difference between G_0 and G_3 is multiplied by 2 + 1 = 3 and subtracted from the original expression G_0. The effect is also small due to the small difference between G_0 and G_3. The second row shows a more salient effect. Compared to the original expressions in figure 5-17, the mouth is much more open due to the relatively big difference between G_3 and G_7. Similarly, the fifth row shows an interesting effect. The difference between G_3 and G_7 is subtracted from the original signal, resulting in a reversal of the motion vector direction. The mouth is firmly closed. The third and sixth rows can be similarly explained. In particular, the comparison of the second expressions shows that the mouth is wide and the lip corners are up in the third row, while the mouth is narrow and the lip corners are down in the sixth row.

First row: gain set to 3 at bands 0, 1, and 2. Second row: gain set to 3 at bands 3, 4, 5, and 6. Third row: gain set to 3 at bands 7, 8, and 9. Fourth row: gain set to -2 at bands 0, 1, and 2. Fifth row: gain set to -2 at bands 3, 4, 5, and 6. Sixth row: gain set to -2 at bands 7, 8, and 9. Figure 5-18 Various expressions generated by applying different gain values. See figure 5-17 for the original expressions.

Although not shown here, much more diverse effects can be obtained by manipulating gain values in different ways.
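The decomposition and gain manipulation of equations (5-19) through (5-23) can be sketched compactly. The following Python fragment is an illustrative re-implementation of the reproduced algorithm applied to one scalar channel (one coordinate of one vertex over time); boundary handling is simplified to wrap-around, which the thesis does not specify, and all names are hypothetical.

import numpy as np

W1 = np.array([1/16, 1/4, 3/8, 1/4, 1/16])       # B-spline kernel [c b a b c]

def lowpass_pyramid(signal, fb):
    """G_0 .. G_fb from equation (5-19): each level convolves the previous one
    with the kernel expanded by inserting zeros (stride 2**k between taps)."""
    G = [np.asarray(signal, dtype=float)]
    for k in range(fb):
        step = 2 ** k
        out = np.zeros_like(G[-1])
        for m in range(-2, 3):                    # the five non-zero taps
            out += W1[m + 2] * np.roll(G[-1], -step * m)
        G.append(out)
    return G

def equalize(signal, gains):
    """Equations (5-20) and (5-21) with per-band gains: bandpass bands
    L_k = G_k - G_{k+1} are scaled and summed back onto the DC level G_fb."""
    fb = len(gains)
    G = lowpass_pyramid(signal, fb)
    return G[fb] + sum(g * (G[k] - G[k + 1]) for k, g in enumerate(gains))

Applying equalize with all gains equal to 1 returns the original signal exactly, because the sum of the bands telescopes; this matches the observation above that G_0^new = G_0 when the gain equals 1.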
For example, setting the gain value to 0 for band L_0 yields G_0^new = G_1. Since the high frequency content of G_0 is removed from the original signal and replaced by the smoothed version G_1, this operation can be employed for the coarticulation effect in speech animation. For a smoother effect, the gain value for L_1 can also be set to 0, yielding G_0^new = G_2. In fact, setting the gain value to 0 is equivalent to fitting a spline curve to each mouth vertex along the temporal axis. Obviously, greater variations can be observed by individualizing the gain values for each bandpass frequency band. In this case, an analytic explanation may not be possible due to its complexity. However, random, non-intuitive facial expressions are possible by setting the gain value for each band arbitrarily.

5.9. Extension 2: Direct Animation with Motion Capture Data

One of the well-known approaches to producing facial animation with motion capture data is to use Guenter's algorithm [Guenter 1998]. In fact, the technique was utilized to create a source animation in section 5.7. However, it entails tedious manual preprocessing to specify feature correspondences between the motion capture data and the 3D model. In addition, the way feature point displacements are propagated to the neighborhood heavily depends on the feature point distribution and the model's shape. If the actor and model do not conform in shape, the method will suffer. Commercial software by Famous3D takes a different approach to generating facial animations with motion capture data. After the initial feature correspondence specification, a region of influence around each motion capture point is determined at the animator's discretion. This added manual intervention eliminates the problem of shape conformation between the actor and 3D model and provides flexibility for the resulting animation. Different animations can result from the same motion capture data depending on the animator's intention.

This section illustrates a mechanism to animate a 3D face model given motion capture data, utilizing the expression cloning technique. The idea is to triangulate the motion capture data to produce a 3D face mesh and to apply the same technique, treating the triangulated motion capture data as a source model. Once the triangulation is done and a source mesh is prepared, animation transfer between the source and target model is straightforward using expression cloning. So the focus in this section is placed on mesh generation from the provided motion capture data. The steps to generate a plausible mesh are as follows.

1. Adjust the marker locations so that the left and right markers along the face centerline become symmetric.
2. Approximate the lip contact line with a Bezier curve.
3. Specify constraints (if any) to consider in the triangulation step.
4. Project the 3D markers onto a 2D plane and triangulate.
5. Split the upper and lower lips from the triangulated mesh.

Step 1: Symmetry of the face

When facial motion data are captured, the markers are manually attached to the actor's face. In general, the initial marker positions are approximately symmetric with respect to the face centerline, but not precisely. This asymmetry of the markers results in an asymmetric mesh if the triangulation is performed on the initial marker positions (figure 5-19).
To produce a nice symmetric mesh, the initial marker positions need to be adjusted. The markers are divided into two groups. Assuming the motion capture data is oriented looking in the positive z-direction, with the y-axis pointing through the top of the head and the x-axis through the right ear, the marker on the tip of the nose is the one with the highest z-value. The markers on the centerline are then determined as the ones with a similar x coordinate to the nose marker; a tiny value e defines what counts as a similar x coordinate. The markers on the left side of the centerline points form one group and the rest form the other group (figure 5-20).

Asymmetric marker positions result in an asymmetric mesh, as shown in the blue circle. Figure 5-19 Asymmetric mesh vs. symmetric mesh

The left hand group, right hand group, and centerline; the point in the middle is the nose. Figure 5-20 Marker grouping configuration

Individual marker correspondence between the two marker groups needs to be found to properly adjust the marker positions. This problem can be cast as an energy minimization problem. After flipping one marker group with respect to the centerline through the nose, the correspondence configuration generating the minimum distance energy among all possible correspondence configurations is the correct individual marker correspondence. More formally, we are looking for a correspondence configuration c such that E_c is minimum, where E_c = || M_L - M_R ||. M_L denotes the left-hand-side marker set, M_R the right-hand-side marker set, and ||.|| the sum of the pairwise Euclidean distances under a given correspondence configuration. The number of such configurations is k!, where k is the number of markers in each set. Obviously, as the number of markers increases, the search space quickly becomes intractable. Since the two marker groups are roughly symmetric, however, sorting the markers by their y coordinate values first and considering a subset of markers at a time can dramatically reduce the search space. In our case, 3 points from the left and 5 points from the right are considered each time, starting from the top of the sorted markers, and the window is shifted down with the unselected 2 points from the right side carried over. Once the desired individual correspondence is found, the initial positions are pairwise averaged. Finally, points on the centerline are also aligned with the nose point. The second picture in figure 5-19 shows the adjusted marker positions for symmetry.
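A simplified version of this correspondence search can be sketched as follows. For brevity, the Python fragment below pairs the mirrored right-hand markers with the left-hand markers greedily by nearest distance rather than with the windowed minimum-energy search described above; array and function names are hypothetical, not the thesis code.

import numpy as np

def symmetrize_markers(left, right, nose_x):
    """Step 1 sketch: mirror the right-hand markers across the centerline,
    pair them with the left-hand markers, and replace each pair by its average."""
    mirrored = right.copy()
    mirrored[:, 0] = 2 * nose_x - mirrored[:, 0]      # flip x about the nose centerline
    unused = list(range(len(mirrored)))
    new_left, new_right = left.copy(), right.copy()
    for i, p in enumerate(left):
        j = min(unused, key=lambda k: np.linalg.norm(mirrored[k] - p))
        unused.remove(j)
        mid = 0.5 * (p + mirrored[j])                 # pairwise average of the pair
        new_left[i] = mid
        new_right[j] = mid
        new_right[j, 0] = 2 * nose_x - mid[0]         # mirror the average back to the right
    return new_left, new_right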
The upper left picture shows a Bezier curve with 4 control points, 2 of them coincident with the lip corners. The user specified 7 points to approximate the lip contact line. The upper right picture shows the resulting mouth mesh, while the bottom right picture shows the mouth mesh without the lip line approximation for comparison. Figure 5-21 Lip contact line approximation using a Bezier curve

Step 2: Lip contact line construction

Most motion capture data do not contain lip contact line points. Since the expression cloning technique assumes a mesh with the upper and lower lips fully defined, the lip contact points need to be artificially created. A way to approximate the arc-shaped lip contact line is to construct a Bezier curve with four control points [Bartels 1987][Blanc 1995]. The user specifies the two lip corners and the number of points to be added along the lip line. Then a nice lip contour is generated (figure 5-21). The added points are displaced every frame for animation as the average of the neighbor points' displacements.

Step 3: Constraint specification (optional)

The triangulation at step 4 is performed to maximize the minimum angle. It achieves the best triangulation in terms of the resulting triangle shapes. However, the user may want to have a specific edge connecting specific points. For example, an edge separating the forehead from the lower part of the face might be desirable. Similarly, an edge representing the nose ridge might also be necessary. Although these edges are automatically produced most of the time by the adopted triangulation method, it can be forced to generate the edges if necessary, by simply inserting a small number of new points between the two end points where the edge is desired. With the inserted points, the distance between points becomes smaller, forcing an edge between them (figure 5-22). The inserted points can be stationary or displaced for animation as the average of the neighbor point displacements.

The left mesh is generated without constraints, yielding a missing nose ridge edge. It can be fixed by changing the projection center or by simply inserting constraint points. The right mesh is produced by inserting a point to force the generation of the nose ridge edge. Figure 5-22 Side view of the meshes generated with and without a constraint

Step 4: Projection and triangulation

Usually, motion capture data is 3-dimensional. Performing a triangulation in 3D space is a difficult task [Edelsbrunner 1992]. Therefore, the 3D marker positions are spherically projected onto a 2D plane prior to the triangulation, for a simpler 2D triangulation. The projection center can be interactively adjusted. In general, skinny triangles cause trouble for animation, so a triangulation containing small angles should be avoided. The Delaunay triangulation [Berg 1997] maximizes the minimum angle, avoiding sharp triangles in the resulting mesh. Triangulations are compared by their smallest angle and the one with the bigger angle is selected. If the minimum angles of two triangulations are identical, the comparison is performed with the second smallest angle, and so on. The connectivity among the points in the 2D plane is maintained when transferred to the 3D points for 3D mesh construction (figure 5-23).

Figure 5-23 Delaunay triangulation performed in 2D space. Figure 5-24 Open mouth after lip contact line split

Since the triangulation is performed without consideration for the orientation of each triangle, the normal orientation might be inconsistent over the surface. For example, some triangles' normals would point outward while others would point inward. For correct rendering, a triangle is flipped if the angle between the triangle normal and the ray from the projection center through the triangle center is greater than 90 degrees.

Step 5: Lip split

The mouth should be split open for correct facial animation.
To open the mouth, the artificially added lip contact line vertices in step 2 are duplicated and one is assigned to the upper triangle while the other is to the lower triangle (figure 5-24). This step completes the mesh generation from the initial 3D markers. The mesh now can serve as a source mesh containing the source animation. Expression cloning algorithm can be applied to various target meshes for animation transfer (figure 5-25). 5.10. Conclusion The concept of expression cloning provides an alternative to creating animations from scratch. We take advantage of the dense 3D data in (possibly painstakingly created) source model animations to produce animations of different models with similar expressions. Cloning can be completely automatic, or animators can easily alter or add correspondences. Cloning effectively hides unintuitive R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 75 Left: source mesh directly generated from the motion capture data Middle and Right: target models Figure 5-25 Expression cloning using the mesh directly generated from the motion capture data low-level parameters from animators while allowing high-level control through correspondence selection. To naive users, selecting a small number of correspondences is likely to be much more intuitive and easier than dealing with muscles or sculpting. Since EC starts with ground truth data spatially (each frame) and temporally (a sequence of frames), the quality of output animation is very predictable. Because animations use pre-computed barycentric weights and transformations to determine the motion vector of each vertex, the method is fast and produces real time animations. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 76 First row: The source model and expressions. Second row through the last row: The cloned expressions. Models have different shapes but expressions are well scaled to fit each model. Figure 5-14 Cloned expressions onto various models. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Figure 5-15 Exaggerated expressions cloned on a wide variety of texture-mapped target models The Yoda model is provided courtesy of Harry Change, http://Avalon.viewpoint.com. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 78 Chapter 6 6. Gesture Driven Facial Animation9 6.1. Introduction Simulating accurate dynamics on a 3D face model is a difficult challenge for artists. Subtle acceleration and deceleration of the skin motion makes it hard to synthesize visually pleasing animation without a considerable amount of manual effort and artistic skill. Typically, the artists rely on mirror observation of their own facial behaviors to carefully reproduce the observed facial expressions and dynamics onto the target 3D model. The resulting animation naturally echoes the creator’s facial characteristics. A method to automate the tedious process and to alleviate the artists’ burden was developed and termed performance driven facial animation (PDFA) [Guenter 1998][William 1990]. PDFA employs a metaphor related to the artists’ practice. An actor sits in front of a camera and performs desired facial gestures. These gestures are algorithmically analyzed to extract facial motions and in turn meaningful animation parameters, which are applied to a 3D face model for automatic animation. 
9 u s e Technical Report [Noh 2002] R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 79 i i (a) conventional PDFA: Direct conversion from the actor’s facial motion to the model’s animation parameter is non intuitive | j \ (b) GDFA: A high level gesture layer interfaces the two mechanisms. No conversion is necessary while local interpretation of predefined associations drives the model animation. Figure 6-1 Comparison between a typical conventional PDFA and our GDFA approach Although PDFA research including our prior attempt in chapter 4 mainly focuses on the development of sensing and animation, another crucial factor deserving more attention is the conversion of the observed 2D/3D facial motions to appropriate facial animation parameters. There is no widely accepted standard interface connecting the sensing and animation mechanisms. Consequently, the conversion depends on the specific sensing and animation techniques employed, requiring the creation of new method-specific interfaces between the two components of each PDFA system. In general, the conversion from an actor’s facial motion to a different model’s given animation parameters is an ill- posed problem. For example, muscle based systems are meaningless when the actor and face model (i.e. cat or dragon) vary drastically in muscle structure. Facial m otion Actor Sensing If lo rp li U t I> M u k J Animation f \Gesuin ' ; ^ G e s t u r e , u r n J MilsGc'i % Facial motion ‘ h b h b i.n i Animation R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 80 This conversion problem is illustrated in figure 6-1 (a). As denoted by the purple question-mark boxes, determining animation parameters for various models from an arbitrary actor is non-intuitive. We pose this problem in conventional PDFA as the absence o f a meaningful interface between the sensing and animation mechanisms. Rather than coupling sensing and animation with low-level signal conversion, as in Figure 6-1(a), we create a high-level meaningful interface between sensing and animation. Figure 6-1(b) illustrates our concept. Our approach relies on an animator associating the actor’s gestures with model gestures created by any facial animation mechanism. This allows animators to specify meaningful conversions between sensed information and the range of possible animation parameters used to animate a model. We refer to our approach as gesture1 0 driven facial animation (GDFA). It maintains the spirit of PDFA in that sensing and analysis provide automatic animation control. A distinguishing factor, however, is the high-level abstraction of the conversion between sensing and animation parameters. Consequently, the constraints of shape conformation and correspondence between the actor and model can be lifted and no method-specific conversion must be devised for each new actor or model or animation method employed. In addition, animations can be as expressive as the artist and animation tools can provide without any direct limitations imposed by the sensing approach. Lastly, GDFA maintains the modularity and independence of system components, allowing use of any sensing or animation methods. 
Although any sensing and animation methods are usable, our system uses a model-based interpolation mechanism for animation [Lewis 2000] and a vision-based gesture classifier for sensing [Noh 2002], Our choices were made to incorporate artistic talents and the artist’s intention into the resulting animation rather than to completely automate the process. Initially, the artist prepares the various 1 0 The term gesture is used to refer to various high-level facial state changes, i.e. eye twitches, visemes, and expressions of emotion. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 81 Video Analysis Emotion Space Diagram Source Model Target Model Gesture Value Interpolation Expression Cloning Figure 6-2 GDFA system architecture gesture target models. The number and shape of the targets are chosen to fit the artist’s end animation goal. Then animations are generated by interpolation of these models based on the actor’s classified gesture states. Flexibility in gesture classification is similarly achieved by defining a gesture space customized to the actor’s unique range of expression. New gestures made during a performance are analyzed with respect to this space. Our classification system is a generalization of the methods in [Fidaleo 2002] applied to facial region analysis. Two examples of this methodology are implemented. In the first, five expressions of emotion are selected and analyzed to define an emotion space. Expression- intensity sequences are used as training data to allow the system to respond to different degrees of a given expression. In the second, the actor’s viseme space is analyzed to drive speech animation. In principle, any other expressions or facial gestures can be used. Among the existing performance driven facial animations, particular attention should be paid to Essa’s work [Essa 1995] and Pighin et. al.’s work [Pighin 1999] that utilize base expression targets for model animation. Essa associates prepared expression models with corresponding images. Normalized correlations from each image are then used to interpolate muscle actuation parameters. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 82 4 - excited alarmed afraid astonished annoyed frustrated angry delighted happy pleased content miserable depressed bored serene calm relaxed tired v sleepy Figure 6-3 Simplified emotion space diagram [Lewis 2000] Similarly, Pighin et. al. utilize base expression models for animation by minimizing the error between the image and the blended 3D model. Morishima also studies emotion space to produce facial animation [Morishima 1995]. These approaches are most related to our work in that animation is based on predetermined rather high-level expressions or emotion targets. Our work is more general and modular, however in that we generalize the conversion process for any sensing and animation mechanisms. 6.2. System Overview As mentioned earlier, any sensing and animation mechanisms abstracting low level parameters into gesture states are usable to build a gesture driven facial animation system. Our system adopts a model-based interpolation and a vision-based gesture classifier [Noh 2002]. The system architecture is shown in figure 6-2. The actor is asked to make facial gestures and the desired 3D models are prepared. Any facial gesture model can be chosen depending on the artist’s intention for subtle or R eproduced with perm ission of the copyright owner. 
Further reproduction prohibited without perm ission. 83 exaggerated resulting animation. If needed, the emotion space diagram is provided as a model- positioning guide (figure 6-3) [Lewis 2000]. After the model preparation, a radial basis interpolation function (RBF) [Powell 1987] trains this configuration with gesture values as inputs and the abstract facial states as outputs. The gesture values are represented as a low dimensional vector according to the chosen parameterization [Russel 1980]. In an extreme case, the facial state could be each vertex position1 1 of the 3D model but we choose to use principal component analysis (PCA) coefficients that characterize the deformation of the model. PCA extracts orthogonal components from the redundantly prepared 3D models by an artist. Once the video sequence is analyzed and gesture values are estimated, the values are fed into the trained interpolation function for the corresponding facial state output. The orange line in figure 6-2 denotes the animation path determined by the facial gesture analysis. The source model is then deformed accordingly and finally expression cloning onto other models occurs as explained in chapter 5. Expression cloning is adopted as an efficient alternative to possible repetition of gesture targets preparation for new models. 6.3. Animation by Model Based Interpolation Our classification method involves a training phase that derives a gesture signature basis from a set of training samples using independent component analysis [Donato 1999], New image samples are transformed into the signature space and classified with respect to the training samples deriving a gesture state vector used by the animation system to drive expressive models1 2 . Two example applications of the gesture classification system are utilized: expression and viseme classification. The two applications differ only in the gesture space defined by the actor and subsequently, the 1 1 Depending on the animation mechanism employed, any parameters can be used, i.e. muscle actuation values. 1 2 Detailed description of the method can be found in [Noh 2002], R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 84 Gestures: = Angry = Surprised e2 = Happy e4 = Afraid Set of 5 facial gestures actuated from a neutral face state (top). Input and pose-normalized input image, (bottom left). Face mask, cropped image, and Laplacian image (bottom right). AA, AE, AH, AO, AW, AY, B, CH, D, DH, EH, ER, EY, F, G, HH, IH, IY, JH, K, L, M, N, NG, OW, OY, P, R, S, SH, T, UH, UW, V, W, Y, Z, ZH Figure 6-4 Sample expressions used to train the system Phoneme set used for viseme training (top). Input image for phoneme UW and mouth mask (center). Masked viseme sample and its Laplacian image (bottom). Figure 6-5 Sample phonemes used to train the system gesture animation space defined by the modeler/animator. The expression training data consists of images of an actor making the facial gestures shown in figure 6-4. Phoneme training data is acquired by prompting the subject to speak 39 words that cover the phoneme space defined in figure 6-5. The images are normalized and preprocessed to reduce the noise from pose and lighting variations and to retain only the relevant information. The classifier interprets facial state without regard for temporal coherence and hence, small jitters in state estimation can arise. 
A better estimate is made by applying a Kalman filter [Maybeck 1997] to the expression and phoneme state vector streams prior to use by the animation system. Then the classified gesture state is locally interpreted by the animation system to produce the corresponding animation. Our system interprets an input gesture state by Radial Basis Function based interpolation. The output is a 3D model facial state described by the chosen parameterization, PCA coefficients in our case.

Surface peaks show dominant expressions. Figure 6-6 Five expression scores for an 800-frame video sequence

Legend: Surprised, Angry, Afraid, Happy, Sad. The model is interpolated based on the analyzed expression values. Figure 6-7 Facial animation driven by expression states

A Gaussian is one possible choice for the basis function, h_j(x_i) = exp(-||x_i - x_j||^2 / s_j^2). Plugging the Gaussian RBF into equation (B-3) yields

F(x_source_i) = sum_{j=1}^{N} w_j exp(-||x_source_i - x_j||^2 / s_j^2)    (6-1)

The variables w_j denote the weights to be computed, N the number of training states, x the input gesture state vector, and F(x) the corresponding facial state vector. Although the Gaussian function kernel size s_j can be optimized algorithmically, we leave it as a user-controllable variable [Lewis 2000]. Given any small value λ, the weight vector w to be computed is given in equation (B-8),

w = (H^T H + λ I)^{-1} H^T x_target_i    (6-2)

The initially prepared gesture targets redundantly share similar geometric shapes, which can be remedied by principal component analysis (PCA) [Jollife 1986]. PCA extracts the most meaningful orthogonal components from the initially prepared models and reduces the number of RBF functions to be trained. Assume that N gesture models S_i are available with v vertices each. The mean shape of these models is

S_bar = (1/N) sum_{i=1}^{N} S_i    (6-3)

leading to the difference shape for each model, dS_i = S_i - S_bar. Construct the covariance matrix C_s = D D^T, where D denotes the matrix with each difference shape as a column, [dS_1 ... dS_N]. The eigenvectors of C_s constitute the principal components that span the gesture model data space orthogonally. A large eigenvalue represents a large variance of the data in the corresponding eigenvector direction. Since the size of C_s can be huge depending on the size of v, compute C_s = D^T D instead, to recover N eigenvectors instead of v. Typically N is much smaller than v, yielding dimension reduction and computational efficiency. Finally, a basis U can be obtained by U = D V_{1...m}, where V_{1...m} = [v_1 ... v_m] is the eigenvector matrix and m < N. Given a new model S, the compact vector representation is obtained by

c = U^T (S - S_bar)    (6-4)

Given a vector representation, the model can be reconstructed by

S = S_bar + U c    (6-5)
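A compact Python sketch of the interpolation and PCA machinery of equations (6-1) through (6-5) is given below. Matrix shapes and function names are assumptions for illustration only, and a single shared kernel width s is used in place of the per-basis s_j of the text.

import numpy as np

def train_rbf(gestures, targets, s, lam=1e-3):
    """Equations (6-1)/(6-2): fit Gaussian RBF weights mapping gesture state
    vectors (N x d) to facial state vectors (N x k, e.g. PCA coefficients)."""
    d2 = ((gestures[:, None, :] - gestures[None, :, :]) ** 2).sum(-1)
    H = np.exp(-d2 / (s ** 2))                          # N x N basis matrix
    return np.linalg.solve(H.T @ H + lam * np.eye(len(H)), H.T @ targets)

def eval_rbf(x, gestures, W, s):
    """Evaluate equation (6-1) for a new gesture state vector x."""
    h = np.exp(-((gestures - x) ** 2).sum(-1) / (s ** 2))
    return h @ W

def pca_basis(shapes, m):
    """Equations (6-3)-(6-5): mean shape, difference matrix D, and a reduced
    basis U via the small N x N eigenproblem described in the text.
    `shapes` is N x 3v (each row a flattened gesture model)."""
    mean = shapes.mean(axis=0)                          # (6-3)
    D = (shapes - mean).T                               # columns are difference shapes
    vals, V = np.linalg.eigh(D.T @ D)                   # N x N instead of 3v x 3v
    order = np.argsort(vals)[::-1][:m]
    U = D @ V[:, order]                                 # lift eigenvectors to vertex space
    U /= np.linalg.norm(U, axis=0)
    return mean, U

# c = U.T @ (S - mean) gives the compact code of (6-4); S ~ mean + U @ c is (6-5).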
6.4. Results

The GDFA system is tested with 5 expressions of emotion: happy, angry, surprised, sad, and afraid (previous work uses a similar set or a subset of these expressions of emotion for animation [Essa 1995][Pighin 1999]). Note that any facial expression would be equally usable. In fact, depending on the artist's end goal, many more expression models could easily be incorporated. Figure 6-6 shows a plot of the expression state vectors over time for a video sequence progressing through sad, angry, surprised, happy, sad, surprised, sad, afraid over 800 frames. Scores are normalized to the range [0,1]. Peak values represent the dominant expression. The graph accurately reflects the transitions between expression states as intended by the subject.

Figure 6-7 shows the actor's classified current expression state, which interpolates the basis models for the 3D face synthesis. The wireframe is the initially prepared model and Yoda is the expression-cloned model. It can be seen that the models and actors do not share the same shape and their geometric proportions are different. In addition, the male actor's happy, afraid, and angry expressions are different from those of the model (i.e., the closed mouth vs. open mouth). However, the 3D model still exhibits the corresponding expressions following the artist's initial preparation.

Figure 6-8 demonstrates 3D visual speech synthesis. Instead of the facial expressions, viseme images are analyzed and the corresponding models are prepared. Nineteen visemes represent thirty-nine phonemes. PCA further extracts the orthogonal components from the viseme models, and only seven coefficients are kept for 99% reconstruction of the model in our case. The correlation between the viseme images and reconstructed models can be observed.

The model is interpolated based on the analyzed 19 viseme values. Figure 6-8 Speech animation driven by viseme states

The movie files contain sample animations demonstrating the actor's expression state mapped onto the 2D emotion diagram as well as onto the prepared wireframe model. The result is in turn expression cloned onto the Yoda and Monkey models. When the expressions are more exaggerated than the original training data, the model starts to be extrapolated. Nonetheless, the deformations are still robust. Also note that the actor's closed-mouth happy, angry, and afraid are still capable of driving the Monkey model's open-mouth happy, angry, and afraid. In contrast, the speech animation shows close similarity between the actor's and the model's mouth shapes. Taking all the viseme states into account at each time step enforces the co-articulation effect on the resulting animation.

6.5. Discussion

Merging of facial expressions and speech animation was not attempted in this work. While simple ad hoc blending might be sufficient, a more systematic way to superimpose an expression over the speech might be to perform motion signal processing [Bruderlin 1995], where the analysis results of the actor's expression and speech sequence would first be decomposed into frequency bands. We speculate that the addition of the corresponding bands would successfully mix the expressions with speech.

The orange line in figure 6-2 denotes the animation path determined by the expression analysis. It may be desirable to modify the animation as post-processing. Spline curve fitting to the path would allow the animator to manipulate the curve, influencing only a local time frame. As an example, the animator could push the path curve away from, or pull it toward, the happy expression at a particular time step. The curve would then be locally adjusted to reflect the modified animation path.
90 The vision based expression classifier outputs normalized expression values in order to drive the facial animation. In principle, any mechanism that can reflect the characteristics of the data in a similar manner could also be employed. For example, audio analysis might produce calm or relaxed emotion for classic music whereas rather excited emotion for rock. Similarly, sizzling weather might yield irritated emotion whereas a cloudy day might represent depression. The degree of these emotion values could easily he interpolated to drive the facial animation. How many gesture targets are enough to cover the whole facial gesture space is still an open question. Our current approach simply leaves the decision to an individual artist. However, further qualitative and quantitative investigation might be necessary to reveal universally applicable gesture sets that sufficiently cover various facial motion trajectories. The inclusion of such components as head rotations, eye blinking, and eyeball movements is not the main focus of this work. However we acknowledge that they are indispensable ingredients for realistic animation. In addition, skin color variation for different facial expressions also seems necessary. The model preparation taking all these factors into account by a professional artist would produce a high quality animation. 6.6. Conclusion Gesture driven facial animation (GDFA) introduces an explicit high-level meaningful layer, called a gesture layer, between sensing and animation. A high-level gesture analysis space is defined by an actor and a corresponding gesture animation space is defined by an animator. Merging the two spaces enables seamless flow of abstract state information from the actor to the model. We have demonstrated the flexibility of the method on facial expressions and mouth shapes. As in performance driven facial animation, GDFA automates the animation creation process. However, GDFA also assures the independence between an actor and 3D face model, obviating the need for R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 91 feature correspondence specification and unintuitive signal conversion from observed facial motions to method specific animation parameters. GDFA is general and modular enough to allow any known sensing and animation techniques to be incorporated as long as low-level parameters can be abstracted to high level gesture states. Our system is built with a vision based gesture classifier and a model based interpolation animation technique. The gesture state drives facial animation by interpolating the artist prepared gesture targets. The expression cloning technique is also employed as an efficient alternative to repeated model preparation for new gesture models. Gesture classification and animation synthesis are developed on separate machines and both achieve real time performance (»30H z) on a PC with Pentium III 550 MHz. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 92 Chapter 7 7. Summary and Future Work Summary My facial animation research started with 2D visual speech synthesis where text inputs are converted to corresponding speech animation. The technique requires initial viseme images with feature correspondences specified across them. The viseme images are warped using the correspondences and blended together based on the dynamics provided by a text to speech module to generate a speech animation sequence. 
The research focus was then shifted to 3D space to overcome the problems inherent to 2D space. For example, in 3D space, a model can be inspected at arbitrary angles and the lighting can be altered if desired. My first approach in 3D facial animation was to build an intuitive face deformation tool. The system allows the user to grab any surface point to deform the face model. It was extended to incorporate tracking data replacing manual animation. Automatic animation control became possible by applying 3D displacements estimated from the tracked 2D features directly to the face surface. All the known facial animation techniques including my own described above try hard to embed animation mechanisms into one chosen model. The efforts are not transferable between models forcing the same process over and over from scratch for each new model created. In contrast, expression cloning (EC) saves previous efforts of computational cost or manual intervention by R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 93 efficiently and semi-automatically transferring animation from one model to another. For cloning, dense surface correspondences between the two models are identified and the motion vectors are accordingly transferred after the proper adjustments reflecting local surface geometry variation. Gesture driven facial animation (GDFA) tackles the problems of conventional performance driven facial animation from a different angle. By separating out sensing and animation and loosely connecting them with a high-level gesture layer, the low level signal conversion problem changes to a simple mapping problem. This high level abstraction eliminates the constraints of shape conformation between the actor and model, the need for feature correspondence specification, and a method specific parameter conversion routine while ensuring the modularity and independence of system components. An example system was built with a model based interpolation method for animation to incorporate artistic talents instead of completely relying on automation. Future Work Three major future research directions unfolded during the development of the aforementioned techniques. The success of parameter mapping between the gesture states and facial states in GDFA motivates the first direction. GDFA demonstrates that the corresponding facial state can be estimated by interpolation from a new gesture state once the training is done with an initial set of corresponding gesture and facial states. Although the parameters sets were the gesture and facial states in GDFA, in principle, any parameters can be used. For example, 2D coordinates of feature points on the image can replace the gesture values from sensing. 3D positions of motion capture data would do as well. Principal component analysis (PCA) coefficients representing an arbitrary mesh’s expressions can be another candidate. This line of thinking is very intriguing in that it might lead to a radical solution for facial animation using 2D/3D motion capture data or might lead to a different form of EC replacing motion vector transfer with simpler parameter mapping. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 94 Similarly, facial states for animation do not have to be the ones describing geometry states. Texture states or, if necessary, even texture coordinates can be similarly encoded for parameter mapping to produce more complete facial animation. 
One problem of dealing with texture images lies in the huge number of pixels. In that case, a multi-resolution approach, using either wavelet or Fourier decomposition to resample the image to a low-resolution version for computation before reconstructing the original image, would provide a viable solution.

The second research direction deals with motion capture data. So far, few character animation films have been made with facial animation driven by motion capture data. Likely reasons are the low animation quality resulting from the small number of markers used and the inconvenience caused by repeated initial setups for every capture session. I believe that using more markers in the future, with the increase of processing power, will easily solve the first issue. The second issue is more challenging. I suggest constructing a facial motion database. Once a large database is constructed, an arbitrary facial motion trajectory may possibly be synthesized instead of repeatedly resorting to capturing data for new motion. The database approach was successfully demonstrated by reordering mouth images [Bregler 1997] in 2D space. Altering motion trajectories [Brand 1999] was also successful for new visual speech synthesis. Employing a similar idea of reordering motion capture data retrieved from the database and changing the motion trajectory would eliminate the trouble caused by repeated motion capture.

The third research direction was motivated by expression cloning (EC). EC makes it meaningful to compile a high quality facial animation library, a concept never addressed before. This naturally introduces the question of what would be the proper form of an animation library. Currently, the animation library is nothing but the 3D vertex positions at each frame. The file size easily becomes unmanageable with an increasing number of frames. Fortunately, the vertex positions are well correlated both spatially and temporally. A compression scheme seems not only necessary but also possible to reduce the size of the animation library. Without knowing much detail about the outcome, my first guess would be applying either MPEG or JPEG style compression techniques to the geometry data, treating each vertex as a pixel in an image. Utilizing PCA would be another idea. A small number of bases with decimated PCA coefficients may be enough to describe a large number of frames containing redundant geometry.

When I first started research on facial animation and read the paper by Lee [Lee 1995], I thought there was no facial animation problem left for me to solve. However, facial animation was not only about deforming a mesh with a mass spring system or simulated muscles after all. The animation transfer problem was there for me to pick up, and as described above, there are still control issues, database issues, and compression issues waiting for a nice solution. I would predict that work involving facial motion capture data will be the next trend in facial animation research.

Bibliography

[Adjoudani 1995] A. Adjoudani, C. Benoit. On the Integration of Auditory and Visual Parameters in an HMM-based ASR. In NATO Advanced Study Institute Speech reading by Man and Machine, 1995 [Akimoto 1993] T. Akimoto, Y. Suenaga, R. Wallace, Automatic creation of 3D facial models.
IEEE computer Graphics and Application, 1993, vol. 13(5), pp. 16-22 [Arad 1994] N. Arad, N. Dyn, D. Reisfeld, Y, Yeshurun, Image Warping by Radial Basis Functions: Application to Facial Expressions, CVGIP: Graphical Models and Image Processing, vol. 56, No. 2, March, 1994, pp. 161-172 [Arai 1996] K. Arai, T. Kurihara, K. Anjyo, Bilinear Interpolation for Facial Expression and Metamorphosis in Real-Time Animation, The Visual Computer, 1996 vol. 12 pp. 105-116 [Azarbayejani 1993] A. Azarbayejani, T. Stamer, B. Horowitz, A. Pentland, 'Visually Controlled Graphics, IEEE Transaction on Pattern Analysis and Machine Intelligence, June 1993, vol. 15, No 6, pp. 602-605 [Bartels 1987] R. Bartels, J.Beatty, B. Barsky, An Introduction to Splines for Computer Graphics and Geometric Modeling, Morgan Kaufmann, 1987 [Basu 1998] S. Basu, N. Oliver, A. Pentland, 3D Modeling and Tracking of Human Lip Motions, ICCV, 1998, 337-343 [Bathe 1982] Klaus-Jurgen Bathe. Finite Element Procedures in Engineering Analysis. Prentice-Hall, 1982 [Beier 1992] T. Beier, S. Neely, Feature-based image metamorphosis, Computer Graphics (Siggraph proceedings 1992), vol. 26, pp. 35-42 [Berg 1997] M. D. Berg, M. V. Kreveld, M. Overmars, O. Schwarzkopf, Computational Geometry, Springer-Verlag, 1997 ISBN 3-540-61270-X [Black 1997] A. Black and P. Taylor, The Festival Speech Synthesis System, University of Edinburgh, 1997 [Blanc 1995] C. Blanc, C. Schlick, X-Splines: A Spline Model Designed for the End-User, Siggraph proceedings, 1995, pp. 377-386 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 97 [Blinn 1978] J. F. Blinn, Simulation of wrinkled surfaces, Siggraph, 1978 pp. 286-292 [Brand 1999] M. Brand, Voice puppetry, Siggraph, 1999 pp. 21-28 [Bregler 1997] C. Bregler, M. Covell, M. Slaney, Video Rewrite: Driving Visual Speech with Audio, Siggraph proceedings, 1997, pp. 353-360 [Brooke 1983] N. Brooke and Q. Summerfield, Analysis, Synthesis, and Perception of visible articulatory movements. Journal of Phonetics, 1983, vol. 11, pp. 63-76 [Browman 1985] C. Browman, L. Goldstein, Dynamic modeling of phonetic structure. In V. Fromkin, editor, Phonetic Linguistics, 1985, pp. 35-53, Academic Press, New York [Bruderlin 1995] A. Bruderlin, L. Williams, Motion Signal Processing, SIGGRAPH 95 Proceedings, 1995, 97-104 [Catmnll 1978] E. Catmull, J. Clark, Recursively generated b-spline surfaces on arbitrary topological meshes, Computer Aided Design, 1978, vol. 10(6), pp. 350-355 [Choe 2001] B.W. Choe, H.S. Ko, Analysis and Synthesis of Facial Expressions with Hand-Generated Muscle Actuation Basis, Proceedings of Computer Animation 2001, November 2001 [CMU] CMUDictionary http://www.speech.cs.cmu.edu/cgi-bin/cmudict [Cohen 1993] M.M. Cohen, D.W. Massaro, Modeling Coarticulation in Synthetic Visual Speech, in Models and Techniques in Computer Animation, M. Magnenat-Thalmann and D. Thalmann (eds.), Tokyo, 1993, Springer Verlag [Cohen 1990] M. Cohen and D. Massaro, Synthesis of visible speech. Behavior Research Methods, Instruments & Computers, 1990, vol. 22(2), pp. 260-263 [Coquillart 1990] S. Coquillart, Extended Free-Form Deformation: A Sculpturing Tool for 3D Geometric Modeling, Computer Graphics, 1990, vol. 24, pp. 187 - 193 [Cosatto 1998] E. Cosatto, H. P. Graf, Sample-Based Synthesis of Photo-Realistic Talking Heads, In proceedings of Computer Animation 1998, pp. 103-110 [Darrell 1993] T. Darrell, A. Pentland, Space-time gestures. 
In Computer Vision and Pattern Recognition, 1993 [DeCarlo 2000] D. DeCarlo, D. Metaxas, Optical Flow Constraints on Deformable Models with Applications to Face Tracking, IJCV, July 2000, 38(2), pp. 99-127 [DeCarlo 1998] D. DeCarlo, D. Metasas and M. Stone, An Anthropometric Face Model using Variational Technique, 1998, Siggraph proceedings, pp. 67-74 [Derose 1998] T. Derose, M. Kass, T. Truong, Subdivision Surfaces in Character Animation, Siggraph proceedings, 1998, pp. 85-94 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 98 [Dipaola 1991] S. Dipaola, Extending the range of facial types, The Journals of Visualization and Computer Animation, 1991, vol 2(4), pp. 129-131 [Donato 1999] G. Donato, M. Bartlett, J. Hager, P. Ekman, T. Sejnowski, Classifying Facial Actions, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, No. 10, October 1999 [Eck 1995] M. Eck, T. DeRose, T. Duchamp, Multiresolution Analysis of Arbitrary Meshes, Siggraph proceedings, 1995, 173-182 [Eck 1991] M. Eck, Interpolation Methods for Reconstruction of 3D Surfaces from Sequences of Planar Slices, CAD und Computergraphik, Vol. 13, No. 5, Feb. 1991, 109 - 120 [Edelsbmnner 1992] H. Edelsbrunner, E. Mucke, Three-dimensional alpha shapes, Proceedings of the Workshop on Volume Visualization 1992, pp. 75-82 [Ekman 1978] P. Ekman, W. V. Friesen, Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA, 1978 [Eisert 1998] P. Eisert and B. Girod, Analyzing Facial Expressions for Virtual Conferencing, IEEE, Computer Graphics and Applications, 1998, vol. 18, no. 5, pp. 70-78 [Essa 1996] I. A. Essa, S. Basu, T. Darrell, A. Pentland, Modeling, Tracking and Interactive Animation of Faces and Heads using Input from Video, Proceedings of Computer Animation June 1996 Conference, Geneva, Switzerland, IEEE Computer Society Press [Essa 1995] I. A. Essa, Analysis, Interpretation, and Synthesis of Facial Expressions , PH.D. Thesis, MIT, 1995 [Essa 1994] I. A. Essa, T. Darrell, A. Pentland, Tracking Facial Motion, Proceedings of the IEEE Workshop on Non-rigid and Articulate Motion, Austin, Texas, November, 1994 [Ezzat 1998] T. Ezzat, T. Poggio, Mike Talk: A Talking Facial Display Based on Morphing Visemes, In proceedings of Computer Animation 1998, pp. 96-102 [Farkas 1994] L. Farkas, Anthropemetry of the Head and Face, Raven Press, 1994 [Fidaleo 2002] D. Fidaleo, U. Neumann, Co-articulation Region Analysis for Control of 2D Faces, IEEE Computer Animation Proceedings, 2002. [Fidaleo 2000] Classification and Volume Morphing for Performance-Driven Facial Animation, D. Fidaleo, J-Y Noh, T. Kim, R. Enciso, U.Neumann, Digital and Computational Video (DCV) 2000 [Franke 1982] R. Franke, Scattered Data Interpolation: Tests of Some Method, Math. Comp., 1982, 38(5): 181-200 [Gao 1998] L. Gao, Y. Mukaigawa, Y. Ohta, Synthesis of Facial Images with Lip Motion from Several Real Views, In proceedings of Automatic Face and Gesture Recognition, 1998, pp. 181-186 [Gleicher 1998] M. Gleicher, Retargetting Motion to New Characters, Siggraph proceedings, 1998, 3 3 -4 2 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 99 [Golub 1979] G.H. Golub, M. Heath, G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223, 1979 [Gortler 1995] S. Gortler, M. 
Cohen, Hierarchical and variational geometric modeling with wavelets, Symposium on Interactive 3D Graphics, 1995, pp. 35-42 [Guenter 1998] B. Guenter, C. Grimm, D. Wood, H. Malvar, F. Pighin, Making Faces, Siggraph proceedings, 1998, 55 - 66 [Guenter 1992] B. Guenter, A system for simulating human facial expression. In State of the Art in Computer Animation, 1992, 191-202 [Guibas 1992] L.Guibas, D. Knuth, M. Sharir, Randomized incremental construction of delaunay and voronoi diagrams, Algorithmica 7 (4) 1992, pp. 381-413 [Hardy 1971] R.L. Hardy, Multiquadric Equations of Topography and Other Irregular Surfaces, J. Geophys, Res, 1971, 76:1905-1915 [Horn 1989] B.K.P. Horn, and M.J. Brooks, (Eds.), Shape from Shading, Cambridge: MIT Press 1989. ISBN 0-262-08159-8. [Horn 1981] B. K. P. Horn, B. G. Schunck, Determining optical flow. Artificial Intelligence, 1981, vol. 17, pp. 185-203 [Jollife 1986] I.T. Jollife, Principal Component Analysis, Springer-Verlag, New York, 1986 [Kajiwara 1993] S. Kajiwara, H. Tanaka, Y. Kitamura, J. Ohya, F. Kishino, Time-Varying Homotopy and the Animation of Facial Expression for 3D Virtual Space teleconferencing, SPIE, 1993, vol. 2094/37 [Kalra 1994] P. Kalra, N. Magnenat-Thanmann, Modeling of Vascular Expressions in Facial Animation, Computer Animation, 1994, pp. 50 -58 [Kalra 1992] P. Kalra, A. Mangili, N. M. Thalmann, D. Thalmann, Simulation of Facial Muscle Actions Based on Rational Free From Deformations, Eurographics 1992, vol. 11(3), pp. 59-69 [Kanai 2000] T. Kanai, H. Suzuki, F. Kimura, Metamorphosis of Arbitrary Triangular Meshes, Computer Graphics and Applications, March 2000, 62-75 [Kass 1987] M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models. International Journal of Computer Vision, 1987, vol. 1(4), pp. 321-331 [Kato 1992] M. Kato, I. So, Y. Hishinuma, O. Nakamura, T. Minami, Description and Synthesis of Facial Expressions based on Isodensity Maps, In L. Tosiyasu (Ed.), Visual Computing, Springer- Verlag, Tokyo, 1992, pp. 39-56 [Kelso 1985] J. Keslo, E. Vatikiotis-Bateson, E. Saltzman, B. Kay, A qualitative dynamic analysis of reiterant speech production: Phase portraits, kinematics, and dynamic modeling. J. Acoust. Soc. Am, 1985, vol. 1(77), pp. 266-288 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 100 [Kent 1992] J. R. Kent, W.E. Carlson, R.E. Parent, Shape Transformation for Polyhedral Objects, Siggraph proceedings, 1992, 47-54 [Kent 1977] R. D. Kent and F. D. Minifxe, Coarticulation in recent speech production models, Journal of Phonetics, 1977, vol. 5 pp. 115 -135 [Kishino 1994] F. Kishino, Virtual Space Teleconferencing System - Real Time Detection and Reproduction of Fluman Images, Proc. Imagina 1994, 109-118 [Koch 1996] R.M. Koch, M.H. Gross, F.R. Carls, D.F. von Buren, G. Fankhauser, Y.I.H. Parish, Simulating Facial Surgery Using Finite Element Models, Siggraph proceedings, 1996, pp. 421 -428 [Komatsu 1989] K. Komatsu, Surface model of face for animation, Trans. IPSJ, 30. 1989 [Koufakis 1999] I. Koufakis, B.F. Buxton, Very low bit rate face video compression using linear combination of 2D face view and principal components analysis, Image and Vision Computing 17, 1999, pp. 1031-1051 [Kuo 1997] C. J. Kuo, R. S. Huang, T. G. Lin, Synthesizing Lateral Face from Frontal Facial Image Using Anthropometric Estimation, proceedings of International Conference on Image Processing, 1997, Vol. 1 , pp. 133 -136 [Lee 1999] A. W. F. Lee, D. Dobkin, W. Sweldens, P. 
Schroder, Multiresolution Mesh Morphing, Siggraph proceedings, 1999, 343-350 [Lee 1995] Y. C. Lee, D. Terzopoulos, K. Waters. Realistic face modeling for animation. Siggraph proceedings, 1995, pp. 55-62 [Lewis 2000] J.P. Lewis, M. Cordner, N. Fong, Pose Space Deformation: A Unified Approach to Shape Interpolation and Skeleton-Drive Deformation, Siggraph proceedings, 2000, 165-172 [Lewis 1987] J. P. Lewis, F. I. Parke. Automated lipsynch and speech synthesis for character animation. In Proceedings Human Factors in Computing Systems and Graphics Interface 1987, pp. 143-147 [Li 1993] H. Li, P. Roivainen, R. Forchheimer, 3-D Motion Estimation in Model Based Facial Image Coding, IEEE Transaction on Pattern Analysis and Machine Intelligence, June 1993, vol. 15, No 6, pp. 545-555 [Marigny 1996] T. Guiard-Marigny, N. Tsingos, A. Adjoudani, C. Benoit, M. P. Gascuel, 3D Models of the Lips for Realistic Speech Animation, IEEE proceedings of Computer Animation, 1996, pp. 80- 89 [Masse 1990] K. Masse, A. Pentland, Automatic Lip reading by Computer, Trans. Inst. Elec., Info. And Comm. Eng. 1990. Vol. J73-D-II, No.6. pp.796-803 [Maurer 1996] T. Maurer, C. von der Malsburg, Tracking and learning graphs of image sequence of faces, In Proceedings of International Conference on Artificial Neural Networks, Bochum, Germany, 1996 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 101 [Maybeck 1997] P.S. Maybeck, Stochastic Models Estimation and Control, 1997, Academic press, ISBN 0124807038 [Moghaddam 1994] B. Moghaddam, A. Pentland, Face Recognition using View-Based and Modular Eigenspaces, In Automatic Systems for the Identification and Inspection of Humans, SPIE, 1994 [Moody 1992] J.E. Moody, The Effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, Neural Information Processing Systems 4, 847-854, Morgan Kaufmann, California, 1992 [Moore 1959] Edward F. Moore. The shortest path through a maze, In Proceedings of the International Symposium on the Theory of Switching, Harvard University Press, 1959, 285 - 292 [Morishima 1995] S. Morishima, Synthesis of facial expression using emotion model, Proceedings of the sixth Western Computer Graphics Symposium, March 1995, pp. 96-103 [Morishima 1990] S. Morishima, K. Aizawa, H. Harashima, A real-time facial action image synthesis system driven by speech and text. SPIE Visual Communications and Image Processing, 1360:1151- 1157, 1990 [Moubaraki 1996] L. Moubaraki, J. Ohya, Realistic 3D Mouth Animation Using a Minimal Number of Parameters, IEEE International Workshop on Robot and Human Communication, 1996 pp. 201- 206 [Moubaraki 1995] L. Moubaraki, J. Ohya, F. Kishino, Realistic 3D Facial Animation in Virtual Space Teleconferencing, 4th IEEE International workshop on Robot and Human Communication, 1995, pp. 253-258 [Moubaraki 1994] L. Moubaraki, H. Tanaka, Y. Kitamnra, J. Ohya, F. Kishino, Homotopy-Based 3D Animation of Facial Expression Technical Report of IEICE, IE 94-37, 1994 [Noh 2002] J.Y. Noh, D. Fidaleo, U. Neumann, Gesture Driven Facial Animation, USC Technical Report 02-761, 2002 [Noh 2001] J.Y. Noh, U. Neumann, Expression Cloning, ACM SIGGRAPH, 2001 pages 277-288 [Noh 2000A] J.Y. Noh, D. Fidaleo, U. Neumann, Animated Deformations with Radial Basis Functions, ACM Virtual Reality and Software Technology (VRST), 2000, pages 166-174 [Noh 2000B] J.Y. Noh, U. 
Neumann, Talking Face, IEEE International Conference on Multimedia and Expo (ICME), 2000, volume2, pages 627-630 [Noh 1998] J.Y. Noh, U. Neumann, A Survey of Facial Modeling and Animation Techniques, USC Technical Report 99-705, 1998 [Ohya 1995] J. Ohya, Y. Kitamura, H. Takemura, H. Ishi, F. Kishino, N. Terashima, Virtual Space Teleconferencing: Real-Time Reproduction of 3D Human Images", Journal of Visual Communications an Image Representation, 1995, vol. 6, No.l, March, pp. 1-25 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 102 [Oka 1987] M. Oka, K. Tsutsui, A. ohba, Y. Jurauchi, T. Tago, Real-time manipulation of texture- mapped surfaces. In Siggraph 21, 1987, pp. 181-188. ACM Computer Graphics [Orr 1998] M. J. L. Orr, Optimizing the Widths of RBFs, Fifth Brazilian Symposium on Neural Networks, Brazil, 1998 [Ostermann 1998] J. Ostermann, Animation of Synthetic Faces in MPEG-4, IEEE Computer Animation, 1998, 49 - 55 [Overveld 1993] C. W. A. M. van Overveld and B. Wyvill, Potentials, polygons and penguins: An adaptive algorithm for triangulating and equi-potential surface, 1993 [Pandzic 1994] I. S. Pandzic, P. Kalra, N. Magnenat-Thalmann, Real time Facial Interaction, Displays (Butterworth-Heinemann), Vol. 15, No. 3, 1994 [Parke 1996] F. I. Parke, K. Waters, Computer Facial Animation, 1996, ISBN 1-56881-014-8 [Parke 1991] F. I. Parke, Control parameterization for facial animation. In N. Magnenat-Thalmann, D. Thalmann, editors, Computer Animation 1991, pp. 3-14, Springer-Verlag [Parke 1982] F. I. Parke, Parameterized models for facial animation. IEEE Computer Graphics and Applications, 1982, vol. 2(9) 61 - 68 [Parke 1974] F. I. Parke, A Parametric Model for Human Faces, Ph.D. Thesis, University of Utah, Salt Lake City, Utah, 1974, UTEC-CSc-75-047 [Parke 1972] F. I. Parke, Computer Generated Animation of Faces. Proc. ACM annual conf., 1972 [Patel 1992] M. Patel (1992), FACES, Technical Report 92-55 (Ph.D. Thesis), University of Bath, 1992 [Patterson 1991] E. C. Patterson, P. C. Litwinowicz, N. Greene, Facial Animation by Spatial Mapping, Proc. Computer Animation 1991, N. Magnenat-Thalmann, D. Thalmann (Eds.), Springer- Verlag, pp. 31-44 [Pearce 1986] A. Pearce, G. Wyvill, D. Hill, Speech and expression: A Computer solution to face animation. Proceedings of Graphics Interface 1986, Vision Interface 1986, pp. 136-140 [Pelachaud 1994] C. Pelachaud, C.W.A.M. van Overveld, C. Seah, Modeling and Animating the Human Tongue during Speech Production, IEEE, Proceedings of Computer Animation, 1994, pp. 40- 49 [Pelachaud 1991] C. Pelachaud, N. Badler, M. Steedman, Linguistic issues in facial animation. InN. Magnenat-Thalmann, D. Thalmann, editors, Proceedings o f Computer Animation 1991, pp. 15-29, Tokyo, Springer-Verlag [Penrose 1955] R. Penrose. A Generalized Inverse for Matrices, Proc. Cambridge Philos., Soc., 51:406-413, 1955 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 103 [Pentland 1994] A. Pentland, B. Moghaddam, T. Stamer, View-Based and Modular Eigenspaces for Face Recognition. In Computer Vision and Pattern Recognition Conference, 1994, pp. 84-91. IEEE Computer Society, 1994 [Pemg 1998] W. Pemg, Y. Wu, M. Ouhyoung, Image Talk: A Real Time Synthetic Talking Head Using One Single Image with Chinese Text-To-Speech Capability. IEEE 1998, pp. 140-148 [Pieper 1992] S. Pieper, J. Rosen, and D. 
Zeltzer, Interactive Graphics for plastic surgery: A task level analysis and implementation. Computer Graphics, Special Issue: ACM Siggraph, 1992 Symposium on Interactive 3D Graphics, 127-134 [Pighin 1999] F. Pighin, R. Szeliski, D. Salesin, Resynthesizing Facial Animation through 3D Model- Based Tracking, International Conference on Computer Vision, 1999 [Pighin 1998] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, D. H. Salesin, Synthesizing Realistic Facial Expressions from Photographs, Siggraph proceedings, 1998, 75-84 [Platt 1985] S. M. Platt, A Structural Model of the Human Face, Ph.D. Thesis, University of Pennsylvania, 1985 [Platt 1981] S. Platt, N. Badler, Animating facial expression. Computer Graphics, 1981, vol. 15(3) pp. 245-252 [Pratt 1991] W. Pratt, Digital Image Processing, Second Edition, A Wiley-Interscience Publication, ISBN 0-471-85766-1, 1991 [Poggio 1989] T. Poggio, F. Giros, A theory of networks for approximation and learning. Technical Report A.I. Memo No. 1140, Artificial Intelligence Lab, MIT, Cambridge, MA, July 1989 [Powell 1987] M. J. D. Powell, Radial basis functions for multivariate interpolation: a review. In J.C. Mason and M.G. Cox, editors, Algorithms for Approximation, Clarendon Press, Oxford, 1987 [Press 1992] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C, Cambridge University Press, ISBN 0-521-43108-5 [Russel 1980] J.A. Russel, A Circomplex Model of Affect, Journal of Personality and Social Psychology, vol. 39, pp. 1161-1178, 1980 [Saintourens 1990] M. Saintourens, M-H. Tramus, H. Huitric, and M. Nahas. Creation of a synthetic face speaking in real time with a synthetic voice. In Proceedings of the ETRW on Speech Synthesis, pp. 249 - 252, Grenoble, France, 1990. ESCA [Saji 1992] H. Saji, H. Hioki, Y. Shinagawa, K. Yoshida, T. Junii, Extraction of 3D Shapes from the Moving Human Face using Lighting Switch Photometry, in N. Magnenat-Thalmann, D. Thanlmann (Ed.), Creating and Animating the Virtual World, Springer— Verlag Tokyo 1992, pp. 69-86 [Saulnier 1995] A. Saulnier, M. L. Viaud, D. Geldreich, Real-time facial analysis and synthesis chain. In International Workshop on Automatic Face and Gesture Recognition, 1995, pp. 86-91, Zurich, Switzerland, Editor, M. Bichsel R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 104 [Sederberg 1996] T. W. Sederberg, S. R. Parry, Free-Form deformation of solid geometry models, Computer Graphics (Siggraph 1996), vol. 20(4), pp. 151 - 160 [Sera 1996] H. Sera, S. Morishma, D. Terzopoulos, Physics-based Muscle Model for Moth Shape Control, IEEE International Workshop on Robot and Human Communication, 1996, pp. 207-212 [Shewchuk 1996] J. Shewchuk, Triangle: Engineering a 2D quality mesh generator and delaunay triangulation, in Proceedings of the First Workshop of Applied Computational Geometry, ACM, 1996, pp. 124-133 [Shinagawa 1998] Y. Shinagawa, T. L. Kunii, Unconstrained Automatic Image Matching Using Multiresolution Critical-Point Filters, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 9, 1998, 994-1010 [Singh 1998] K. Singh, E. Fiume, Wires: A Geometric Deformation Technique, Siggraph proceedings, 1998, pp. 405 - 414 [Stone 1991] M. Stone, Toward a model of three-dimensional tongue movement. Journal of Phonetics, 1991, vol. 19, pp. 309-320 [Strub 1995] L. Strub, et al, Automatic facial conformation for model-based videophone coding, IEEE ICIP, 1995 [Terzopoulos 1993] D. Terzopoulos, R. 
Szeliski, Tracking with Kalman snakes, In A. Blake and A. Yuille, editors, Active Vision, 1993, pp. 3-20. MIT Press [Terzopoulos 1991] D. Terzopouls, K. Waters, Techniques for Realistic Facial Modeling and Animation, Proc. Computer Animation 1991, Geneva, Switzerland, Springer-Verlad, Tokyo, pp. 59- 74 [Terzopoulos 1990] D. Terzopoulos and K. Waters, Physically-based facial modeling, analysis, and animation. J. of Visualization and Computer Animation, March, 1990, vol. 1(4), pp. 73-80 [Terzopoulos 1988] D. Terzopoulos, K. Fleisher (1988), Modeling Inelastic Deformation: Visco elasticity, Plasticity, Fracture, Computer Graphics, Proc. Siggraph 1988, Vol. 22, No. 4, pp. 269-278 [Thalmann 1996] N. Magnenat-Thalmann, D. Thalmann Editors, Interactive Computer Animation, Prentice Hall, 1996, ISBN 0-13-518309-X [Thalmann 1993] N. Magnenat-Thalmann, A. Cazedevals, D. Thalmann, Modeling Facial Communication Between an Animator and a Synthetic Actor in Real Time, Proc. Modeling in Computer Graphics, Genova, Italy, June 1993, (Eds. B. Falcidieno and L. Kunii), pp. 387-396. [Thalmann 1988] N. Magnenat-Thalmann, N. E. Primeau, D. Thalmann, Abstract muscle actions procedures for human face animation. Visual Computer, 1988, vol. 3(5), pp. 290-297 [Thalmann 1987] N. Magnenat-Thalmann, D. Thalmann. The direction of synthetic actors in the film rendez-vous a montreal. IEEE Computer Graphics and Applications, 1987, pp. 9-19 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 105 [Thikhonov 1977] A.N. Tikhonov and V.Y. Arsenin, Solution of Ill-Posed Problems and The Regularization Method, Soviet Math. Dokl., 1963,4:1035-1038 [Ulgen 1997] F. Ulgen, A step Toward universal facial animation via volume morphing, 6th IEEE International Workshop on Robot and Human communication, 1997, pp. 358-363 [Viad 1992] M. L. Viad and H. Yahia, Facial animation with wrinkles. In D. Forsey and G. Hegron, editors, Proceedings of the Third Eurographics Workshop on Animation and Simulation, 1992 [Wang 1994] C. L. Y. Wang, D. R. Forsey, Langwidere: A New Facial Animation System, proceedings of Computer Animation, 1994, pp. 59-68 [Waters 1995] K. Waters, J. Frisbie, A Coordinated Muscle Model for Speech Animation, Graphics Interface, 1995 pp. 163 - 170 [Waters 1993] K. Waters, T. M. Levergood, Decface: An Automatic Lip-Synchronization Algorithm for Synthetic Faces, 1993, DEC. Cambridge Research Laboratory Technical Report Series [Waters 1991] K. Waters, S. Terzopoulos, Modeling and Animating Faces using Scanned Data, Journal of Visualization and Computer Animation, 1991, Vol. 2, No. 4, pp. 123-128 [Waters 1987] K. Waters. A muscle model for animating three-dimensional facial expression. In Maureen C. Stone, editor, Computer Graphics (Siggraph proceedings, 1987) vol. 21 pp. 17-24 [Welch 1992] W. Welch, A. Witkin, Variational surface modeling. Siggraph proceedings, 1992 pp. 157-166 [Williams 1990] L. Williams, Performance Driven Facial Animation, Siggraph proceedings, 1990, 235 - 242 [Wu 1994] Y. Wu, N. Magnenat-Thalmann, D. Thalmann, A Plastic-Visco-Elastic Model for Wrinkles in Facial Animation and Skin Aging, Proc. 2nd Pacific Conference on Computer Graphics and Applications, Pacific Graphics, 1994 [Wyvill 1986] G. Wyvill, C. McPheeters, B. Wyvill, Data structure for Soft Objects. The Visual Computer, 1986, vol. 2(4), pp. 227-234 [Xu 1990] G. Xu et al., Three-dimensional Face Modeling for virtual space teleconferencing systems, Trans. IEICE, E73, 1990 [You 1996] S. You, Y. 
Zhang, Z. Pen, G. Xu, "A Multi-Pose Face Recognition System", Journal of Software, 11, 1996. [Zhenyun 1997] P. Zhenyun, Suya You and Guangyou Xu, "A Fast Method for Detection Facial Features Under Varied Poses", China Journal of Image and Graphics, 2 (4), 1997.

Appendix A

A. A Survey of 3D Facial Modeling and Animation Techniques (based on the USC Technical Report [Noh 1998])

Realistic facial animation is achieved through geometric and image manipulations. Geometric deformations usually account for the shape and deformations unique to the physiology and expressions of a person. Image manipulations model the reflectance properties of the facial skin and hair to achieve small-scale detail that is difficult to model by geometric manipulation alone. Modeling and animation methods often exhibit elements of each realm. This appendix summarizes the theoretical approaches used in published work and describes their strengths, weaknesses, and relative performance. A taxonomy groups the methods into classes that highlight their similarities and differences.

A.1. Introduction

Since the pioneering work of Parke in 1972 [Parke 1972], many research efforts have attempted to generate realistic facial modeling and animation. Because of the complexity of human facial anatomy, and our natural sensitivity to facial appearance, it is still considered a hard problem to analyze and synthesize subtle expressions and emotions with great accuracy and with fast performance.

Figure A-1 Classification of facial modeling and animation methods (a taxonomy spanning geometry manipulations such as interpolation, parameterization, physics based and pseudo muscle models, spline models, free form deformation, and finite element methods; image manipulations such as image morphing, vascular expressions, and wrinkle generation; and individual model construction via anthropometry, scattered data interpolation, and model acquisition and fitting)

Although some recent work [Guenter 1998] [Pighin 1998] produces realistic results with relatively fast performance, the process for generating facial animation entails extensive human intervention or tedious tuning. Recent interest in facial modeling and animation is spurred by the increasing appearance of virtual characters in film and video, inexpensive desktop processing power, and the potential for a new 3D immersive communication metaphor for human-computer interaction. Much of the facial modeling and animation research is published in specific venues that are relatively unknown to the general graphics community. There are few surveys or detailed historical treatments of the subject [Parke 1996]. This survey is intended as an accessible reference to the range of reported facial modeling and animation techniques.
Facial modeling and animation research falls into two major categories, those based on geometric manipulations and those based on image manipulations (figure A-1). Each realm comprises several sub-categories. Geometric manipulations include parameterizations [Cohen 1993][Parke 1982], finite element methods [Basu 1998][Essa 1996][Guenter 1992][Pieper 1992], muscle based modeling [Platt 1981][Terzopoulos 1990][Waters 1987], visual simulation using pseudo muscles [Kalra 1992][Thalmann 1988], and spline models [Viad 1992][Wang 1994]. Image manipulations include vascular expressions [Kalra 1994], texture based wrinkle synthesis [Moubaraki 1995], dynamic texture mapping [Oka 1987], and image morphing and blending [Pighin 1998]. At the preprocessing stage, a person-specific individual model may be constructed using anthropometry [DeCarlo 1998], scattered data interpolation [Ulgen 1997], or mass spring system simulations [Lee 1995]. Such individual models are often animated by feature tracking or performance driven facial animation [Basu 1998][Eisert 1998][Pandzic 1994][Patterson 1991][Williams 1990]. The taxonomy in figure A-1 illustrates the diversity of approaches to facial animation. Exact classifications are complicated by the lack of sharp boundaries between methods and the fact that recent approaches often integrate several methods to produce better results.

The survey proceeds as follows. Sections A.2 and A.3 introduce the interpolation techniques and parameterizations, followed by animation methods using 2D and 3D morphing techniques in section A.4. The Facial Action Coding System, a frequently used facial description tool, is summarized in section A.5. Physics based modeling and simulated muscle modeling are discussed in sections A.6 and A.7, respectively. Techniques for increased realism, including wrinkle generation and texture manipulation, are surveyed in sections A.8 and A.9. Individual modeling and model fitting are described in section A.10, followed by animation from tracking data in section A.11. Section A.12 describes mouth animation research, followed by general conclusions and observations.

A.2. Interpolation

Interpolation techniques offer an intuitive approach to facial animation. Typically, an interpolation function specifies smooth motion between two key-frames at extreme positions, over a normalized time interval (figure A-2):

p_interpolated(t) = (1 - t) * p_neutral + t * p_smile, 0 <= t <= 1

Figure A-2 Linear interpolation performed on muscle contraction values (neutral face, interpolated image, smiling face)

Linear interpolation is commonly used for simplicity [Pighin 1998], but a cosine interpolation function or other variations can provide acceleration and deceleration effects at the beginning and end of an animation [Waters 1993]. When four key frames are involved, rather than two, bilinear interpolation generates a greater variety of facial expressions than linear interpolation [Parke 1974]. Bilinear interpolation, when combined with simultaneous image morphing, creates a wide range of facial expression changes [Arai 1996]. Varying the parameters of the interpolation functions generates interpolated images. Geometric interpolation directly updates the 2D or 3D positions of the face mesh vertices, while parameter interpolation controls functions that indirectly move the vertices. For example, Sera et al.
perform a linear interpolation of the spring muscle force parameters, rather than the positions of the vertices, to generate mouth animation [Sera 1996]. Figure A-2 shows two key frames and an interpolated image using linear interpolation of muscle contraction parameters. Although interpolations are fast, and they easily generate primitive facial animations, their ability to create a wide range of realistic facial configurations is severely restricted. Combinations of independent face motions are difficult to produce. Interpolation is a good method to produce a small set of animations from a few key-frames.

A.3. Parameterizations

Parameterization techniques for facial animation [Cohen 1993][Parke 1982] overcome some of the limitations and restrictions of simple interpolations. Ideal parameterizations specify any possible face and expression by a combination of independent parameter values [Parke 1996 pp. 188]. Unlike interpolation techniques, parameterizations allow explicit control of specific facial configurations. Combinations of parameters provide a large range of facial expressions with relatively low computational costs. As Waters indicates [Waters 1995], there is no systematic way to arbitrate between two conflicting parameters to blend expressions that affect the same vertices; hence parameterization rarely produces natural human expressions or configurations when a conflict between parameters occurs. For this reason, parameterizations are designed to affect only specific facial regions; however, this often introduces noticeable motion boundaries. Another limitation of parameterization is that the choice of the parameter set depends on the facial mesh topology and, therefore, a complete generic parameterization is not possible. Furthermore, tedious manual tuning is required to set parameter values, and even after that, unrealistic motion or configurations may result. The limitations of parameterization led to the development of diverse techniques such as morphing between images, (pseudo) muscle based animation, and finite element methods.

A.4. 2D & 3D Morphing

Morphing effects a metamorphosis between two target images or models. A 2D image morph consists of a warp between corresponding points in the target images and a simultaneous cross dissolve (in cross dissolving, one image is faded out while another is simultaneously faded in). Typically, the correspondences are manually selected to suit the needs of the application. Morphs between carefully acquired and corresponded images produce very realistic facial animations. Beier et al. demonstrate 2D morphing between two images with manually specified corresponding features (line segments) [Beier 1992]. The warp function is based upon a field of influence surrounding the corresponding features. Realism, with this approach, requires extensive manual interaction for color balancing, correspondence selection, and tuning of the warp and dissolve parameters. Variations in the target image viewpoints or features complicate the selection of correspondences. Realistic head motions are difficult to synthesize since target features become occluded or revealed during the animation. To overcome the limitations of 2D morphs, Pighin et al. combine 2D morphing with 3D transformations of a geometric model [Pighin 1998].
They animate key facial expressions with 3D geometric interpolation, while image morphing is performed between corresponding texture maps. This approach achieves viewpoint independent realism; however, animations are still limited to interpolations between pre-defined key-expressions. The 2D and 3D morphing methods can produce relatively realistic facial expressions, but they share similar limitations with the interpolation approaches. Selecting corresponding points in target images is manually intensive, dependent on viewpoint, and not generalizable to different faces. Also, the animation viewpoint is constrained to approximately that of the target images.

A.5. Facial Action Coding System

The Facial Action Coding System (FACS) is a description of the movements of the facial muscles and jaw/tongue derived from an analysis of facial anatomy [Ekman 1978]. FACS includes 44 basic action units (AUs). Combinations of independent action units generate facial expressions. For example, combining AU1 (Inner Brow Raiser), AU4 (Brow Lowerer), AU15 (Lip Corner Depressor), and AU23 (Lip Tightener) creates a sad expression. Sample action units and the basic expressions generated by combinations of action units are presented in tables A-1 and A-2.

AU   FACS Name               AU   FACS Name
1    Inner Brow Raiser       12   Lip Corner Puller
2    Outer Brow Raiser       14   Dimpler
4    Brow Lowerer            15   Lip Corner Depressor
5    Upper Lid Raiser        16   Lower Lip Depressor
6    Cheek Raiser            17   Chin Raiser
7    Lid Tightener           20   Lip Stretcher
9    Nose Wrinkler           23   Lip Tightener
10   Upper Lip Raiser        26   Jaw Drop

Table A-1 Sample single facial action units

Basic Expression   Involved Action Units
Surprise           AU 1, 2, 5, 15, 16, 20, 26
Fear               AU 1, 2, 4, 5, 15, 20, 26
Disgust            AU 2, 4, 9, 15, 17
Anger              AU 2, 4, 7, 9, 10, 20, 26
Happiness          AU 1, 6, 12, 14
Sadness            AU 1, 4, 15, 23

Table A-2 Example sets of action units for basic expressions

Animation methods using muscle models or simulated (pseudo) muscles overcome the correspondence and lighting difficulties of interpolation and morphing techniques. Physical muscle modeling mathematically describes the properties and the behavior of human skin, bone, and muscle systems. In contrast, pseudo muscle models mimic the dynamics of human tissue with heuristic geometric deformations. Approaches of either type often parallel the Facial Action Coding System and Action Units developed by Ekman and Friesen [Ekman 1978].

A.6. Physics Based Muscle Modeling

Physics-based muscle models are divided into three categories: mass spring systems, vector representations, and layered spring meshes. Mass-spring methods propagate muscle forces in an elastic spring mesh that models skin deformation. The vector approach deforms a facial mesh using motion fields in delineated regions of influence. A layered spring mesh extends a mass spring structure into three connected mesh layers to model anatomical facial behavior more faithfully.

Spring Mesh Muscle

The work by Platt and Badler is a forerunner of the research focused on muscle modeling and the structure of the human face [Platt 1981]. Forces applied to elastic meshes through muscle arcs generate facial expressions.
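As a minimal sketch of how muscle forces might propagate through such an elastic mesh, the following performs one explicit integration step over the skin nodes; the spring constant, damping, time step, and data layout are illustrative assumptions rather than Platt and Badler's actual formulation.

```python
import numpy as np

def spring_mesh_step(positions, velocities, edges, rest_lengths, muscle_forces,
                     k=10.0, damping=0.9, dt=0.01, mass=1.0):
    """One explicit Euler step of an elastic spring mesh driven by muscle forces.

    positions:     (n, 3) current node positions.
    velocities:    (n, 3) current node velocities (updated in place).
    edges:         list of (i, j) node index pairs forming springs.
    rest_lengths:  dict (i, j) -> rest length of that spring.
    muscle_forces: (n, 3) external forces applied by contracting muscle arcs.
    """
    forces = muscle_forces.copy()
    for i, j in edges:
        d = positions[j] - positions[i]
        length = np.linalg.norm(d)
        if length > 1e-9:
            f = k * (length - rest_lengths[(i, j)]) * d / length  # Hooke's law
            forces[i] += f
            forces[j] -= f
    velocities[:] = damping * (velocities + dt * forces / mass)
    return positions + dt * velocities
```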
Platt's later work presents a facial model with muscles represented as collections of functional blocks in defined regions of the facial structure [Platt 1985]. The model consists of 38 regional muscle blocks interconnected by a spring network. Action units are employed to deform the spring network by applying muscle forces.

Vector Muscle

A successful muscle model was proposed by Waters [Waters 1987]. A delineated deformation field models the action of muscles upon skin. A muscle definition includes the vector field direction, an origin, and an insertion point (figure A-3). The field extent is defined by cosine functions and fall-off factors that produce a cone shape when visualized as a height field. Waters also models the mouth sphincter muscle as a simplified parametric ellipsoid. The sphincter muscle contracts around the center of the ellipsoid and is primarily responsible for the deformation of the mouth region. Waters animates human emotions such as anger, fear, surprise, disgust, joy, and happiness using vector based linear and orbicularis oris muscles implementing the FACS.

Figure A-3 Zone of influence of Waters' linear muscle model (deformation decreases from the origin of the muscle toward its insertion)

Figure A-4 Waters' linear muscles

Figure A-4 shows Waters' muscles embedded in a facial mesh. The positioning of vector muscles into anatomically correct positions can be a daunting task. No automatic way of placing muscles beneath a generic or person-specific mesh has been reported. The process involves manual trial and error with no guarantee of efficient or optimal placement. Incorrect placement results in unnatural or undesirable animation of the mesh. Nevertheless, the vector muscle model is widely used because of its compact representation and independence of the facial mesh structure. An example of vector muscles is seen in Billy, the baby in the movie "Tin Toy", who has 47 Waters' muscles on his face.

Figure A-5 Triangular skin tissue prism element (epidermal surface with epidermal nodes, dermal fatty layer with fascia nodes, and muscle layer with bone nodes above the skull surface; dotted and solid lines indicate elastic spring connections between nodes)

Layered Spring Mesh Muscles

Terzopoulos and Waters propose a facial model that captures the detailed anatomical structure and dynamics of the human face [Terzopoulos 1990]. Their three layers of deformable mesh correspond to skin, fatty tissue, and muscle tied to bone. Elastic spring elements connect each mesh node and each layer. Muscle forces propagate through the mesh systems to create animation. This model achieves decent animation, but simulating volumetric deformations with three-dimensional lattices requires extensive computation. A simplified mesh system reduces the computation time [Wu 1994]. Lee et al. [Lee 1995] present models of physics-based synthetic skin and muscle layers based on earlier work [Terzopoulos 1990]. The face model consists of three components: a biological tissue layer with nonlinear deformation properties, a muscle layer knit together under the skin, and an impenetrable skull structure beneath the muscle layer.
The synthetic tissue is modeled as triangular prism elements that are divided into the epidermal surface, the fascia surface, and the skull surface (figure A-5). Spring elements connecting the epidermal and fascia layers simulate skin elasticity. Spring elements that effect muscle forces connect the fascia and skull layers. The model achieves improved results. Tremendous computation and extensive tuning are needed, however, to model a specific face or characteristic.

A.7. Pseudo or Simulated Muscle

Physics-based muscle modeling produces decent results by approximating human anatomy, but the exact modeling and parameter tuning necessary to simulate a specific human's facial structure is daunting. Simulated muscles offer an alternative approach by deforming the facial mesh in muscle-like fashion while ignoring the complicated underlying anatomy. Deformation usually occurs only at the thin-shell facial mesh. Muscle forces are simulated in the form of splines [Viad 1992][Wang 1994], wires [Singh 1998], or free form deformations [Coquillart 1990][Kalra 1992].

Free form deformation

Free form deformation (FFD) deforms volumetric objects by manipulating control points arranged in a three-dimensional cubic lattice [Sederberg 1996]. Conceptually, a flexible object is embedded in an imaginary, clear, and flexible control box containing a 3D grid of control points. As the control box is squashed, bent, or twisted into arbitrary shapes, the embedded object deforms accordingly (figure A-6).

Figure A-6 Free form deformation (a controlling box and an embedded object; when the controlling box is deformed by manipulating control points, so is the embedded object)

The basis for the control points is a trivariate tensor product Bernstein polynomial. FFDs can deform many types of surface primitives, including polygons; quadric, parametric, and implicit surfaces; and solid models. Extended free form deformation (EFFD) allows the extension of the control point lattice into a cylindrical structure [Coquillart 1990]. A cylindrical lattice provides additional flexibility for shape deformation compared to regular cubic lattices. Rational free form deformation (RFFD) incorporates weight factors for each control point, adding another degree of freedom in specifying deformations. Hence, deformations are possible by changing the weight factors instead of changing the control point positions. When all weights are equal to one, RFFD becomes an FFD. A main advantage of using FFD (EFFD, RFFD) to abstract deformation control from the actual surface description is that the transition of form is no longer dependent on the specifics of the surface itself [Thalmann 1996 pp. 175]. Kalra et al. interactively simulate the visual effects of the muscles using Rational Free Form Deformation (RFFD) combined with a region-based approach [Kalra 1992]. To simulate the muscle action on the facial skin, surface regions and control volumes corresponding to the anatomical description of the muscle actions are defined. The skin deformations corresponding to stretching, squashing, expanding, and compressing inside the volume are simulated by interactively displacing the control points and by changing the weights associated with each control point.
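A minimal sketch of the lattice evaluation behind this family of deformers is given below; it assumes a regular cubic lattice with unit local coordinates, and the optional per-control-point weights reduce to plain FFD when they are all one, in the spirit of RFFD. The function names and degree are illustrative.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein polynomial B_{i,n}(t)."""
    return comb(n, i) * (t**i) * ((1.0 - t)**(n - i))

def ffd_point(stu, control_points, weights=None, degree=2):
    """Deform one point with local lattice coordinates stu in [0, 1]^3.

    control_points: (degree+1, degree+1, degree+1, 3) lattice positions.
    weights:        optional lattice of scalars; all ones gives plain FFD,
                    non-uniform values give an RFFD-style deformation.
    """
    n = degree
    if weights is None:
        weights = np.ones(control_points.shape[:3])
    s, t, u = stu
    num, den = np.zeros(3), 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            for kk in range(n + 1):
                b = bernstein(n, i, s) * bernstein(n, j, t) * bernstein(n, kk, u)
                num += weights[i, j, kk] * b * control_points[i, j, kk]
                den += weights[i, j, kk] * b
    return num / den
```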
Displacing a control point is analogous to actuating a physically modeled muscle. RFFD (FFD, EFFD) does not provide a precise simulation of the actual muscle and skin behavior, so it fails to model furrows, bulges, and wrinkles in the skin. Abstract Muscle Action (AMA) procedures proposed by Thalmann et al. [Thalmann 1988] are used instead of FACS [Ekman 1978].

Spline Pseudo Muscles

Although polygonal models of the face are widely used, they often fail to adequately approximate the smoothness or flexibility of the human face. Fixed polygonal models do not deform smoothly in arbitrary regions, and planar vertices cannot be twisted into curved surfaces without subdivision. An ideal facial model has a surface representation that supports smooth and flexible deformations. Spline muscle models offer a solution. Splines are usually up to C2 continuous, hence a surface patch is guaranteed to be smooth, and they allow localized deformation on the surface. Furthermore, affine transformations are defined by the transformation of a small set of control points instead of all the vertices of the mesh, hence reducing the computational complexity. Pixar used bicubic Catmull-Rom spline patches to model Billy, the baby in the animation "Tin Toy", and more recently used a variant of Catmull-Clark [Catmull 1978] subdivision surfaces to model Geri, a human character in the short film Geri's Game. This technique is mainly adapted to model sharp creases on a surface or discontinuities between surfaces [Derose 1998]. (A distinguishing property of Catmull-Rom splines is that the piecewise cubic polynomial segments pass through all the control points except the first and last when used for interpolation; another is that the convex hull property is not observed in Catmull-Rom splines.) Eisert and Girod use triangular B-splines to overcome the drawback that conventional B-splines do not refine curved areas locally since they are defined on a rectangular topology [Eisert 1998]. A hierarchical spline model reduces the number of unnecessary control points. Wang et al. show hierarchical spline models with simulated muscles based on local surface deformations [Wang 1994]. Muscles coupled with hierarchical spline surfaces are capable of creating bulging skin surfaces and a variety of facial expressions with high rendering speed.

A.8. Wrinkles

Wrinkles are important for realistic facial animation and modeling. They aid in recognizing facial expressions as well as a person's age. There are two types of wrinkles: temporary wrinkles that appear for a short time in expressions, and permanent wrinkles that form over time as permanent features of a face [Wu 1994]. Wrinkles and creases are difficult to model with techniques such as simulated muscles or parameterization, since these methods are designed to produce smooth deformations. Physically based modeling with plasticity or viscosity, and texture techniques like bump mapping, are more appropriate.

Figure A-7 Generation of a wrinkled surface using the bump mapping technique (a wrinkle function perturbs the normals of a smooth surface to produce a wrinkled appearance)

Wrinkles with Bump Mapping

Bump mapping produces perturbations of the surface normals that alter the shading of a surface.
Arbitrary wrinkles can appear on a smooth geometric surface by defining wrinkle functions [Blinn 1978]. This technique easily generates wrinkles by varying wrinkle function parameters. Bump mapping is relatively computationally demanding, as it requires about twice the computing effort needed for conventional color texture mapping. A bump mapped wrinkled surface is depicted in figure A-7. Moubaraki et al. present a system exploiting the bump mapping technique to produce synthetic wrinkles [Moubaraki 1995]. The method synthesizes and animates wrinkles by morphing between wrinkled and unwrinkled textures. Bump map gradients are extracted in orthogonal directions and used to perturb the normals of the unwrinkled texture. Until recently, bump mapping was difficult to compute in real time.

Physically Based Wrinkles

Physically based wrinkle models using the plastic-visco-elastic properties of the facial skin and permanent skin aging effects are reported by Wu et al. [Wu 1994]. Viscosity and plasticity are two of the canonical inelastic properties. Viscosity is responsible for time dependent deformation, while plasticity accounts for non-invertible permanent deformation that occurs when an applied force goes beyond a threshold. Both viscosity and plasticity add to the simulation of inelasticity that moves the skin surface in smooth facial deformations. For generating immediate expressive wrinkles, the simulated skin surface deforms smoothly from muscle forces until the forces exceed the threshold; then plasticity comes into play, reducing the restoring force caused by elasticity and forming permanent wrinkles. Plasticity does not occur at all points simultaneously; rather, it occurs at points that are most stressed by muscle contractions. By the repetition of this inelastic process over time, permanent expressive wrinkles become increasingly salient on the facial model.

Other Wrinkle Approaches

Simpler inelastic models developed by Terzopoulos [Terzopoulos 1988] compute only the visco-elastic property of the face. Spline segments model the bulges for the formation of wrinkles [Viad 1992]. Moubaraki et al. [Moubaraki 1994] show the animation of facial expressions using a time-varying homotopy based on the homotopy sweep technique [Kajiwara 1993], where the emphasis was placed on the forehead and mouth motions accounting for the generation of wrinkles. (Homotopy is a notion that forms the basis of algebraic topology; readers should refer to [Kajiwara 1993] for more information on homotopy theory.)

A.9. Texture Manipulation

Synthetic facial images derive color from either shading or texturing. Shading computes a color value for each pixel from the surface properties and a lighting model. Because of the subtlety of human skin coloring, simple shading models do not generally produce adequate realism. Textures enable complex variations of surface properties at each pixel, thereby creating the appearance of surface detail that is absent in the surface geometry. Consequently, textures are widely used to achieve facial image realism.

Image Morphing and Blending

Using multiple photographs, Pighin et al. develop a photorealistic textured 3D facial model [Pighin 1998]. Both view-dependent and view-independent texture maps exploit weight maps to blend multiple textures. Weight maps depend on factors such as self-occlusion, smoothness, positional certainty, and view similarity.
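A minimal sketch of such weighted blending in the view-independent case, assuming the photographs have already been resampled into a shared texture space and that each weight map encodes factors like those above, might look as follows; the names and array shapes are illustrative.

```python
import numpy as np

def blend_textures(textures, weight_maps, eps=1e-8):
    """Fuse registered texture maps using per-pixel weight maps.

    textures:    list of (H, W, 3) float images in a shared texture parameterization.
    weight_maps: list of (H, W) per-pixel weights (visibility, positional
                 certainty, view similarity, ...).
    """
    acc = np.zeros_like(textures[0])
    wsum = np.zeros(textures[0].shape[:2])
    for tex, w in zip(textures, weight_maps):
        acc += w[..., None] * tex
        wsum += w
    return acc / (wsum[..., None] + eps)  # eps guards texels no view observed
```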
A view-independent fusion of multiple textures often exhibits blurring from sampling and registration errors. In contrast, a view-dependent fusion dynamically adjusts the blending weights for the current view by rendering the model repeatedly, each time with different texture maps. The drawback of view-dependent textures is their higher memory and computing requirements. In addition, the resulting images are more sensitive to lighting variation in the original texture photographs.

Dynamic Texture Mapping

Oka et al. demonstrate a dynamic texture mapping system for the synthesis of facial expressions and their animation [Oka 1987]. When the geometry of the 3D objects or the viewpoint changes, new texture mapping occurs for optimal display. A mapping function from the texture plane into the output screen is approximated by a locally linear function on each of the small regions that together form the texture plane. Facial expressions and their animations are synthesized by interpolation and extrapolation among multiple 3D facial surfaces and by dynamic texture mapping onto them, depending on the viewpoint and geometry changes.

Vascular Expressions

Also important is a simulation of skin color changes that depend on a person's emotional state. The first notable computational model of vascular expression was reported by Kalra et al. [Kalra 1994], although simplistic approaches were conceived earlier [Patel 1992]. Patel added a skin tone effect to simulate the variation of the facial color by changing the color of all the polygons during strong emotion. Kalra et al. developed a computational model of emotion that includes such visual characteristics as vascular effects and their pattern of change during the course of the emotions. They define emotion as a function of two parameters in time, one tied to the intensities of the muscular expressions and the other to the color variations due to vascular expressions. Modeling the color effects directly from blood flow is complicated. Texture maps and pixel valuation offer a simpler means of approximating vascular effects. Pixel valuation computes the parameter change for each pixel inside the Bezier planar patch mask that defines the affected region of a Minimum Perceptible Color Action (MPCA) in the texture image. This pixel parameter modifies the color attributes of the texture image. With this technique, pallor and blushing of the face are demonstrated.

A.10. Fitting and Model Construction

An important problem in facial animation is to model a specific person, i.e., modeling the 3D geometry of an individual face. A range scanner, digitizer probe, or stereo disparity can measure three-dimensional coordinates. However, the models obtained by those processes are often poorly suited for facial animation. Information about the facial structures is missing; measurement noise produces distracting artifacts; and model vertices are poorly distributed. Also, many measurement methods produce incomplete models, lacking hair, ears, eyes, etc.
A.10. Fitting and Model Construction

An important problem in facial animation is modeling a specific person, i.e., modeling the 3D geometry of an individual face. A range scanner, digitizer probe, or stereo disparity can measure three-dimensional coordinates. However, the models obtained by these processes are often poorly suited for facial animation: information about the facial structures is missing, measurement noise produces distracting artifacts, and model vertices are poorly distributed. Also, many measurement methods produce incomplete models, lacking hair, ears, eyes, etc.

One approach to person-specific modeling is to painstakingly prepare a prototype or generic animation mesh with all the necessary structure and animation information. This generic model is fitted or deformed to a measured geometric mesh of a specific person to create a personalized animation model. The geometric fit also facilitates the transfer of texture if it is captured with the measured mesh. If the generic model has fewer polygons than the measured mesh, decimation is implicit in the fitting process. Person-specific modeling and fitting processes use various approaches such as scattered data interpolation [Pighin 1998][Ulgen 1997], mass-spring systems [Lee 1995], and anthropometry techniques [DeCarlo 1998][Kuo 1997]. Some methods attempt an automated fitting process, but most require significant manual intervention. Figure A-8 depicts the general fitting process.

Bilinear Interpolation

Parke uses bilinear interpolation to create various facial shapes [Parke 1974]. His assumption is that a large variety of faces can be represented by variations of a single topology. He creates ten different faces by changing the conformation parameters of a generic face model. Parke's parametric model is restricted to the ranges that the conformation parameters can provide, and tuning the parameters for a specific face is difficult.

Figure A-8 Example construction of a person-specific model for animation from a generic model: (a) scanned-in range data, containing depth information; (b) scanned-in reflectance data, containing color information; (c) generic mesh to be deformed, containing suitable information for animation (see figure A-4 for an example); (d) generic mesh projected onto cylindrical coordinates for fitting; (e) fitted mesh, with a mass-spring system used for final tuning; (f) mesh before fitting, shown for comparison with (e).

Scattered Data Interpolation

Radial basis functions are capable of closely approximating or interpolating smooth hyper-surfaces [Powell 1987]. Some approaches morph a generic mesh into specific shapes with scattered data interpolation techniques based on radial basis functions. The advantages of this approach are as follows. First, the morph does not require equal numbers of nodes in the source and target meshes, since missing points are interpolated. Second, mathematical support ensures that the morphed mesh approaches the target mesh if appropriate correspondences are selected. Scattered data interpolation for model fitting often works in three stages [Ulgen 1997][Pighin 1998]. First, biologically meaningful feature points are selected around the eyes, nose, lips, and perimeters of both the source and target face models. Second, the landmark points define the coefficients of the radial basis function used to morph the volume. Finally, points in the source model are interpolated using the coefficients computed from the feature points. More points can also be added for final tweaking.
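The following minimal sketch illustrates the three-stage scattered data interpolation just described: landmark correspondences define an RBF warp, which is then applied to every vertex of the generic mesh. The Hardy multiquadric kernel, the absence of a polynomial term, and the synthetic landmark data are simplifying assumptions, not the exact pipeline of the cited systems.

```python
import numpy as np

def fit_rbf_warp(src_marks, dst_marks, c=1.0):
    # Stage 2: landmark correspondences determine the RBF coefficients.
    d = np.linalg.norm(src_marks[:, None, :] - src_marks[None, :, :], axis=-1)
    H = np.sqrt(d * d + c * c)                        # Hardy multiquadric basis matrix
    W = np.linalg.solve(H, dst_marks - src_marks)     # one displacement weight per landmark

    def warp(points):
        # Stage 3: every source vertex is displaced by the interpolated deformation.
        d = np.linalg.norm(points[:, None, :] - src_marks[None, :, :], axis=-1)
        return points + np.sqrt(d * d + c * c) @ W

    return warp

# Usage: deform all vertices of the generic mesh toward the measured geometry.
src = np.random.rand(20, 3)                           # stage 1: selected feature points (placeholder)
dst = src + 0.05 * np.random.randn(20, 3)             # corresponding points on the measured mesh
warp = fit_rbf_warp(src, dst)
fitted_vertices = warp(np.random.rand(500, 3))        # generic mesh vertices (placeholder)
```

Because the warp is defined everywhere, vertices without explicit correspondences are interpolated smoothly, which is exactly the property that lets the generic and measured meshes have different vertex counts.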
Anthropometry

In individual model acquisition, laser scanning and stereo images are widely used because of their ability to acquire detailed geometry and fine textures. However, as mentioned earlier, these methods also have several drawbacks. Scanned data or stereo images often miss regions due to occlusion. Spurious data and perimeter artifacts must be touched up by hand. Existing methods for automatically finding corresponding feature points are not robust; they still require manual adjustment if the features are not salient in the measured data. The generation of individual models using anthropometry (the science dedicated to the measurement of the human face) attempts to solve many of these problems for applications where facial variation is desirable but absolute appearance is not important.

Kuo et al. propose a method to synthesize a lateral face from only one 2D gray-level image of a frontal face with no depth information [Kuo 1997]. Initially, a database is constructed containing facial parameters measured according to anthropometric definitions. This database serves as a priori knowledge. The lateral facial parameters are then estimated from the frontal facial parameters using minimum mean square error (MMSE) estimation rules applied to the database. Specifically, the depth of one lateral facial parameter is determined by a linear combination of several frontal facial parameters. The 3D generic facial model is adapted according to both the frontal-plane coordinates extracted from the image and their estimated depths.

Figure A-9 Some of the anthropometric landmarks on the face [Farkas 1994]

Whereas Kuo's approach uses anthropometry with one frontal image, DeCarlo et al. construct various facial models purely from anthropometry, without assistance from images [DeCarlo 1998]. This system constructs a new face model in two steps. The first step generates a random set of measurements that characterize the face. The form and values of these measurements are computed according to face anthropometry (see figure A-9). The second step constructs the best surface that satisfies the geometric constraints, using a variational constrained optimization technique [Gortler 1995][Welch 1992]. In this technique, one imposes a variety of constraints on the surface and then tries to create a smooth and fair surface while minimizing the deviation from a specified rest shape, subject to the constraints. DeCarlo et al. use anthropometric measurements as the constraints, and the remainder of the face is determined by minimizing the deviation from the given surface objective function. Variational modeling enables the system to capture the shape similarities of faces while allowing anthropometric differences. Although anthropometry has potential for rapidly generating plausible facial geometric variations, the approach does not model realistic variations in color, wrinkling, expressions, or hair.
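As an illustration of the estimation step described for Kuo et al. above, the sketch below regresses lateral (depth) parameters on a linear combination of frontal parameters taken from a database. The database arrays, parameter counts, and the reduction of MMSE estimation to a linear least-squares fit (exact only under a jointly Gaussian assumption) are illustrative assumptions rather than the published data or procedure.

```python
import numpy as np

# Hypothetical anthropometric database: 200 faces, 8 frontal and 3 lateral measurements.
rng = np.random.default_rng(0)
frontal_db = rng.normal(size=(200, 8))
lateral_db = frontal_db @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(200, 3))

# Least-squares fit of the linear combination of frontal parameters (plus an affine term).
A = np.hstack([frontal_db, np.ones((200, 1))])
coeffs, *_ = np.linalg.lstsq(A, lateral_db, rcond=None)

def estimate_depths(frontal_params):
    # Depth of each lateral parameter as a linear combination of the frontal parameters.
    return np.append(frontal_params, 1.0) @ coeffs

print(estimate_depths(frontal_db[0]))   # estimated lateral/depth parameters for one face
```

The estimated depths, together with the frontal-plane coordinates measured from the image, are what drive the adaptation of the generic 3D model.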
Other Methods

Essa et al. [Essa 1996] tackle the fitting problem using modular eigenspace methods [Moghaddam 1994][Pentland 1994]. (Modular eigenspace methods are primarily used for the recognition and detection of rigid, roughly convex objects such as faces. The approach is modular in that it allows the incorporation of important facial features such as the eyes, nose, and mouth; the eigenspace method computes similarity from the image eigenvectors.) This method enables the automatic extraction of the positions of feature points such as the eyes, nose, and lips in the image. These features define the warping of a specific face image to match the generic face model. After warping, deformable nodes are extracted from the image for further refinement.

DiPaola's Facial Animation System (FAS) is an extension of Parke's approach [DiPaola 1991]. New facial models are generated by digitizing live subjects or sculptures, or by manipulating existing models with free-form deformations, stochastic noise deformations, or vertex editing. Akimoto et al. use front and profile images of a subject to automatically create 3D facial models [Akimoto 1993]. Additional fitting techniques are described in [Komatsu 1989][Strub 1995][Xu 1990].

A.11. Animation by Tracking

The difficulties in controlling facial animations led to the performance-driven approach, in which tracked human actors guide the animation. Real-time video processing allows interactive animations where the actors observe the animations they create with their own motions and expressions. Accurate tracking of feature points or edges is important to maintain consistent, good-quality animation. Often the tracked 2D or 3D feature motions are filtered or transformed to generate the motion data needed for driving a specific animation system. Motion data can be used to directly generate facial animation [Essa 1996] or to infer AUs of FACS in generating facial expressions. Figure A-10 shows animation driven by a real-time feature tracking system.

Figure A-10 Animation by face tracking: (a) initial tracking of the features of the face; (b) features are tracked in real time while the subject is moving; (c) the avatar mimics the behavior of the subject. Face tracking is performed without markings on the face using Eyematic Inc.'s face tracking system, and real-time animation of the synthesized avatar is achieved from the tracked features.

Snakes

Snakes, or deformable minimum-energy curves, are widely used to track intentionally marked facial features [Kass 1987]. The recognition of facial features with snakes is primarily based on color samples and edge detection. Many systems couple tracked snakes to underlying muscle mechanisms to drive facial animation [Pandzic 1994][Thalmann 1993][Terzopoulos 1993][Terzopoulos 1991][Terzopoulos 1990][Waters 1991]. Muscle contraction parameters are estimated from the tracked facial displacements in video sequences.

Optical Flow Tracking

Colored markers painted on the face or lips [Kishino 1994][Moubaraki 1995][Ohya 1995][Patterson 1991][Williams 1990] are extensively used to aid in tracking facial expressions or recognizing speech from video sequences. Markings on the face are intrusive, and reliance on markings restricts the scope of acquired geometric information to the marked features. Optical flow [Horn 1981] and spatio-temporal normalized correlation measurements [Darrell 1993] track natural features and therefore obviate the need for intentional markings on the face.
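The normalized correlation measurements mentioned above, and used by the system described next, can be illustrated with a minimal template-matching tracker. The window size, search range, and brute-force integer search are assumptions chosen for clarity; they are not the implementation of the cited work.

```python
import numpy as np

def normalized_correlation(patch, template):
    # Mean-removed correlation score, largely insensitive to brightness changes.
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def track_feature(frame, template, prev_xy, search=8):
    # Search a small window around the previous location for the best-matching patch.
    h, w = template.shape
    best, best_xy = -1.0, prev_xy
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = prev_xy[0] + dy, prev_xy[1] + dx
            if 0 <= y and y + h <= frame.shape[0] and 0 <= x and x + w <= frame.shape[1]:
                score = normalized_correlation(frame[y:y + h, x:x + w], template)
                if score > best:
                    best, best_xy = score, (y, x)
    return best_xy, best   # new location and its peak correlation score
```

The peak correlation scores themselves, not just the locations, can serve as measurements, which is how the system described in the next paragraph uses them.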
Essa et al. utilize optical flow and physically based observations [Essa 1994]. The primary visual measurements of the system are sets of peak normalized correlation scores against a set of previously trained 2D templates. The normalized correlation matching process [Darrell 1993] allows the user to translate freely side-to-side and up-and-down, and minimizes the effects of illumination changes. A 3D finite element mesh is adopted as the facial model, onto which muscles are attached based on the work of Pieper [Pieper 1992] and Waters [Waters 1991]. In an offline process, the muscle parameters associated with each facial expression are first determined using finite element methods [Bathe 1982].

Eisert and Girod derive motion estimation and facial expression analysis from optical flow over the whole face [Eisert 1998]. Since the errors of consecutive motion estimates tend to accumulate over multiple frames, a multi-scale feedback loop is employed in the motion estimation process. First, the motion parameters are approximated between consecutive low-resolution frames by minimizing the difference between a motion-compensated frame and the current target frame. The procedure is repeated at higher resolutions, each time producing more accurate facial motion parameters. This iterative repetition at various image resolutions allows large displacement vectors to be measured between two successive video frames.

Other Methods

Kato et al. employ isodensity maps for the description and synthesis of facial expressions [Kato 1992]. An isodensity map is constructed from the gray-level histogram of the image based on the brightness of each region. The lightest gray-level area is labeled the level-one isodensity line and the darkest is called the level-eight isodensity line. Together, these levels represent the 3D structure of the face. This method, akin to general shape-from-shading methods [Horn 1989], is proposed as an alternative to feature tracking techniques.

Saji et al. introduce the notion of Lighting Switch Photometry to extract 3D shapes from the moving face [Saji 1992]. The idea is to capture a time sequence of images illuminated in turn by separate light sources from the same viewpoint. The normal vector at each point on the surface is computed by measuring the intensity of radiance, and the 3D shape of the face at a particular instant is then determined from these normal vectors. Even if the human face moves, detailed facial shapes such as wrinkles are extracted by Lighting Switch Photometry.

Azarbayejani et al. use an extended Kalman filter to recover the rigid motion parameters of the head [Azarbayejani 1993]. Saulnier et al. report a template-based method for tracking and animation [Saulnier 1995]. Li et al. use the Candide model for 3D motion estimation in model-based image coding, handling both rigid head motion and non-rigid expressions [Li 1993]. Masse et al. use optical flow and principal direction analysis for automatic lip reading [Masse 1990].
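As a concrete illustration of the multi-scale feedback idea described for Eisert and Girod above, the following sketch estimates a simple translational motion coarse-to-fine over an image pyramid. Real systems estimate full facial motion parameters; the translation-only model, brute-force search, pyramid depth, and wrap-around shifting via np.roll are simplifying assumptions.

```python
import numpy as np

def downsample(img):
    return img[::2, ::2]

def estimate_shift(ref, cur, search=2):
    # Brute-force integer search minimizing the residual against the motion-compensated frame.
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            shifted = np.roll(np.roll(cur, dy, axis=0), dx, axis=1)
            err = np.mean((ref - shifted) ** 2)
            if err < best:
                best, best_d = err, (dy, dx)
    return best_d

def coarse_to_fine_shift(ref, cur, levels=3):
    pyr = [(ref, cur)]
    for _ in range(levels - 1):
        pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
    dy, dx = 0, 0
    for r, c in reversed(pyr):                           # start at the coarsest level
        dy, dx = dy * 2, dx * 2                          # propagate the estimate to the finer level
        c = np.roll(np.roll(c, dy, axis=0), dx, axis=1)  # compensate before estimating the residual
        ddy, ddx = estimate_shift(r, c)
        dy, dx = dy + ddy, dx + ddx
    return dy, dx
```

Estimating at low resolution first keeps the search range small while still capturing large displacements, and each finer level only refines the residual, which is the point of the feedback loop.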
A.12. Mouth Animation

Among the regions of the face, the mouth is the most complicated in terms of its anatomical structure and its deformation behavior. Its complexity leads to considering the modeling and animation of the mouth independently from the remainder of the face. Many of the basic ideas and methods for modeling the mouth region are optimized variations of general facial animation methods. Research specifically involved in the modeling and animation of the mouth can be categorized as muscle modeling with mass-spring systems, finite element methods, and parameterizations.

Mass-Spring Muscle Systems

In mouth modeling and speech animation, mass-spring systems often model the phonetic structure of speech animation. Kelso et al. qualitatively analyze a real person's face in reiterant speech production and model it with a simple mass-spring system [Kelso 1985]. Browman et al. show the control of a vocal-tract simulation with two mass-spring systems: one spring controls the lip aperture and the other the protrusion [Browman 1985].

Waters et al. develop a two-dimensional mouth muscle model and animation method [Waters 1995]. Since mouth animation is generated from relatively few muscle actions, motion realism is largely independent of the number of surface model elements. Waters et al. also attempt to synchronize computer-generated faces with synthetic speech driven by text input [Waters 1993]. Two different mouth animation approaches are presented. In the first, each viseme (a group of phonemes with similar mouth shapes when pronounced) is defined by mouth node positions in the topology of the mouth. Intermediate node positions between consecutive visemes are interpolated using a cosine function to produce acceleration and deceleration effects at the start and end of each viseme animation. During fluent speech, the mouth shape rarely converges to discrete viseme targets, due to the continuity of speech and the physical properties of the mouth. To emulate fluent speech, the calculation of co-articulated visemes is needed. (Rapid sequences of speech require that the posture for one phoneme anticipate the posture for the next phonemes; conversely, the posture for the current phoneme is modified by the previous phonemes. This overlap between phonetic segments is referred to as co-articulation [Kent 1977].) The second animation method exploits Newtonian physics, Hookean elastic forces, and velocity-dependent damping coefficients to construct the dynamic equations of nodal displacements. The dynamic system adapts itself as the rate of speech increases, reducing lip displacement as it tries to accommodate each new position. This behavior is characteristic of real lip motion.

Figure A-11 Muscle placements around the mouth: 1. levator labii superioris alaeque nasi; 2. levator labii superioris; 3. zygomaticus minor; 4. zygomaticus major; 5. depressor anguli oris; 6. depressor labii inferioris; 7. mentalis; 8. risorius; 9. levator anguli oris; 10. orbicularis oris. Although 5, 6, and 7 are attached to the mouth radially in reality, they are modeled linearly here.

Layered Spring Mesh Muscles

Sera et al. [Sera 1996] add a mouth shape control mechanism to facial skin modeled as a three-layer spring mesh with appropriately chosen elasticity coefficients (following the approach of [Lee 1995]). Muscle contraction values for each phoneme are determined by comparing corresponding points on photographs and the model (see figure A-11 for muscle placements around the mouth). During speech animation, intermediate mouth shapes are defined by a linear interpolation of the muscle spring force parameters.
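A minimal sketch of the cosine-eased interpolation between consecutive viseme key shapes described for Waters et al. above follows. The viseme shapes, timings, and node count are placeholders rather than data from the cited work, and co-articulation is not modeled.

```python
import numpy as np

def cosine_blend(t):
    # Eases in and out: zero velocity at the start and end of each viseme transition.
    return 0.5 - 0.5 * np.cos(np.pi * np.clip(t, 0.0, 1.0))

def mouth_shape(time, viseme_times, viseme_shapes):
    # viseme_times: increasing key times; viseme_shapes: (num_visemes, num_nodes, 3) node positions.
    i = np.searchsorted(viseme_times, time) - 1
    i = np.clip(i, 0, len(viseme_times) - 2)
    t = (time - viseme_times[i]) / (viseme_times[i + 1] - viseme_times[i])
    w = cosine_blend(t)
    return (1.0 - w) * viseme_shapes[i] + w * viseme_shapes[i + 1]

# Example: three key visemes for a five-node mouth contour.
times = np.array([0.0, 0.2, 0.5])
shapes = np.random.rand(3, 5, 3)
print(mouth_shape(0.3, times, shapes).shape)   # interpolated node positions at t = 0.3
```

The cosine easing is what produces the acceleration and deceleration at viseme boundaries; replacing it with a straight linear blend gives visibly mechanical motion.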
Finite Element Method

The finite element method (FEM) is a numerical approach to approximating the physics of an arbitrarily complex object [Bathe 1982]. It implicitly defines interpolation functions between nodes for the physical properties of the material, typically a stress-strain relationship. An object is decomposed into area or volume elements, each endowed with physical parameters, and the dynamic element relationships are computed by integrating the piecewise components over the entire object.

Basu et al. build a finite element 3D model of the lips [Basu 1998]. The model parameters are determined from a training set of measured lip motions so as to minimize the strain felt throughout the linear elastic FEM structure. The difficult control problems associated with muscle-based approaches are reduced by the training stage, as are the accuracy problems that result from using only key-frames for mouth animation.

Parameterization

Parametric techniques for mouth animation usually require a significant number of input parameters for realistic control. Mouth animation with only two parameters has been proposed [Moubaraki 1996]. The width and height of the mouth opening are the parameter pair that determines the opening angle at the corners of the mouth as well as the protrusion coefficients. The lip shape is obtained from a piecewise spline interpolation. For each of a set of scanned facial expressions, the opening angle at the lip corner and the z-components of protrusion are measured and associated with the measured height and width of the mouth opening. This set of associations is the training set for a radial basis neural network. At run time, detected feature points from a video sequence are input to the trained network, which computes the lip shape and protrusion for animation.

Tongue Modeling

In most facial animation, the tongue and its movement are omitted or oversimplified. When modeled, the tongue is often represented as a simple parallelepiped [Lewis 1987][Thalmann 1987][Parke 1991]. Although only a small portion of the tongue is visible during normal speech, the tongue shape is important for realistic synthesized mouth animation.

Stone proposes a 3D model of the tongue defined as five segments in the coronal plane and five segments in the sagittal plane [Stone 1991]. (In anatomy, the coronal plane divides the body into front and back halves, while the sagittal plane cuts through the center of the body, dividing it into right and left halves.) This model may deform into twisted, asymmetric, and grooved shapes. Pelachaud et al. carefully simplify this relatively accurate tongue model for speech animation [Pelachaud 1994]. They model the tongue as a blobby object [Wyvill 1986]. This approach assumes a pseudo-skeleton comprised of geometric primitives (nine triangles) that serve as a charge distribution mechanism, creating a spatial potential field. Modifying the skeleton modifies the equi-potential surface that represents the tongue shape, and the shape of the tongue changes so as to preserve volume. Equi-potential surfaces are expensive to render directly, but an automatic method adaptively computes a triangular mesh during animation. The adaptive method produces triangles whose sizes are inversely proportional to the local curvature of the equi-potential surface. In addition, isotropically curved surface areas are represented by equilateral triangles, and anisotropically curved surface areas produce acute triangles [Overveld 1993]. The palate is modeled as a semi-sphere and the upper teeth are simulated by a planar strip. Collision detection is performed using implicit functions.
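The blobby-object idea just described can be sketched minimally: skeleton primitives carry field contributions whose sum defines an equi-potential surface. Here the primitives are points with Gaussian falloff rather than the nine triangles of the actual model, and the threshold, radii, and sample data are assumptions; the same implicit function also gives the cheap inside/outside test usable for collision detection.

```python
import numpy as np

def blobby_field(points, centers, strength=1.0, radius=0.3):
    # Sum of Gaussian contributions from every skeleton primitive at every query point.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return strength * np.exp(-d2 / (radius * radius)).sum(axis=1)

def inside_surface(points, centers, threshold=0.5):
    # The surface is the iso-contour field == threshold; larger field values lie inside.
    return blobby_field(points, centers) >= threshold

skeleton = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.6, 0.05, 0.0]])  # placeholder primitives
samples = np.random.rand(1000, 3) * 1.2 - 0.1
print(inside_surface(samples, skeleton).sum(), "sample points fall inside the implicit shape")
```

Moving the skeleton primitives moves the equi-potential surface with them, which is what makes the representation convenient for animating a deforming tongue.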
Other Methods

Lip modeling by algebraic functions adjusts the coefficients of a set of continuous functions to best fit the contours of 22 reference lip shapes [Marigny 1996]. Five parameters measured from video sequences predict the algebraic equations of the various lip shape contours. The model computes the contact forces during lip interaction by virtue of a volumetric model created from an implicit surface. High-resolution lip animation is produced with this method. Adjoudani et al. associate a small set of observed mouth shape parameters with a polygonal lip mesh [Adjoudani 1995]. Other methods for synthesized speech and modeling of lip shapes are found in [Brooke 1983][Cohen 1990][Morishima 1990][Pearce 1986][Pelachaud 1991][Saintourens 1990].

A.13. Conclusion

The generation of facial models and animation can be summarized as follows. First, an individual-specific model is obtained using a laser scanner or stereo images and fitted to a prearranged generic mesh by a scattered data interpolation technique or by one of the other methods discussed in section A.10. Second, the constructed individual facial model is deformed to produce facial expressions based on (simulated) muscle mechanisms, the finite element method, or 2D and 3D morphing techniques. Wrinkles and vascular effects are also considered for added realism. Third, the complete facial animation is driven by the Facial Action Coding System or by tracking a human actor in video footage.

We described and surveyed the issues associated with facial modeling and animation, organizing a wide range of approaches into categories that reflect the similarities between methods. Two major themes in facial modeling and animation are geometry manipulations and image manipulations. Balanced and coupled in various ways, variations of these themes often achieve realistic facial animations.

Appendix B

B. Radial Basis Functions Fundamentals

A continuous multivariate function f(x) can be approximated by a function F(x, w) with an appropriate choice of parameter set w, where x and w are real vectors [Poggio 1989]. Finding a parameter set w is referred to as learning or training in the neural network sense. In the training stage, the goal is to find the w that, given the approximation function F and a set of training examples, provides the best approximation of f.

Radial basis functions (RBFs) are often chosen as the approximation function F because of their power to deal with irregular sets of data in multi-dimensional space when approximating high-dimensional smooth surfaces. Radial basis functions are so named because of their radially symmetric distance parameters. The most frequently used RBF is the Gaussian function $h(r) = e^{-(r/c)^2}$.
The characteristic of the Gaussian function is that it converges to zero for large distances r. Therefore, the function response is predictable even if an input is fairly different from the training set. In addition, the Gaussian function produces decent results without shortcut connections between the input and output layers when implemented as a neural network. Another RBF extensively used in hyper-surface interpolation is the Hardy multiquadric $h(r) = \sqrt{r^2 + c^2}$ [Hardy 1971]. Hardy multiquadrics are not strongly affected by the distribution of the data points, and the constructed surface is very smooth [Franke 1982]. The thin plate spline $h(r) = r^2 \log r$, with an additional linear term, is another useful RBF. The simplest RBF is the linear function $h(r) = r$, which is rarely used in practice.

B.1. Cost Function Minimization

Since the data are generally insufficient in number and noisy, a priori assumptions about the mapping are needed. Unless specified otherwise, the most general and weakest constraint that makes the approximation possible is the smoothness of the function: when the change in input is small, the change in output should also be small. Techniques that exploit such a priori assumptions in approximation are formulated in the context of regularization. The cost function can be written as

$C(w) = e^T e + \lambda\, w^T w$   (B-1)

where $e$ is the error vector whose components are the differences between the actual and estimated values, $e_i = y_i - F(x_i, w)$. The second term encodes the prior knowledge about $F(x, w)$. The regularization parameter $\lambda$ is added to avoid overfitting by penalizing the large weights $w$ that would result from minimizing the first term alone. The minimum of $C(w)$ is therefore determined by a balance between the two terms.

B.2. Approximation/Interpolation with Radial Basis Functions

The principle of radial basis functions derives from the theory of the approximation of multivariate functions [Poggio 1989]. Given N pairs $(x_i, y_i)$, with $\{x_i \in R^n \mid i = 1, 2, \ldots, N\}$ and $\{y_i \in R \mid i = 1, 2, \ldots, N\}$ as a training set, we look for a parameter set $w$ with an approximation function of the form

$y = F(x) = \sum_{j=1}^{M} w_j\, h(\| x - c_j \|)$   (B-2)

where $h$ is the radial basis function and the $c_j$ are the M centers of the function $h$. $\| x_i - c_j \|$ denotes the Euclidean distance between each center $c_j$ and the data point $x_i$. In the simplest case, the given data $x$ are also used as the centers $c$; equation (B-2) then becomes

$y_i = F(x_i) = \sum_{j=1}^{N} w_j\, h(\| x_i - x_j \|)$   (B-3)

More generally, depending on the RBF used, the equation can take the form

$y_i = F(x_i) = \sum_{j=1}^{N} w_j\, h(\| x_i - x_j \|) + \sum_{j=1}^{K} d_j\, p_j(x_i), \qquad K \le N$   (B-4)

where the $d_j$ are coefficients to be computed and the $p_j$ are polynomial terms. Since there are N + K unknowns (the $w_j$ and $d_j$) but only N linear equations, an additional constraint is required:

$\sum_{j=1}^{N} w_j\, p_k(x_j) = 0, \qquad k = 1, 2, \ldots, K$   (B-5)

B.3. System Solutions

For the equations in section B.2 to have a solution, the basis matrix $H$ must be invertible, where $(H)_{ij} = h_j(x_i) = h(\| x_i - x_j \|)$. If $H$ is an N x N square matrix, as in equation (B-3), the solution is

$w = H^{-1} y$   (B-6)
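Before turning to the rectangular and regularized cases, here is a minimal sketch of exact interpolation in the square-matrix case of equations (B-3) and (B-6). It assumes a Gaussian basis with a fixed width c and 1D toy data; these choices are illustrative only.

```python
import numpy as np

def gaussian(r, c=0.5):
    return np.exp(-(r / c) ** 2)

x_train = np.linspace(0.0, 1.0, 10)               # training inputs, also used as the centers
y_train = np.sin(2 * np.pi * x_train)             # target values

H = gaussian(np.abs(x_train[:, None] - x_train[None, :]))   # (H)_ij = h(||x_i - x_j||)
w = np.linalg.solve(H, y_train)                   # equation (B-6): w = H^{-1} y

def F(x):
    # Equation (B-3): weighted sum of basis responses at the query points.
    return gaussian(np.abs(np.atleast_1d(x)[:, None] - x_train[None, :])) @ w

print(F(x_train[:3]))    # reproduces y_train at the training points (exact interpolation)
print(F([0.05, 0.55]))   # smooth values between the training points
```

With noisy data this exact fit reproduces the noise as well, which motivates the regularized solutions that follow.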
If $H$ is an N x M rectangular matrix, as in equation (B-2), the solution is

$w = H^{+} y = (H^T H)^{-1} H^T y$   (B-7)

where $H^{+}$ is the pseudo-inverse [Penrose 1955]. Because of the presence of spurious data in general, and to incorporate a priori knowledge about the function, the cost function in equation (B-1) is minimized to solve for $w$, yielding

$w = (H^T H + \lambda I)^{-1} H^T y$   (B-8)

which can be simplified to

$w = (H + \lambda I)^{-1} y$   (B-9)

if $H$ is a square matrix [Tikhonov 1977], where $I$ is the identity matrix. The magnitude of the regularization parameter $\lambda$ is known to be proportional to the amount of noise. Equations (B-8) and (B-9) become identical to (B-7) and (B-6), respectively, by setting $\lambda$ to 0, i.e., by ignoring the regularization parameter in the cost function (B-1).

B.4. Regularization Parameter

With generalized cross-validation (GCV) [Golub 1979] as a model selection criterion, an iterative estimation formula is given for the regularization parameter [Orr 1998]:

$\lambda = \dfrac{e^T e \;\mathrm{tr}(A^{-1} - \lambda A^{-2})}{(N - \gamma)\; w^T A^{-1} w}$   (B-10)

where N is the number of training data, $A = H^T H + \lambda I$, and $\gamma$ is the effective number of parameters [Moody 1992], $\gamma = M - \lambda\,\mathrm{tr}(A^{-1})$. Without the regularization parameter, the effective number of parameters becomes the number of basis functions, M. To solve the iterative equation (B-10), it is necessary first to compute the eigenvalues $\mu_i$ and eigenvectors $u_i$ of $H H^T$, and the projections of the target vector $y$ onto the eigenvectors, $z_i = y^T u_i$. The terms in equation (B-10) can then be computed as

$e^T e = \sum_{i=1}^{N} \dfrac{\lambda^2 z_i^2}{(\mu_i + \lambda)^2}$   (B-11)

$w^T A^{-1} w = \sum_{i=1}^{N} \dfrac{\mu_i z_i^2}{(\mu_i + \lambda)^3}$   (B-12)

$\gamma = \sum_{i=1}^{N} \dfrac{\mu_i}{\mu_i + \lambda}$   (B-13)

$\mathrm{tr}(A^{-1} - \lambda A^{-2}) = \sum_{i=1}^{N} \dfrac{\mu_i}{(\mu_i + \lambda)^2}$   (B-14)

The iteration is stopped when

$\mathrm{GCV} = \dfrac{N\, e^T e}{(N - \gamma)^2}$   (B-15)

converges, i.e., when the difference between the previous GCV value and the current value becomes negligible.

B.5. Feedforward Neural Network

A radial basis function expansion can be considered a feedforward network with one hidden layer. The input layer comprises the elements of the vector $x$. The second layer, the nonlinear hidden layer, is fully connected to the input layer, and the connections between the input layer and the hidden layer are described by the center vectors $c_i$. The M components of the summation in equation (B-2) are represented as individual units in the hidden layer; each hidden unit computes the Euclidean distance between the input data and its center $c_i$. The output layer, again fully connected to the hidden layer, is composed of one or more linear units. The connections between the hidden layer and the output layer are represented by the $w_i$, the unknown coefficients of the RBF expansion. There are also direct connections between the input layer and the output layer, represented by the $d_j$ in equation (B-4). The activation of an output neuron is determined by the weighted sum of its inputs, with weights $w_i$ and $d_j$. For the classification of input patterns, a fixed nonlinear invertible function (e.g., a sigmoid) is often used for the activation of the output unit. Figure B-1 shows the architecture of the three-layer feedforward neural network.

Figure B-1 Radial Basis Function network
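To close, here is a minimal sketch of the regularized solution (B-8) together with GCV-based selection of the regularization parameter from section B.4. For simplicity it evaluates GCV (B-15) over a grid of candidate values of lambda rather than using the iterative re-estimation formula (B-10); the Gaussian kernel, kernel width, and synthetic noisy data are assumptions.

```python
import numpy as np

def gaussian(r, c=0.5):
    return np.exp(-(r / c) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy targets

H = gaussian(np.abs(x[:, None] - x[None, :]))
N = x.size

def fit(lmbda):
    A = H.T @ H + lmbda * np.eye(N)
    w = np.linalg.solve(A, H.T @ y)                   # equation (B-8)
    e = y - H @ w
    gamma = np.trace(H @ np.linalg.solve(A, H.T))     # effective number of parameters
    gcv = N * (e @ e) / (N - gamma) ** 2              # equation (B-15)
    return w, gcv

lambdas = 10.0 ** np.arange(-6, 1)                    # candidate regularization parameters
best_lambda = min(lambdas, key=lambda l: fit(l)[1])
w, _ = fit(best_lambda)
print("selected lambda:", best_lambda)
```

The selected lambda grows with the noise level of the targets, consistent with the observation above that its magnitude is proportional to the amount of noise.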