doi:10.1016/j.imavis.2005.10.009
Copyright © 2006 Elsevier B.V. All rights reserved.
Fast stochastic optimization for articulated structure tracking
a Computer Vision Laboratory, Swiss Federal Institute of Technology (ETH), Sternwartstrasse 7, 8092 Zürich, Switzerland
b National ICT Australia, Canberra, NSW 2000, Australia
Received 15 October 2004;
revised 5 August 2005;
accepted 11 October 2005.
Available online 18 April 2006.
Abstract
Recently, an optimization approach for fast visual tracking of articulated structures based on stochastic meta-descent (SMD) [7] has been presented. SMD is a gradient descent method with local step size adaptation that combines rapid convergence with excellent scalability. Stochastic sampling helps to avoid local minima in the optimization process. We have extended the SMD algorithm with new features for fast and accurate tracking: the step sizes are adapted both between and within video frames, and a robust cost function incorporates both depths and surface orientations. 3D hand tracking experiments demonstrate the advantages of the resulting tracker over state-of-the-art methods. A realistic deformable hand model further improves the accuracy of our tracker.
Keywords: Stochastic meta-descent; Hand tracking; Deformable hand model
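The abstract's description of SMD as gradient descent with local (per-parameter) step size adaptation and stochastic sampling can be illustrated with a minimal sketch. It is not the tracker described in the article: the cost below is a hypothetical noise-free linear least-squares problem standing in for the depth and orientation cost, and the function names, parameter values (`p0`, `mu`, `lam`, `rho`) and the toy data are assumptions chosen for illustration. The update rules follow the commonly cited form of SMD; the exact variant used in the article is not given in this excerpt.

```python
import random


def make_toy_problem(n_rows=40, seed=1):
    """Hypothetical stand-in for a tracking cost: noise-free linear residuals."""
    rng = random.Random(seed)
    true_theta = [1.0, -2.0, 0.5]
    rows = [[rng.uniform(-1.0, 1.0) for _ in true_theta] for _ in range(n_rows)]
    targets = [sum(a * t for a, t in zip(row, true_theta)) for row in rows]
    return rows, targets, true_theta


def smd_least_squares(rows, targets, theta0, n_iter=2000, batch=8,
                      p0=0.05, mu=0.1, lam=0.9, rho=0.5, seed=0):
    """Minimise the mean of 0.5 * residual^2 with an SMD-style update (sketch)."""
    rng = random.Random(seed)
    n = len(theta0)
    theta = list(theta0)
    p = [p0] * n   # one step size per parameter
    v = [0.0] * n  # long-term dependence of theta on the log step sizes
    for _ in range(n_iter):
        # Stochastic sampling: a fresh random subset of residuals per iteration.
        idx = rng.sample(range(len(rows)), batch)
        A = [rows[i] for i in idx]
        r = [sum(a * t for a, t in zip(row, theta)) - targets[i]
             for row, i in zip(A, idx)]
        # delta is the negative gradient of the subsampled cost, -A^T r / batch.
        delta = [-sum(row[j] * rk for row, rk in zip(A, r)) / batch
                 for j in range(n)]
        # Hessian-vector product H v, with H = A^T A / batch for this cost.
        Av = [sum(a * vj for a, vj in zip(row, v)) for row in A]
        Hv = [sum(row[j] * av for row, av in zip(A, Av)) / batch
              for j in range(n)]
        # Step sizes grow where v and delta agree in sign, shrink otherwise;
        # rho bounds how fast a step size may shrink in one iteration.
        p = [pj * max(rho, 1.0 + mu * vj * dj)
             for pj, vj, dj in zip(p, v, delta)]
        # Gradient step with the adapted per-parameter step sizes.
        theta = [tj + pj * dj for tj, pj, dj in zip(theta, p, delta)]
        # Update v with exponential forgetting controlled by lam.
        v = [lam * vj + pj * (dj - lam * hvj)
             for vj, pj, dj, hvj in zip(v, p, delta, Hv)]
    return theta, p
```

Calling `smd_least_squares(rows, targets, [0.0, 0.0, 0.0])` on the output of `make_toy_problem()` returns the estimated parameters together with the final per-parameter step sizes. In a tracking setting the parameters would be joint angles and pose, and the random subset would be sample points on the model surface; both are beyond what this sketch covers.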
Fig. 1. The hand model as polygonal surface (top). The hand model and its degrees of freedom (bottom), where DIP indicates the ‘distal interphalangeal’ joints, PIP the ‘proximal interphalangeal’ joints, MP the ‘metacarpophalangeal’ joints, IP the ‘interphalangeal’ and TM the ‘trapeziometacarpal’ joints.
Fig. 2. Input image of the ShapeSnatcher (a). Depth map extracted from the scene (a) and masked out by skin color segmentation (b).
Fig. 4. Sketch of the assumed displacement of xc given a displacement of .
Fig. 5. On the left (a), evolution of the tracker's speed according to the number of sample points and the sampling mode. On the right (b), illustration of a random subsampling performed on the hand model with 2 points per phalanx and 15 points for the visible part of the palm. When a hand part is not visible, the points expected on it are randomly chosen on any other visible part.
Fig. 6. Original sequence and 3D reconstructions of the experiment presented in Fig. 5(a).
Fig. 7. Comparison of the accumulated number of iterations when tracking with or without inter-frame step size adaptation. The solid line is with inter-frame initialization, the dashed line when step sizes are simply reset at the beginning of each frame.
Fig. 8. Evolution of the SMD algorithm while tracking. The top row shows this evolution projected on the translation in X and rotation in Z of the palm and the bottom rows present the resulting sequence.
Fig. 9. Evolution of the gradient descent algorithm ((a) and (b)) and of deterministic sampling ((c) and (d)) while tracking the same sequence as in Fig. 8. (b) shows the final result of gradient descent after 16 iterations: the hand model is slightly shifted from the target. (d) shows the final result of deterministic sampling after nine iterations: this approach clearly diverges from the correct solution.
Fig. 10. Comparison between SMD and alternative optimization algorithms. From top to bottom: the results of the SMD using a cost function E evaluating the distances and the differences in surface orientations, SMD using a cost function E only evaluating the distances, gradient descent, Powell and annealed particle filter (APF). The model is visualized as the vertices of the skin polygonal representation. The pictures have been cropped for better visibility, but see Fig. 11 for examples of similar, complete frames.
Fig. 11. Hand tracking with self-occlusions. Far right: 3D model for frame 103 (top view).
Fig. 12. Spelling the word ‘FLY’ in American Sign Language with SMD: the top row shows the letter ‘F’, the second row the letter ‘L’ and the last row the letter ‘Y’. The last column is the 3D reconstruction of the frame in the third column.
Table 1.
Comparison of the computation time of the Fig. 10
