doi:10.1016/j.imavis.2005.12.018
Copyright © 2006 Elsevier B.V. All rights reserved.
Estimating 3D hand pose using hierarchical multi-label classification
a Toshiba Cambridge Research Laboratory, 1 Guildhall Street, Cambridge CB2 3NH, UK
b University of Cambridge, Department of Engineering, Cambridge CB2 1PZ, UK
c Oxford Brookes University, Department of Computing, Oxford OX33 1HX, UK
Available online 6 October 2006.
Abstract
This paper presents an analysis of the design of classifiers for use in a hierarchical object recognition approach. In this approach, a cascade of classifiers is arranged in a tree in order to recognize multiple object classes. We are interested in the problem of recognizing multiple patterns, as it is closely related to the problem of locating an articulated object. Each different pattern class corresponds to the hand in a different pose, or set of poses. For this problem, obtaining labelled training data of the hand in a given pose can be problematic. Given a parametric 3D model, generating training data in the form of example images is cheap, and we demonstrate that it can be used to design classifiers almost as good as those trained using non-synthetic data. We compare a variety of different template-based classifiers and discuss their merits.
Keywords: Computer vision; Pose estimation; Hand detection; Multi-class classification; Human–computer interaction
Fig. 1. Cascade of classifiers. (a) A cascade of classifiers for a single object class, where each classifier has a high detection rate and a moderate false positive rate. (b) Classifiers in a tree structure; in a tree-based object recognition scheme each leaf corresponds to a single object class. When objects in the subtrees have similar appearance, classifiers can be used to quickly prune the search. A binary tree is shown here, but the branching factor can be larger than 2.
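The tree search described in Fig. 1(b) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `in_range` toy classifier and the use of a single angle as the "image region" are all assumptions made for illustration. The point is only that a rejection at an inner node prunes every pose class below it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One tree node: a classifier covering every pose class in its subtree."""
    name: str
    accepts: Callable[[float], bool]  # hypothetical classifier: True = keep searching
    children: List["Node"] = field(default_factory=list)


def in_range(lo: float, hi: float) -> Callable[[float], bool]:
    """Toy stand-in for a template classifier: accept angles in [lo, hi)."""
    return lambda angle: lo <= angle < hi


def detect(node: Node, region: float) -> List[str]:
    """Return the leaf classes reached without any classifier rejecting."""
    if not node.accepts(region):
        return []  # rejection: the whole subtree is pruned
    if not node.children:
        return [node.name]  # leaf: a single object class
    hits: List[str] = []
    for child in node.children:
        hits.extend(detect(child, region))
    return hits


# Hypothetical binary tree over a pose angle in degrees.
tree = Node("all", in_range(0, 90), [
    Node("low", in_range(0, 45), [
        Node("pose_0_22", in_range(0, 22.5)),
        Node("pose_22_45", in_range(22.5, 45)),
    ]),
    Node("high", in_range(45, 90), [
        Node("pose_45_67", in_range(45, 67.5)),
        Node("pose_67_90", in_range(67.5, 90)),
    ]),
])
```

Calling `detect(tree, 50.0)` evaluates the root, both level-1 nodes and only the two leaves under "high"; the leaves under "low" are never visited.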
Fig. 2. Examples of training and test images. The following are example image regions used for training and testing a linear classifier: (a) positive training examples, (b) test images, (c) negative training examples containing a hand in different poses, (d) negative examples containing office scenes as background.
Fig. 3. Templates and feature maps used for classification. This figure shows the different choices of classifiers A (left) and the corresponding feature maps B (right) used in the experiments. (a) Centre template with DT of edge image (chamfer). (b) Centre template with dilated edge image (Hausdorff). (c) Averaged template with edge image. (d) Union template with edge image. (e) Template learnt from data with edge image.
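The chamfer classifier in Fig. 3(a) pairs a template with the distance transform (DT) of the edge image. A rough sketch of that idea follows. It assumes a city-block DT computed with a two-pass scan and an untruncated mean cost, neither of which is stated in the captions; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def distance_transform(edges: Sequence[Sequence[int]]) -> List[List[int]]:
    """City-block distance from every pixel to the nearest edge pixel.

    Assumes the edge image contains at least one non-zero pixel.
    """
    h, w = len(edges), len(edges[0])
    inf = h + w  # larger than any possible city-block distance in the image
    dt = [[0 if edges[y][x] else inf for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(h):
        for x in range(w):
            if y > 0:
                dt[y][x] = min(dt[y][x], dt[y - 1][x] + 1)
            if x > 0:
                dt[y][x] = min(dt[y][x], dt[y][x - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in reversed(range(h)):
        for x in reversed(range(w)):
            if y < h - 1:
                dt[y][x] = min(dt[y][x], dt[y + 1][x] + 1)
            if x < w - 1:
                dt[y][x] = min(dt[y][x], dt[y][x + 1] + 1)
    return dt


def chamfer_cost(dt: Sequence[Sequence[int]],
                 template: Sequence[Tuple[int, int]],
                 offset: Tuple[int, int]) -> float:
    """Mean DT value under the template points (row, col) placed at offset."""
    oy, ox = offset
    return sum(dt[oy + ty][ox + tx] for ty, tx in template) / len(template)


# Toy 5x5 edge image with a single edge pixel in the centre.
edges = [[0] * 5 for _ in range(5)]
edges[2][2] = 1
dt = distance_transform(edges)
```

Because the DT is computed once per image, each template placement costs only one lookup per template point, which is what makes evaluating many templates affordable.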
Fig. 4. Including edge orientation improves classification performance. This example shows the classification results on a test set using a marginalized template. (a) Histogram of classifier output using edges without orientation information: hand in correct pose (red/light), hand in incorrect pose (blue) and background regions (black, dashed line). (b) Histogram of classifier output using edges with orientation information. The classes are clearly better separated. (c) The corresponding ROC curve.
Fig. 5. ROC curves for classifiers. This figure shows the ROC curve for each of the classifiers. (a) Edge features alone, and (b) oriented edges. Note the difference in scale of the axes. The classifier trained on real image data performs best, the marginalized templates all show similar results, and chamfer matching is slightly better than Hausdorff matching in this experiment. When used within a cascade structure, the performance at high detection rates is important.
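Since cascade performance depends on the high-detection-rate end of the ROC curve, one useful summary is the false positive rate at a fixed detection rate. The sketch below shows one way to read that operating point from raw classifier outputs. It assumes that a higher score means "more hand-like"; the function name and the sample scores are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fpr_at_detection_rate(pos_scores: Sequence[float],
                          neg_scores: Sequence[float],
                          target: float = 0.99) -> Tuple[float, float]:
    """Return (threshold, false positive rate) at a detection rate >= target."""
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must be accepted to reach the target rate.
    k = max(1, math.ceil(target * len(ranked)))
    threshold = ranked[k - 1]
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return threshold, false_pos / len(neg_scores)


# Toy scores, not taken from the paper.
pos = [0.9, 0.8, 0.7, 0.2]
neg = [0.1, 0.3, 0.75, 0.6]
threshold, fpr = fpr_at_detection_rate(pos, neg, target=0.75)
```

In this toy case three of the four positives must pass, so the threshold is 0.7 and one of the four negatives is accepted.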
Fig. 6. Efficient evaluation of colour likelihoods. (a) The skin-colour log-likelihood image encoded as a greyscale image. Higher intensity corresponds to higher likelihood of skin vs. non-skin colour. (b) The sum table contains the cumulative sum of values in image (a) along the x-direction. The sum of values within an area can be efficiently computed by adding and subtracting values at silhouette points only. The greyscale intensities are scaled to the range [0, 255] in both images.
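The sum-table trick in Fig. 6 can be sketched as follows. The sketch assumes the silhouette is given as one inclusive span `(row, x_start, x_end)` per scanline, which is a simplification chosen for illustration; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def row_sum_table(image: Sequence[Sequence[float]]) -> List[List[float]]:
    """Cumulative sum of each row along the x-direction."""
    table = []
    for row in image:
        running, out = 0, []
        for value in row:
            running += value
            out.append(running)
        table.append(out)
    return table


def region_sum(table: Sequence[Sequence[float]],
               spans: Sequence[Tuple[int, int, int]]) -> float:
    """Sum the image inside a region using only its boundary columns.

    Each span is (row, x_start, x_end), both ends inclusive.
    """
    total = 0
    for y, x0, x1 in spans:
        left = table[y][x0 - 1] if x0 > 0 else 0
        total += table[y][x1] - left  # one add and one subtract per span
    return total


# Toy "log-likelihood" image and a three-row silhouette.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = row_sum_table(image)
spans = [(0, 1, 2), (1, 0, 1), (2, 2, 2)]
inside = region_sum(table, spans)
```

The cost per template is proportional to the number of silhouette points rather than to the area enclosed.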
Fig. 7. Determining the weighting factor in the cost function. The distribution of edge and colour cost values for a number of positive (lower left) and negative (upper right) training examples is shown, together with the separating line found by a maximum-margin linear classifier. The weighting factor is set to the negative inverse of the slope of this line.
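The rule for the weighting factor in Fig. 7 can be checked with a short derivation. It assumes that the edge cost $c_e$ is plotted on the horizontal axis and the colour cost $c_c$ on the vertical axis, and that the combined cost has the form $c_e + \lambda c_c$; the caption states neither, and the symbols $w_e$, $w_c$, $b$, $m$ and $\lambda$ are introduced here only for illustration.

```latex
% Separating line found by the linear classifier (assumed axes: c_e horizontal, c_c vertical)
w_e c_e + w_c c_c + b = 0
\;\Longrightarrow\;
c_c = m\, c_e - \frac{b}{w_c},
\qquad m = -\frac{w_e}{w_c}.

% Weighting factor: negative inverse of the slope
\lambda = -\frac{1}{m} = \frac{w_c}{w_e}.

% Level sets of the assumed combined cost C = c_e + \lambda c_c
C = c_e + \lambda c_c = \text{const}
\;\Longrightarrow\;
c_c = -\frac{1}{\lambda}\, c_e + \frac{\text{const}}{\lambda},
\qquad -\frac{1}{\lambda} = m.
```

Under these assumptions the level sets of the combined cost are parallel to the separating line, so thresholding the combined cost reproduces the decision of the linear classifier.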
Fig. 8. Detection with integrated edge and colour features. This illustrative example shows how a cost function that uses edge and colour features improves detection. For each input image the best match is shown in the last column. (Top row) Hand in front of cluttered background, and (bottom row) hand in front of face.
Fig. 9. Detection results using edge and colour information. This figure shows successful detection of an open hand moving with 4 DOF. The first two columns show the frame number and the input frame, the next two columns show the Canny edge map and the skin-colour likelihood. The last column shows the best match superimposed, if the likelihood function is above a constant threshold. The sequence is challenging because the background contains skin-coloured objects (frame 0) and motion is fast, leading to motion blur and missed edges. The detection handles some partial occlusion (frame 100), recovers from loss of track (frames 257 and 390), can deal with lighting changes (frame 516) and unsteady camera motion (frame 598).
Fig. 10. Error comparison between UKF tracking and detection. This figure shows the error performance of the UKF tracker and detection on the image sequence in Fig. 9. The hand position error was measured against manually labelled ground truth. The shaded areas (blue) indicate intervals in which the hand is either fully occluded or out of camera view. The detection algorithm successfully finds the hand in the whole sequence, whereas the UKF tracker using skin-colour edges is only able to track the hand for a few frames. The loss of track occurs because the hand motion is fast between two frames and because skin-colour edges cannot be reliably found in this input sequence.
Fig. 11. Hierarchical detection of a pointing hand. Left: input image. Next: images with classification results superimposed. Each square represents an image location which contains at least one positive classification result. Higher intensity indicates a larger number of matches. The face and the second hand introduce ambiguity. Regions are progressively eliminated; the best match is shown on the right.
Fig. 12. Search results at different levels of the tree. This figure shows typical examples of accepted and rejected templates at levels 1–3 of the tree, ranked according to matching cost shown below. As the search is refined at each level, the difference between accepted and rejected templates decreases.
Table 1.
Computation times for correlating templates

The execution times for computing the dot product of 10,000 image patches of size 128 × 128, where only the non-zero coefficients are correlated for efficiency, measured on a 2.4 GHz Pentium IV machine. The last column shows the false positive rates for each classifier at a fixed detection rate of 0.99.
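The saving from correlating only the non-zero coefficients can be sketched as below. This is a toy illustration on flattened lists, not the measured implementation; the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def to_sparse(template: Sequence[float]) -> List[Tuple[int, float]]:
    """Keep only (index, weight) pairs for the non-zero template coefficients."""
    return [(i, w) for i, w in enumerate(template) if w != 0]


def sparse_dot(sparse: Sequence[Tuple[int, float]],
               patch: Sequence[float]) -> float:
    """Dot product that touches only the stored non-zero coefficients."""
    return sum(w * patch[i] for i, w in sparse)


def dense_dot(template: Sequence[float], patch: Sequence[float]) -> float:
    """Reference dot product over every coefficient."""
    return sum(w * p for w, p in zip(template, patch))


# Flattened toy template with two non-zero coefficients out of six.
template = [0, 2, 0, 0, -1, 0]
patch = [5, 1, 7, 3, 4, 9]
sparse = to_sparse(template)
```

The sparse form is built once per template and reused for every image patch, so the per-patch work scales with the number of non-zero coefficients rather than with the patch size.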