ABSTRACT
MocapNETs are state-of-the-art Neural Network (NN) ensembles that estimate 3D human pose from visual input in the form of an RGB image. They do so by deriving a 3D Bio Vision Hierarchy (BVH) skeleton from estimated 2D human body joint projections. BVH output makes MocapNETs directly compatible with a large variety of 3D graphics engines, where virtual avatars can be animated directly from RGB sources and off-the-shelf webcam input. MocapNETs have satisfactory accuracy and state-of-the-art computational performance which, however, was not sufficient prior to this work for deployment on embedded devices. In this paper we explore dimensionality reduction via Principal Component Analysis (PCA) as a means to reduce their size and make them applicable to mobile and edge devices. PCA allows (a) reduction of input dimensionality, (b) fine-grained control over the variance covered by the retained dimensions, and (c) drastic reduction of the total number of model/network parameters without compromising regression accuracy. Extensive experiments on the CMU BVH dataset provide insight into the effective receptive fields of densely connected networks. Moreover, PCA-based dimensionality reduction results in an NN that is 35% smaller than the baseline (the original NN without any dimensionality reduction) and derives BVH skeletons without accuracy degradation. As a result, the proposed compact NN becomes deployable on the Raspberry Pi 4 ARM CPU at 23 Hz.
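Points (b) and (c) can be illustrated with a minimal sketch. It is not the authors' implementation: the eigenvalue spectrum, the 90% variance target, and the layer sizes below are hypothetical values chosen only for illustration, and the helper names `components_for_variance` and `dense_parameter_count` are assumptions. The first helper picks the smallest number of principal components whose cumulative explained variance reaches a target fraction; the second counts the weights and biases of a densely connected network, so the effect of a smaller input layer can be read off directly.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target):
    """Return the smallest number of leading principal components whose
    cumulative explained-variance ratio reaches `target` (a fraction in (0, 1])."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        # Small tolerance guards against floating-point round-off at target=1.0.
        if cumulative / total >= target - 1e-12:
            return k
    return len(ordered)


def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network whose layer
    widths (input first, output last) are given in `layer_sizes`."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical covariance eigenvalues of a 6-dimensional input (illustration only).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.25, 0.25]

# Keep enough components to cover 90% of the input variance (assumed target).
k = components_for_variance(eigenvalues, 0.90)

# Hypothetical network: input -> 8 hidden units -> 3 outputs.
baseline = dense_parameter_count([len(eigenvalues), 8, 3])
compact = dense_parameter_count([k, 8, 3])

print(f"components kept: {k} of {len(eigenvalues)}")
print(f"parameters: baseline={baseline}, compact={compact}")
```

In this toy setting only the first layer shrinks, because only the input width changes. The savings reported in the abstract come from the authors' own architecture and experiments, which this sketch does not reproduce.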
Index Terms
- Compacting MocapNET-based 3D Human Pose Estimation via Dimensionality Reduction