ABSTRACT
In this paper, we address the multi-person densepose estimation problem, which aims to learn dense correspondences between 2D pixels of the human body and a 3D surface. The problem remains challenging because real-world scenes exhibit scale variations and occlusion, and because annotations are insufficient. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; and 2) how to equip this pipeline to handle limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel densepose estimation framework based on a two-stage pipeline, called Knowledge Transfer Network (KTN). Unlike existing works, which directly propagate the pyramidal base features of regions, we enhance the representation power of these features with a multi-instance decoder (MID). MID can effectively distinguish the target instance from other interfering instances and the background. We then introduce a knowledge transfer machine (KTM), which improves densepose estimation by utilizing external commonsense knowledge. Notably, with the help of KTM, current densepose estimation systems (based on either RCNN or fully-convolutional frameworks) can be improved in accuracy. Extensive experiments on densepose estimation benchmarks demonstrate the superiority and generalizability of our approach. Our code and models will be publicly available.
Index Terms
- KTN: Knowledge Transfer Network for Multi-person DensePose Estimation