KTN: Knowledge Transfer Network for Multi-person DensePose Estimation

Authors:
Xuanhan Wang, University of Electronic Science and Technology of China, Chengdu, China
Lianli Gao, University of Electronic Science and Technology of China, Chengdu, China
Jingkuan Song, University of Electronic Science and Technology of China, Chengdu, China
Heng Tao Shen, University of Electronic Science and Technology of China, Chengdu, China

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 3780–3788. https://doi.org/10.1145/3394171.3414014

Published: 12 October 2020

ABSTRACT

In this paper, we address the multi-person densepose estimation problem, which aims to learn dense correspondences between 2D pixels of the human body and a 3D surface. The task remains challenging because real-world scenes exhibit scale variation, occlusion, and insufficient annotations. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; and 2) how to equip this pipeline to handle limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel densepose estimation framework based on a two-stage pipeline, called Knowledge Transfer Network (KTN). Unlike existing works, which directly propagate the pyramidal base features of regions, we enhance the representation power of these features with a multi-instance decoder (MID), which distinguishes the target instance from interfering instances and the background. We then introduce a knowledge transfer machine (KTM), which improves densepose estimation by exploiting external commonsense knowledge. Notably, KTM can also improve the accuracy of existing densepose estimation systems, whether they are based on R-CNN or fully convolutional frameworks. Extensive experiments on densepose estimation benchmarks demonstrate the superiority and generalizability of our approach. Our code and models will be publicly available.
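
To make "dense correspondence" concrete: in the original DensePose formulation (Güler et al., 2018), each foreground pixel is assigned a body-part index I together with (U, V) coordinates on that part's surface chart. The sketch below is only an illustration of that output representation, not of KTN's MID or KTM. The function name `decode_iuv`, the input layout, and the toy setup with 2 parts (instead of DensePose's 24) are assumptions made for this example.

```python
from typing import List, Tuple

# Toy setup (assumption for illustration): DensePose uses 24 body parts plus
# background; here we use 2 parts plus background to keep the example small.
BACKGROUND = 0


def decode_iuv(
    part_scores: List[List[float]],
    uv_per_part: List[List[Tuple[float, float]]],
) -> List[Tuple[int, float, float]]:
    """Decode per-pixel (I, U, V) from hypothetical network outputs.

    part_scores[p][k] : score of class k at pixel p (k = 0 is background).
    uv_per_part[p][k] : (U, V) regressed for class k at pixel p.
    Returns one (I, U, V) triple per pixel; background gets (0, 0.0, 0.0).
    """
    decoded = []
    for scores, uvs in zip(part_scores, uv_per_part):
        # Pick the highest-scoring class for this pixel.
        part = max(range(len(scores)), key=lambda k: scores[k])
        if part == BACKGROUND:
            decoded.append((BACKGROUND, 0.0, 0.0))
            continue
        u, v = uvs[part]
        # Clamp to the unit square, the usual range of UV chart coordinates.
        u = min(max(u, 0.0), 1.0)
        v = min(max(v, 0.0), 1.0)
        decoded.append((part, u, v))
    return decoded


# Three pixels: background, part 1, and part 2 with an out-of-range U value.
scores = [[0.9, 0.05, 0.05], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
uvs = [
    [(0.0, 0.0), (0.3, 0.4), (0.5, 0.5)],
    [(0.0, 0.0), (0.25, 0.75), (0.5, 0.5)],
    [(0.0, 0.0), (0.3, 0.4), (1.2, 0.1)],
]
result = decode_iuv(scores, uvs)
print(result)  # [(0, 0.0, 0.0), (1, 0.25, 0.75), (2, 1.0, 0.1)]
```

Because every pixel needs both a part label and UV targets, rarely visible parts receive few supervised pixels, which is one way the limited-annotation and class-imbalance issues named in the abstract can arise.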

Supplemental Material

3394171.3414014.mp4 (mp4, 31.5 MB)

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
General Chairs: Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China), Rita Cucchiara (UNIMORE, Italy), Xian-Sheng Hua (Alibaba Group, China)
Program Chairs: Guo-Jun Qi (Futurewei Technologies, USA), Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy), Zhengyou Zhang (Tencent, China), Roger Zimmermann (National University of Singapore, Singapore)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
2D-to-3D surface estimation
commonsense knowledge transfer
human densepose estimation
human instance-level analysis
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%

Article Metrics
- Total Citations: 9
- Total Downloads: 228
- Downloads (Last 12 months): 21
- Downloads (Last 6 weeks): 2