DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Authors:
Haowen Wang, Beijing University of Posts and Telecommunications, Beijing, China (ORCID 0009-0003-7994-0547)
Zhipeng Fan, Beijing University of Posts and Telecommunications, Beijing, China (ORCID 0009-0003-7067-8220)
Zhen Zhao, Midea Group, Beijing, China (ORCID 0009-0009-4683-8910)
Zhengping Che, Midea Group, Beijing, China (ORCID 0000-0001-6818-1125)
Zhiyuan Xu, Midea Group, Beijing, China (ORCID 0000-0003-2879-3244)
Dong Liu, Midea Group, Shanghai, China (ORCID 0009-0003-4913-2688)
Feifei Feng, Midea Group, Shanghai, China (ORCID 0009-0003-8612-5022)
Yakun Huang, Beijing University of Posts and Telecommunications, Beijing, China (ORCID 0000-0003-4051-0200)
Xiuquan Qiao, Beijing University of Posts and Telecommunications, Beijing, China (ORCID 0000-0002-0140-0650)
Jian Tang, Midea Group, Beijing, China (ORCID 0000-0003-4418-0114)
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 3676–3685. https://doi.org/10.1145/3581783.3612142

Published: 27 October 2023

ABSTRACT

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. Other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but their static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the field to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
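
The idea of deforming one category template into many instances can be illustrated with a toy implicit field. The sketch below is not the DTF-Net architecture: it assumes the deformation maps instance-space points into template space (the abstract does not state which direction the paper uses), it replaces the learned template with a unit sphere, and it replaces the learned deformation network with a per-axis scaling. The names `template_field`, `deform`, `instance_field`, and the three-number `code` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def template_field(p: Point) -> float:
    """Implicit value of a stand-in category template: a unit sphere.

    Negative inside, zero on the surface, positive outside.
    """
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - 1.0


def deform(p: Point, code: Sequence[float]) -> Point:
    """Hypothetical deformation conditioned on an instance code.

    Here the code is just three per-axis extents; a learned model would
    predict the mapping from a latent vector instead.
    """
    return (p[0] / code[0], p[1] / code[1], p[2] / code[2])


def instance_field(p: Point, code: Sequence[float]) -> float:
    """Evaluate an instance by querying the template at the deformed point."""
    return template_field(deform(p, code))


if __name__ == "__main__":
    tall_thin = (0.5, 0.5, 2.0)  # made-up instance code
    print(instance_field((0.0, 0.0, 2.0), tall_thin))  # point on the instance surface
    print(instance_field((0.0, 0.0, 0.0), tall_thin))  # point inside the instance
```

Because every instance is evaluated through the same template, two points on different instances that deform to the same template location can be treated as corresponding points, which is one way to read the "continuous shape correspondences" mentioned in the abstract. With anisotropic scaling the returned value is no longer a true distance, although its zero level set still gives the instance surface.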

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object detection
        Reconstruction
      2. Computer vision tasks
        Vision for robotics

Published in
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy
Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3d shape reconstruction
6d pose estimation
robotic vision
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%