ABSTRACT
Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods learn geometric features that correspond to specific templates and disregard the shape variations and pose differences among objects in the same category. As a result, these methods underperform on unseen object instances in complex environments. Other approaches aim at category-level estimation and reconstruction by leveraging normalized geometric structure priors, but static prior-based reconstruction struggles with substantial intra-class variation. To address these problems, we propose DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field that represents both the general category-wise shape latent features and the intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the field to accurately estimate the 6D pose of each object in the scene. We integrate a multi-modal representation extraction module that extracts object features and semantic masks, enabling end-to-end inference. Moreover, during training, we apply a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's ability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
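The idea of deforming a category template into an observed instance can be illustrated with a toy example. The sketch below is not the DTF-Net architecture: it assumes an analytic unit-sphere template and a hand-written per-axis scaling as the deformation, where the paper uses learned implicit neural fields. The names `template_field`, `deformation`, `instance_field`, and `code` are hypothetical and chosen for illustration only.

```python
import numpy as np

def template_field(points):
    """Toy category template: implicit field of a unit sphere.
    Negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points, axis=-1) - 1.0

def deformation(points, code):
    """Toy deformation field (assumption: per-axis scaling).
    Returns the offset that carries an instance-space point to its
    corresponding point in template space. `code` is a stand-in for
    a learned latent code and holds the instance's per-axis scales."""
    return points / code - points

def instance_field(points, code):
    """Instance field = template field evaluated at the deformed points.
    Its zero level set is the instance surface (here an ellipsoid);
    the values are not true distances for the deformed shape."""
    return template_field(points + deformation(points, code))

# Hypothetical instance: an ellipsoid with semi-axes 2.0, 1.0, 0.5.
code = np.array([2.0, 1.0, 0.5])
surface_points = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
])

# Points on the instance surface evaluate to zero.
values = instance_field(surface_points, code)

# Each instance point maps to a corresponding template point.
template_points = surface_points + deformation(surface_points, code)
```

Because every instance shares the same template, the mapped `template_points` give a correspondence between different instances of a category; in this toy case all three surface points land on the unit sphere.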
Index Terms
- DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field