Research Article
DOI: 10.1145/3581783.3612142

DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Published: 27 October 2023

ABSTRACT

Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but reconstruction based on a static prior struggles with substantial intra-class variations. To solve these problems, we propose DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the field to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
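To make the idea of a deformable template field concrete, the sketch below shows one plausible way such a field could be structured in PyTorch: a shared category-level template SDF network plus an instance-conditioned deformation network that warps query points into the template's space, establishing dense correspondences. This is a minimal illustration in the general spirit of deformed implicit fields described in the abstract; the class names, layer sizes, latent-code dimension, and the SDF-correction term are assumptions, not the authors' implementation.

```python
# Minimal sketch of a deformable template field (illustrative only).
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class DeformableTemplateField(nn.Module):
    """Category-level template SDF plus an instance-conditioned deformation field."""
    def __init__(self, latent_dim=128):
        super().__init__()
        # Shared template: maps a 3D query point to a signed distance for the
        # canonical category shape (one template per category).
        self.template_sdf = MLP(3, 1)
        # Deformation field: maps (query point, instance latent code) to a 3D
        # offset that warps the instance's space toward the template's space,
        # plus a scalar SDF correction for details the warp cannot express.
        self.deform = MLP(3 + latent_dim, 3 + 1)

    def forward(self, query_pts, latent_code):
        # query_pts: (B, N, 3) points in the instance's canonical frame
        # latent_code: (B, latent_dim) per-instance deformation/shape code
        B, N, _ = query_pts.shape
        code = latent_code.unsqueeze(1).expand(B, N, -1)
        out = self.deform(torch.cat([query_pts, code], dim=-1))
        offset, correction = out[..., :3], out[..., 3:]
        # Dense correspondence: the warped point indexes the shared template.
        sdf = self.template_sdf(query_pts + offset) + correction
        return sdf, offset

# Usage sketch: evaluate the field on query points; reconstruction would then
# run marching cubes over a dense grid of predicted SDF values.
field = DeformableTemplateField()
pts = torch.rand(2, 1024, 3) * 2 - 1   # random queries in [-1, 1]^3
code = torch.randn(2, 128)             # hypothetical per-instance latent code
sdf, offset = field(pts, code)
print(sdf.shape, offset.shape)         # (2, 1024, 1), (2, 1024, 3)
```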

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

          Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%

