3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Wu, Jiajun; Xue, Tianfan; Lim, Joseph J.; Tian, Yuandong; Tenenbaum, Joshua B.; Torralba, Antonio; Freeman, William T.

doi:10.1007/s11263-018-1074-6

Published: 21 March 2018

International Journal of Computer Vision, Volume 126, pages 1009–1026 (2018)

Jiajun Wu¹ (ORCID: orcid.org/0000-0002-4176-343X),
Tianfan Xue²,
Joseph J. Lim³,
Yuandong Tian⁴,
Joshua B. Tenenbaum¹,
Antonio Torralba¹ &
William T. Freeman¹,⁵

Abstract

Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations for real images. Previous research tackled this problem by either searching for a 3D shape that best explains 2D annotations, or training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation to connect real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from the heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects, without suffering from the domain difference between real and synthesized images, which is often due to imperfect rendering. Second, we propose a Projection Layer that maps the estimated 3D structure back to 2D. During training, it ensures that 3D-INN predicts 3D structure whose projection is consistent with the 2D annotations of real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide applications in vision, such as image retrieval.
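
To make the role of the Projection Layer concrete, the following is a minimal sketch of a reprojection consistency check: estimated 3D keypoints are projected to 2D and compared with 2D annotations. The abstract does not specify the camera model or the loss, so the perspective projection with a single focal length `f`, the squared-error loss, and all function names here (`rotation_y`, `project`, `reprojection_loss`) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def rotation_y(theta):
    """Rotation matrix about the y-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def project(points_3d, R, t, f):
    """Perspective-project N x 3 object-frame points to N x 2 image coordinates.

    Assumed camera model (not taken from the paper): rotate, translate,
    then divide by depth and scale by the focal length f.
    """
    cam = points_3d @ R.T + t            # object frame -> camera frame
    return f * cam[:, :2] / cam[:, 2:3]  # divide x, y by depth z


def reprojection_loss(points_3d, R, t, f, keypoints_2d):
    """Mean squared distance between projected and annotated 2D keypoints."""
    diff = project(points_3d, R, t, f) - keypoints_2d
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Toy example: a square "skeleton" with four keypoints (hypothetical data).
skeleton = np.array([[-1.0, -1.0, 0.0],
                     [1.0, -1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [-1.0, 1.0, 0.0]])
R_true = rotation_y(np.pi / 6)
t = np.array([0.0, 0.0, 5.0])
f = 2.0

# Pretend these are the 2D annotations of a real image.
annotated = project(skeleton, R_true, t, f)

loss_true = reprojection_loss(skeleton, R_true, t, f, annotated)
loss_wrong = reprojection_loss(skeleton, rotation_y(0.0), t, f, annotated)
```

In this toy setting the loss vanishes for the pose that generated the annotations and is positive for a wrong pose, which is the kind of signal that lets 2D-only annotations constrain a 3D prediction during training.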

Acknowledgements

This work is supported by NSF Robust Intelligence 1212849 and NSF Big Data 1447476 to W.F., NSF Robust Intelligence 1524817 to A.T., ONR MURI N00014-16-1-2007 to J.B.T., Shell Research, the Toyota Research Institute, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216). The authors would like to thank Nvidia for GPU donations. Part of this work was done when Jiajun Wu was an intern at Facebook AI Research, and Tianfan Xue was a graduate student at MIT CSAIL.

Author information

Jiajun Wu and Tianfan Xue contributed equally to this work.

Authors and Affiliations

Massachusetts Institute of Technology, Cambridge, MA, USA
Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba & William T. Freeman
Google Research, Mountain View, CA, USA
Tianfan Xue
University of Southern California, Los Angeles, CA, USA
Joseph J. Lim
Facebook Inc., Menlo Park, CA, USA
Yuandong Tian
Google Research, Cambridge, MA, USA
William T. Freeman

Corresponding author

Correspondence to Jiajun Wu.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.

About this article

Cite this article

Wu, J., Xue, T., Lim, J.J. et al. 3D Interpreter Networks for Viewer-Centered Wireframe Modeling. Int J Comput Vis 126, 1009–1026 (2018). https://doi.org/10.1007/s11263-018-1074-6

Received: 17 June 2017
Accepted: 26 February 2018
Published: 21 March 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11263-018-1074-6
