SoundSpaces: Audio-Visual Navigation in 3D Environments

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12351)

Abstract

Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf—restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception. Project: http://vision.cs.utexas.edu/projects/audio_visual_navigation.
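The abstract describes the approach only at a high level. As a rough, hedged illustration of what an end-to-end audio-visual navigation policy can look like, the sketch below fuses an egocentric RGB frame and a binaural spectrogram through separate convolutional encoders, runs a GRU over the fused features, and adds actor-critic heads. The PyTorch framing, layer sizes, observation shapes, and class name are illustrative assumptions, not the authors' released architecture; the reinforcement-learning training loop and the SoundSpaces audio rendering are omitted.

    # Hedged sketch of a multi-modal navigation policy (illustrative only, not
    # the authors' released code). Assumes PyTorch, a 128x128 RGB frame, and a
    # 2-channel (left/right) binaural spectrogram as inputs.
    import torch
    import torch.nn as nn

    class AudioVisualPolicy(nn.Module):
        def __init__(self, n_actions=4, hidden=512):
            super().__init__()
            # Convolutional encoder for the egocentric RGB observation.
            self.rgb_enc = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(), nn.LazyLinear(hidden // 2), nn.ReLU())
            # Convolutional encoder for the binaural audio spectrogram.
            self.audio_enc = nn.Sequential(
                nn.Conv2d(2, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(), nn.LazyLinear(hidden // 2), nn.ReLU())
            # Recurrent state carried over the fused audio-visual stream.
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.actor = nn.Linear(hidden, n_actions)   # action logits
            self.critic = nn.Linear(hidden, 1)          # state-value estimate

        def forward(self, rgb, spectrogram, h=None):
            fused = torch.cat([self.rgb_enc(rgb), self.audio_enc(spectrogram)], dim=-1)
            out, h = self.rnn(fused.unsqueeze(1), h)    # one step of the observation stream
            out = out.squeeze(1)
            return self.actor(out), self.critic(out), h

    # Example forward pass on dummy observations (batch size 1).
    policy = AudioVisualPolicy()
    logits, value, h = policy(torch.randn(1, 3, 128, 128), torch.randn(1, 2, 65, 26))

In a setup like this, the actor logits would parameterize a categorical distribution over discrete navigation actions (e.g., move forward, turn left, turn right, stop) and the critic value would feed a policy-gradient update; these specifics are assumptions for illustration.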

C. Chen and U. Jain contributed equally.

U. Jain: Work done as an intern at Facebook AI Research.


Notes

  1. While algorithms could also run with ambisonic inputs, using binaural sound has the advantage of allowing human listeners to interpret our video results (see Supp. video).

  2. Replica has more multi-room trajectories, where audio gives clear cues of room entrances/exits (vs. the open floor plans in Matterport3D). This may be why AG outperforms PG and APG on Replica.


Acknowledgements

UT Austin is supported in part by DARPA Lifelong Learning Machines. We thank Alexander Schwing, Dhruv Batra, Erik Wijmans, Oleksandr Maksymets, Ruohan Gao, and Svetlana Lazebnik for valuable discussions and support with the AI-Habitat platform.

Author information

Corresponding author

Correspondence to Changan Chen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5785 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, C., et al. (2020). SoundSpaces: Audio-Visual Navigation in 3D Environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12351. Springer, Cham. https://doi.org/10.1007/978-3-030-58539-6_2

  • DOI: https://doi.org/10.1007/978-3-030-58539-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58538-9

  • Online ISBN: 978-3-030-58539-6

  • eBook Packages: Computer Science (R0)
