research-article

Automatic Network Architecture Search for RGB-D Semantic Segmentation

Authors:
Wenna Wang, Northwestern Polytechnical University, Xi'an, China (ORCID 0009-0005-1387-1256)
Tao Zhuo, Shandong Artificial Intelligence Institute, Jinan, China (ORCID 0000-0003-4854-8772)
Xiuwei Zhang, Northwestern Polytechnical University, Xi'an, China (ORCID 0000-0001-7230-1476)
Mingjun Sun, Northwestern Polytechnical University, Xi'an, China (ORCID 0009-0008-1361-6185)
Hanlin Yin, Northwestern Polytechnical University, Xi'an, China (ORCID 0000-0002-1086-2873)
Yinghui Xing, Northwestern Polytechnical University, Xi'an, China (ORCID 0000-0001-6021-8261)
Yanning Zhang, Northwestern Polytechnical University, Xi'an, China (ORCID 0000-0002-2977-8057)

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 3777–3786
https://doi.org/10.1145/3581783.3612288

Published: 27 October 2023

ABSTRACT

Recent RGB-D semantic segmentation networks are usually designed by hand. Because manual design is constrained by the effort and time that human experts can invest, the resulting networks may perform poorly in complex scenes. To address this issue, we propose the first Neural Architecture Search (NAS) method that designs an RGB-D semantic segmentation network automatically. The target network consists of an encoder and a decoder. The encoder has two independent branches, one extracting features from the RGB image and the other from the depth image. The decoder fuses these features and generates the final segmentation result. For automatic network design, we construct a grid-like network-level search space combined with a hierarchical cell-level search space. With an effective gradient-based search strategy, we then discover a network structure with hierarchical cell architectures. Extensive experiments on two datasets show that the proposed method outperforms state-of-the-art approaches, achieving an mIoU of 55.1% on the NYU-Depth v2 dataset and 50.3% on the SUN-RGBD dataset.
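
The abstract does not spell out the search procedure or the evaluation code. The following pure-Python sketch is therefore only an illustration of two ideas it mentions. The first is a continuous relaxation of the operation choice, in the style of differentiable architecture search. This is one common way to make a search gradient-based, and it is assumed here rather than confirmed as the authors' formulation. The second is the mean intersection-over-union (mIoU) metric used to report results. The candidate operations and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations acting on a 1-D feature list.
# A real search space would contain convolutions, pooling, and so on.
CANDIDATE_OPS = {
    "identity": lambda x: list(x),
    "scale_half": lambda x: [0.5 * v for v in x],
    "zero": lambda x: [0.0 for _ in x],
}


def mixed_op(x, alphas):
    """Softmax-weighted sum of all candidate operations applied to x.

    alphas holds one architecture parameter per candidate. Because the
    output is a smooth function of alphas, the parameters could be
    trained by gradient descent (the training loop is omitted here).
    """
    weights = softmax(alphas)
    out = [0.0] * len(x)
    for w, op in zip(weights, CANDIDATE_OPS.values()):
        out = [o + w * v for o, v in zip(out, op(x))]
    return out


def discretize(alphas):
    """Return the name of the candidate with the largest parameter."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]


def mean_iou(pred, target, num_classes):
    """Mean IoU over flat label sequences.

    Classes absent from both pred and target are skipped, which is one
    common convention and an assumption of this sketch.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(mixed_op([3.0], [0.0, 0.0, 0.0]))
    print(discretize([0.1, 2.0, -1.0]))
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

With equal architecture parameters, `mixed_op` averages the candidates uniformly. After a search, `discretize` keeps only the highest-weighted operation, which is how a relaxed search space is typically turned back into a concrete network.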

Index Terms

1. Computing methodologies
  1. Artificial intelligence

Published in
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy
Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
grid-like network-level search space
hierarchical cell-level search space
NAS
RGB-D semantic segmentation
search strategy
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%