Abstract
Traditional scene analysis has focused mainly on outdoor scene recognition rather than indoor scene understanding. However, the widespread use of depth cameras offers a new opportunity to address the indoor scene recognition problem. In this paper, we propose a multi-task metric multi-kernel learning algorithm that exploits the inter-source similarities and complementarities between color images and depth images for indoor scene recognition. Specifically, our method uses multi-task metric learning to learn a Mahalanobis metric for RGB-D images. Multi-task metric learning extracts the properties common to color images and depth images in order to learn better metrics. The learned metrics are then used to transform the features into a corrected feature space that yields a better representation. By exploiting multi-kernel learning, our method leverages multiple feature representations to train a more discriminative classifier. We conduct experiments on the NYU Depth Dataset and the B3DO Dataset to evaluate the effectiveness of our approach. The experimental results demonstrate that the proposed method leads to better indoor scene recognition.
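The pipeline outlined in the abstract (a Mahalanobis metric per modality, a feature transform derived from it, and a weighted combination of kernels) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each modality's metric is the sum of a shared matrix and a modality-specific matrix, that both modalities have the same feature dimension, and that the kernels are RBF kernels mixed with fixed weights. The matrices here are random placeholders rather than learned metrics, and every name (`psd_factor`, `M_shared`, `combine_kernels`, the weights 0.6 and 0.4) is hypothetical.

```python
import numpy as np

def psd_factor(M):
    """Return L such that L.T @ L equals the symmetric PSD matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return (eigvecs * np.sqrt(eigvals)).T

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))

def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices (weights are normalized to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

rng = np.random.default_rng(0)
dim, n = 4, 6  # toy feature dimension and sample count (assumptions)

def random_psd(scale):
    """Random PSD matrix standing in for a learned metric component."""
    A = scale * rng.standard_normal((dim, dim))
    return A.T @ A

# Assumed decomposition: shared part plus a modality-specific part.
M_shared = random_psd(1.0)
M_rgb = M_shared + random_psd(0.3)
M_depth = M_shared + random_psd(0.3)

# Placeholder features for the color and depth modalities.
X_rgb = rng.standard_normal((n, dim))
X_depth = rng.standard_normal((n, dim))

# Transform features so Euclidean distance matches the Mahalanobis distance.
L_rgb, L_depth = psd_factor(M_rgb), psd_factor(M_depth)
Z_rgb, Z_depth = X_rgb @ L_rgb.T, X_depth @ L_depth.T

# One kernel per transformed modality, mixed with fixed illustrative weights.
K = combine_kernels([rbf_kernel(Z_rgb, Z_rgb), rbf_kernel(Z_depth, Z_depth)], [0.6, 0.4])
```

In this sketch the combined matrix `K` could be handed to any classifier that accepts a precomputed kernel. In an actual multi-kernel learning setup the kernel weights would be optimized together with the classifier rather than fixed by hand.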
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61432014, in part by the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant IRT13088 and in part by the Shaanxi Innovative Research Team for Key Science and Technology under Grant 2012KCT-02.
Cite this article
Zheng, Y., Gao, X. Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images. Multimed Tools Appl 76, 4427–4443 (2017). https://doi.org/10.1007/s11042-016-3423-1
DOI: https://doi.org/10.1007/s11042-016-3423-1