
Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images

Multimedia Tools and Applications

Abstract

Traditional scene analysis has focused mainly on outdoor scene recognition rather than indoor scene understanding. With the widespread use of depth cameras, however, there is a new opportunity to address the indoor scene recognition problem. In this paper, we propose a multi-task metric multi-kernel learning algorithm that exploits the inter-source similarities and complementarities between color images and depth images for indoor scene recognition. Specifically, our method uses multi-task metric learning to learn a Mahalanobis metric for RGB-D images. Multi-task metric learning extracts the properties common to color images and depth images in order to learn better metrics. The learned metrics are then used to transform features into a corrected feature space that yields a better representation. By exploiting multi-kernel learning, our method leverages multiple feature representations to train a more discriminative classifier. We conduct experiments on the NYU Depth Dataset and the B3DO Dataset to evaluate the effectiveness of our approach. The experimental results demonstrate that the proposed method achieves better indoor scene recognition.
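The two building blocks named in the abstract can be illustrated with a small, generic sketch. It is not the authors' implementation, since the preview does not give their formulation. It assumes a Mahalanobis metric factored as M = LᵀL, so that the metric is equivalent to a linear map x → Lx, and a multi-kernel combination formed as a weighted sum of per-modality RBF kernels. The function names, the random matrices standing in for learned metrics, and the fixed kernel weights are all hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combined_kernel(feats_a, feats_b, Ls, betas, gamma=1.0):
    """Weighted sum of RBF kernels, one per modality.

    Each modality's features are first mapped by its own matrix L,
    which is the same as measuring distances under M = L^T L.
    """
    K = 0.0
    for Xa, Xb, L, beta in zip(feats_a, feats_b, Ls, betas):
        K = K + beta * rbf_kernel(Xa @ L.T, Xb @ L.T, gamma)
    return K

# Toy data: 5 samples with a 4-D "color" and a 3-D "depth" descriptor.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 4))
depth = rng.normal(size=(5, 3))

# Random stand-ins for learned transforms (assumption: not learned here).
L_rgb = rng.normal(size=(4, 4))
L_depth = rng.normal(size=(3, 3))

# Fixed kernel weights (assumption: a real MKL solver would learn these).
K = combined_kernel([rgb, depth], [rgb, depth], [L_rgb, L_depth], betas=[0.6, 0.4])
```

A kernel matrix such as `K` could then be passed to any classifier that accepts precomputed kernels, for example an SVM. In a full multi-kernel learning setup the weights would be optimized jointly with the classifier instead of being fixed by hand.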



Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61432014, in part by the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant IRT13088 and in part by the Shaanxi Innovative Research Team for Key Science and Technology under Grant 2012KCT-02.

Author information

Corresponding author

Correspondence to Xinbo Gao.

About this article

Cite this article

Zheng, Y., Gao, X. Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images. Multimed Tools Appl 76, 4427–4443 (2017). https://doi.org/10.1007/s11042-016-3423-1

  • DOI: https://doi.org/10.1007/s11042-016-3423-1
