Abstract
Tasks that require the synchronization of perception and action are incredibly hard and pose a fundamental challenge to the fields of machine learning and computer vision. One important example of such a task is the problem of performing visual recognition through a sequence of controllable fixations; this requires jointly deciding what inference to perform from fixations and where to perform these fixations. While these two problems are challenging when addressed separately, they become even more formidable if solved jointly. Recently, a restricted Boltzmann machine (RBM) model was proposed that could learn meaningful fixation policies and achieve good recognition performance. In this paper, we propose an alternative approach based on a feed-forward, auto-regressive architecture, which permits exact calculation of training gradients (given the fixation sequence), unlike for the RBM model. On a problem of facial expression recognition, we demonstrate the improvement gained by this alternative approach. Additionally, we investigate several variations of the model in order to shed some light on successful strategies for fixation-based recognition.
Notes
This is done by setting \(\mathbf {z}\left( i_k,j_k\right) = \mathrm{sigmoid}\left( \bar{ \mathbf {z}}\left( i_k,j_k\right) \right) \), and learning the unconstrained \(\bar{ \mathbf {z}}\left( i_k,j_k\right) \) vectors instead. We also use a learning rate \(100\) times larger than learning the other parameters.
The retinal transformation covered a patch of \(44\times 44\) pixels, without using a lower resolution periphery. Hence, the total number of pixels is \(1936\).
References
Bazzani, L., Freitas, N., Larochelle, H., Murino, V., & Ting, J.-A. (2011). Learning attentional policies for tracking and recognition in video with deep networks. In Proceedings of the 28th international conference on machine learning (ICML 2011) (pp. 937–944). ACM.
Butko, N. J., & Movellan, J. R. (2010). Infomax control of eye movements. IEEE Transactions on Autonomous Mental Development, 2(2), 91–107.
Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., & Hu, S.-M. (2011). Global contrast based salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011 (pp. 409–416). IEEE.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition. CVPR 2005 (Vol. 1, pp. 886–893). IEEE.
David, G. (2004). Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Denil, M., Bazzani, L., Larochelle, H., & de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8), 2151–2184.
Erez, T., Tramper, J. J., Smart, W. D., & Stan CAM Gielen. (2011). A pomdp model of eye-hand coordination. In AAAI.
Fazl, A., Grossberg, S., & Mingolla, E. (2009). View-invariant object category learning, recognition, and search: How spatial and object attention are coordinated using surface-based attentional shrouds. Cognitive psychology, 58(1), 1–48.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV).
Kanan, C., & Cottrell, G. (2010) Robust classification of objects, faces, and flowers using natural image statistics. In CVPR.
Krause, A., & Ong, C. S. (2011). Contextual gaussian process bandit optimization. In NIPS (pp. 2447–2455).
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1106–1114.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann machines. In Proceedings of the 25th international conference on machine learning (pp. 536–543). ACM.
Larochelle, H., & Hinton, G. E. (2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in neural information processing systems (pp. 1243–1251).
Larochelle, H., & Murray, I. (2011). The neural autoregressive distribution estimator. Artificial Intelligence and Statistics (AISTATS), 15, 29–37.
Larochelle, H., & Lauly, S. (2012). A neural autoregressive topic model. Advances in Neural Information Processing Systems, 25, 2717–2725.
Lazebnik, S. (2006). Cordelia, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Mathe, S., & Sminchisescu, C. (2013). Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in neural information processing systems (pp. 1923–1931, 2013).
Nair, V., & Hinton, G. E. (2010) Rectified linear units improve restricted boltzmann machines. In ICML.
Najemnik, J., & Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434(7031), 387–391.
Perazzi, F., Krahenbuhl, P., Pritch, Y., & Hornung, A. (2012). Saliency filters: Contrast based filtering for salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012 (pp. 733–740). IEEE.
Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on machine learning (ICML 2011).
Schmidhuber, J., & Huber, R. (1991). Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02), 125–134.
Southall, J. P. C. (1962). Helmholtzs treatise on physiological optics. vol. 2: The sensation of vision, trans. J. P. C. Southall. (translated from the third german edition).
Susskind, J. M., Anderson, A. K., & Hinton, G. E. (2010). The toronto face database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep.
Uria, B., Murray, I., & Larochelle, H. (2013). Rnade: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2175–2183.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (ICML 2008) (pp. 1096–1103). ACM.
Yang, J., Yu., K., & Gong, Y. (2009). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.
Acknowledgments
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada, the National Natural Science Foundation under Grants NNSF-61171118 and the Ministry of Education under Grants SRFDP-20110002110057 of China.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Marc’Aurelio Ranzato, Geoffrey E. Hinton, and Yann LeCun.
Rights and permissions
About this article
Cite this article
Zheng, Y., Zemel, R.S., Zhang, YJ. et al. A Neural Autoregressive Approach to Attention-based Recognition. Int J Comput Vis 113, 67–79 (2015). https://doi.org/10.1007/s11263-014-0765-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-014-0765-x