Abstract
Learning algorithms related to artificial neural networks, and in particular those for Deep Learning, may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradients and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when many hyper-parameters are allowed to vary. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
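To make the terminology concrete, the sketch below is a minimal, illustrative example (not taken from the chapter) of the kind of gradient-based training loop and hyper-parameters the abstract refers to: an initial learning rate, a 1/t decay schedule, a mini-batch size, and a number of training epochs. The toy logistic-regression setup, the variable names, and the particular schedule are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical hyper-parameters of the kind the chapter discusses.
learning_rate_0 = 0.1   # initial learning rate
decay_time = 1000.0     # after roughly this many updates, the rate shrinks as O(1/t)
batch_size = 32         # mini-batch size
n_epochs = 50           # number of passes over the training set

rng = np.random.RandomState(0)

# Toy data: binary classification with d input features.
n, d = 1000, 20
X = rng.randn(n, d)
true_w = rng.randn(d)
y = (X @ true_w + 0.1 * rng.randn(n) > 0).astype(float)

w = np.zeros(d)  # parameters of a logistic-regression model
b = 0.0
t = 0            # update counter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(n_epochs):
    perm = rng.permutation(n)  # shuffle examples each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        p = sigmoid(xb @ w + b)          # forward pass
        err = p - yb                     # gradient of cross-entropy w.r.t. the pre-activation
        grad_w = xb.T @ err / len(idx)   # back-propagated gradient for the weights
        grad_b = err.mean()
        lr = learning_rate_0 * decay_time / max(decay_time, t)  # O(1/t) decay schedule
        w -= lr * grad_w                 # stochastic gradient step
        b -= lr * grad_b
        t += 1
```

Each of the four constants at the top is a hyper-parameter in the sense used by the abstract: it is not learned by the gradient updates themselves and must be chosen, by hand or by search, before or during training.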
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_26
DOI: https://doi.org/10.1007/978-3-642-35289-8_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35288-1
Online ISBN: 978-3-642-35289-8
eBook Packages: Computer Science