Neural Networks, Volume 117, September 2019, Pages 104-123

Understanding autoencoders with information theoretic concepts

https://doi.org/10.1016/j.neunet.2019.05.003

Highlights

  • We illustrate an advanced information theoretic methodology to understand the dynamics of learning and the design of autoencoders.

  • Using the recently proposed matrix-based Renyi’s α-entropy functional, we suggest and observe three fundamental properties associated with information flow in autoencoders.

  • Our observations have a direct impact on the optimal design of autoencoders, the design of alternative feedforward training methods, and even the problem of generalization.

Abstract

Despite their great success in practical applications, there is still a lack of theoretical and systematic methods to analyze deep neural networks. In this paper, we illustrate an advanced information theoretic methodology to understand the dynamics of learning and the design of autoencoders, a special type of deep learning architecture that resembles a communication channel. By generalizing the information plane to any cost function, and by inspecting the roles and dynamics of different layers using layer-wise information quantities, we emphasize the role that mutual information plays in quantifying learning from data. We further suggest, and experimentally validate for mean square error training, three fundamental properties regarding the layer-wise flow of information and the intrinsic dimensionality of the bottleneck layer, using respectively the data processing inequality and the identification of a bifurcation point in the information plane that is controlled by the given data. Our observations have a direct impact on the optimal design of autoencoders, the design of alternative feedforward training methods, and even the problem of generalization.

Introduction

Deep neural networks (DNNs) have drawn significant interest from the machine learning community, especially due to their recent empirical success in various applications such as image recognition (Krizhevsky, Sutskever, & Hinton, 2012), speech recognition (Graves, Mohamed, & Hinton, 2013), and natural language processing (Mesnil, He, Deng, & Bengio, 2013). Despite the overwhelming advantages achieved by deep neural networks over classical machine learning models, the theoretical and systematic understanding of deep neural networks remains limited and unsatisfactory. Consequently, deep models themselves are typically regarded as “black boxes” (Alain & Bengio, 2016).

This is an unfortunate terminology that the second author has disputed since the late 90s (Principe, Euliano, & Lefebvre, 2000). In fact, most neural architectures are homogeneous in terms of processing elements (PEs), e.g., sigmoid nonlinearities. Therefore, no matter whether they are used in the first layer, a middle layer, or the output layer, they always perform the same function: they create ridge functions (Light, 1992) in the space spanned by the previous layer’s outputs; i.e., training only controls the steering of the ridge, while the bias controls the aggregation of the different partitions (Minsky and Papert, 2017; Principe and Chen, 2015). Moreover, it is also possible to provide geometric interpretations of the projections, extending well-known work of Kolmogorov on optimal filtering in linear spaces (Kolmogorov, 1939). What has been missing is a framework that can assess the quality of the different projections learned during training, beyond the quantification of the “external” error.
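To make this geometric picture concrete, the standard form of a ridge function as computed by a single PE can be written as follows (our notation, not taken verbatim from the paper): each unit applies its nonlinearity to a one-dimensional projection of the previous layer's output,

$$g(\mathbf{z}) \;=\; \varphi\!\left(\mathbf{w}^{\top}\mathbf{z} + b\right),$$

so $g$ is constant along every direction orthogonal to $\mathbf{w}$. Training steers the ridge by rotating $\mathbf{w}$, while the bias $b$ translates it, which is precisely the steering/aggregation role described above.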

More recently, there has been growing interest in understanding deep neural networks using information theory. Information theoretic learning (ITL) (Principe, 2010) has been successfully applied to various machine learning applications by providing more robust cost or objective functions, but its role can be extended to create a framework that helps optimally design deep learning architectures, as explained in this paper. Recently, Tishby proposed the Information Plane (IP) as an alternative way to understand the role of learning in deep architectures (Shwartz-Ziv & Tishby, 2017). The use of information theoretic ideas is an excellent addition because information theory is essentially a theory of bounds (MacKay, 2003). Entropy and mutual information quantify properties of data, and of the functional transformations applied to data, at a sufficiently abstract level that they can lead to optimal performance, as illustrated by Stratonovich’s three variational problems (Stratonovich, 1965). These recent works demonstrate the potential that various information theory concepts hold for opening the “black box” of DNNs.
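For readers unfamiliar with the IP, a brief formalization may help; this follows the standard definition of Shwartz-Ziv and Tishby (2017), with symbols of our choosing. A layer's representation $T$ is plotted as the single point

$$\mathrm{IP}(T) \;=\; \bigl(I(X;T),\; I(T;Y)\bigr),$$

where $X$ is the input, $Y$ is the desired output, and $I(\cdot\,;\cdot)$ denotes mutual information; tracking these points across training epochs traces each layer's learning trajectory. For an autoencoder trained to reconstruct its input, $Y$ coincides with $X$.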

As an application we concentrate on the design of stacked autoencoders (SAEs), a fully unsupervised deep architecture. Autoencoders have a remarkable similarity to a transmission channel (Yu, Emigh, Santana, & Príncipe, 2017), so they are a good choice for evaluating the appropriateness of ITL for understanding the architectures and the learning dynamics of DNNs. We are interested in unveiling the role of the layer-wise mutual information during the autoencoder training phase, and in investigating how its dynamics through learning relate to different information theoretic concepts (e.g., different data processing inequalities). We propose to do this for arbitrary topologies using empirical estimators of Renyi’s mutual information, as explained in Giraldo, Rao, and Principe (2015). Moreover, we are also interested in how our observations can benefit the design and implementation of DNNs, such as optimizing a neural network topology or training a neural network in a greedy layer-wise feedforward manner, as an alternative to standard backpropagation.

The rest of this paper is organized as follows. In Section 2, we briefly introduce background and related works, including a review of the geometric projection view of multilayer systems, elements of Renyi’s entropy and its matrix-based functional, as well as previous works on understanding DNNs. Following this, we suggest three fundamental properties associated with the layer-wise mutual information and give our reasoning in Section 3. We then carry out experiments on three real-world datasets to validate these properties in Section 4. An experimental interpretation is also presented. We conclude this paper and present our insights in Section 5.

To summarize, our main contributions are threefold:

  • Instead of using the basic Shannon or Renyi definitions of mutual information, which require precise PDF estimation in high-dimensional spaces, we suggest using the recently proposed matrix-based Renyi’s α-entropy functional (Giraldo et al., 2015) to estimate information quantities in DNNs. We demonstrate that this class of estimators can be used in high dimensions (e.g., 1000) and preserves the theoretical expectations of the data processing inequality. The new estimators compute entropy and mutual information in reproducing kernel Hilbert spaces (RKHS) and avoid explicit PDF estimation, making information flow estimation simple and practical, as required to analyze the learning dynamics of DNNs (see the sketch after this list).

  • Benefiting from this simple yet precise estimator, we suggest three fundamental properties associated with information flow in SAEs, which provide insights into the dynamics of DNNs from an information theory perspective.

  • The new estimators and our observations also have practical implications for architecture design/selection, generalization, and other critical issues in the deep learning community. Moreover, the proposed methodologies can be extended to DNN architectures much more complex than MLPs or SAEs, such as CNNs (Yu, Wickstrøm, Jenssen and Principe, 2018).
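As a concrete illustration of the estimator named in the first bullet, the following is a minimal NumPy sketch of the matrix-based Renyi’s α-entropy functional and the mutual information derived from it, written from our reading of Giraldo et al. (2015); the function names, the Gaussian kernel choice, and the kernel widths are our own assumptions, not the authors' released code.

```python
import numpy as np

def gram_matrix(x, sigma):
    """Gaussian Gram matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def renyi_entropy(K, alpha=1.01):
    """Matrix-based Renyi alpha-entropy: S_alpha = log2(sum_i lambda_i^alpha) / (1 - alpha),
    where lambda_i are the eigenvalues of the trace-normalized Gram matrix A = K / tr(K)."""
    A = K / np.trace(K)
    eigvals = np.linalg.eigvalsh(A)      # A is symmetric PSD, so eigvalsh applies
    eigvals = eigvals[eigvals > 1e-12]   # discard numerical zeros
    return np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha)

def renyi_mutual_information(Kx, Kt, alpha=1.01):
    """I_alpha(X;T) = S_alpha(X) + S_alpha(T) - S_alpha(X,T); the joint entropy
    uses the Hadamard (element-wise) product of the two Gram matrices."""
    return (renyi_entropy(Kx, alpha) + renyi_entropy(Kt, alpha)
            - renyi_entropy(Kx * Kt, alpha))

# Toy usage: mutual information between a minibatch X and layer activations T.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))              # 128 samples, 10 input features
T = np.tanh(X @ rng.normal(size=(10, 5)))   # stand-in hidden-layer activations
print(renyi_mutual_information(gram_matrix(X, 1.0), gram_matrix(T, 1.0)))
```

In practice one would evaluate renyi_mutual_information on a minibatch of inputs and the corresponding activations of each layer, once per epoch, to trace the IP trajectories discussed above; note that only Gram matrices over samples are needed, never an explicit PDF, which is what makes the estimator usable in high dimensions.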

The abbreviations and variables mentioned in this paper are summarized in Table 1.

Section snippets

Background and related works

In this section, we start with a review of the geometric interpretation of multilayer system mappings as well as the basic autoencoder that provides a geometric underpinning for the IP quantification. The geometric interpretation describes the basic operation of pairwise layer projections, which agrees with the pairwise mutual information in IP, hinting that we are effectively quantifying the role of projections using information theoretic quantities and opening the “black box”. After that, we

The data processing inequality (DPI) and its extensions to stacked autoencoders (SAEs)

Before systematically interpreting SAE operation and learning using information theoretic concepts, let us recall the basic learning mechanism (i.e., backpropagation) in any feedforward DNN (including SAEs, MLPs, etc.): the input signals are propagated from the input layer to the output layer, and the errors are back-propagated in the reverse direction, from the output layer to the input layer, through the adjoint or dual network (Kuroe, Nakai, & Mori, 1993) (Krawczak, 2013, Chapter 7). Both
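To state the forward-pass DPI concretely (a hedged sketch in our own notation, since the full section is abridged here): because each layer's output is a deterministic function of the previous layer's output, the encoder of an SAE forms a Markov chain $X \to Z_1 \to \cdots \to Z_k$, and the data processing inequality yields

$$I(X; Z_1) \;\ge\; I(X; Z_2) \;\ge\; \cdots \;\ge\; I(X; Z_k),$$

i.e., information about the input can only be lost, never created, as signals propagate toward the bottleneck layer.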

Experiments

This section presents two sets of experiments to corroborate the fundamental properties of Section 3 directly from data, using the nonparametric statistical estimators put forth in this work. Specifically, Section 4.1 validates the first type of DPI and also demonstrates the two IPs defined in Section 3.2, to illustrate the existence of a bifurcation point that is controlled by the given data, whereas Section 4.2 validates the second type of DPI raised in Section 3.1. Note that we also give a

Conclusions

In this paper, we analyzed DNN learning from a joint geometric and information theoretic perspective, emphasizing the role that pair-wise mutual information plays in understanding DNNs. As an application of this idea, three fundamental properties are presented, concentrating on stacked autoencoders (SAEs). The experiments on three real-world datasets validated the data processing inequality associated with layer-wise mutual information and the existence of a bifurcation point associated with

Acknowledgments

The authors would like to express their sincere gratitude to Dr. Luis Gonzalo Sánchez Giraldo from the University of Miami and Dr. Robert Jenssen from the UiT — The Arctic University of Norway for their careful reading of our manuscript and many insightful comments and suggestions. The authors also thank the anonymous reviewers for their very helpful suggestions, which led to substantial improvements of the paper. This work is supported in part by the U.S. Office of Naval Research under Grant

References (96)

  • Arpit, D., et al. Why regularized auto-encoders learn sparse representation?
  • Bach, S., et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE (2015).
  • Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation, arXiv preprint...
  • Bengio, Y., et al. Better mixing via deep representations.
  • Bhatia, R. Infinitely divisible matrices. American Mathematical Monthly (2006).
  • Brahma, P. P., et al. Why deep learning works: A manifold disentanglement perspective. IEEE Transactions on Neural Networks and Learning Systems (2016).
  • Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., & Lerchner, A. Understanding...
  • Chen, B., et al. System parameter identification: Information criteria and algorithms (2013).
  • Cover, T. M., et al. Elements of information theory (2012).
  • Csiszár, I. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica (1972).
  • Fahlman, S. E., et al. The cascade-correlation learning architecture.
  • Gao, S., et al. Efficient estimation of mutual information for strongly dependent variables.
  • Giraldo, L. G., & Principe, J. C. Rate-distortion auto-encoders, arXiv preprint...
  • Giraldo, L. G. S., et al. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory (2015).
  • Glorot, X., et al. Deep sparse rectifier neural networks.
  • Graves, A., et al. Speech recognition with deep recurrent neural networks.
  • Haykin, S. Neural networks: A comprehensive foundation (1994).
  • Haykin, S. S. Neural networks and learning machines, vol. 3 (2009).
  • Haykin, S. S. Adaptive filter theory (5th ed.) (2014).
  • He, K., et al. Deep residual learning for image recognition.
  • Higgins, I., et al. Beta-VAE: Learning basic visual concepts with a constrained variational framework.
  • Hinton, G. E., et al. Reducing the dimensionality of data with neural networks. Science (2006).
  • Ioffe, S., et al. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
  • Jansen, S., et al. On the notion(s) of duality for Markov processes. Probability Surveys (2014).
  • Khadivi, P., Tandon, R., & Ramakrishnan, N. Flow of information in feed-forward deep neural networks, arXiv preprint...
  • Kokiopoulou, E., et al. Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications (2011).
  • Kolchinsky, A., et al. Estimating mixture entropy with pairwise distances. Entropy (2017).
  • Kolmogorov, A. N. Sur l'interpolation et extrapolation des suites stationnaires. CR Academy of Science (1939).
  • Kraskov, A., et al. Estimating mutual information. Physical Review E (2004).
  • Krawczak, M. Multilayer neural networks: A generalized net perspective, vol. 478 (2013).
  • Krizhevsky, A., et al. ImageNet classification with deep convolutional neural networks.
  • Kuroe, Y., et al. A learning method of nonlinear mappings by neural networks with considering their derivatives.
  • LeCun, Y., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998).
  • Levina, E., et al. Maximum likelihood estimation of intrinsic dimension.
  • Liggett, T. M. Interacting particle systems, vol. 276 (2012).
  • Light, W. Ridge functions, sigmoidal functions and neural networks. Approximation Theory VII (1992).
  • Lin, H. W., et al. Why does deep and cheap learning work so well? Journal of Statistical Physics (2017).
  • Linsker, R. Self-organization in a perceptual network. Computer (1988).