Understanding autoencoders with information theoretic concepts
Introduction
Deep neural networks (DNNs) have drawn significant interest from the machine learning community, especially due to their recent empirical success in various applications such as image recognition (Krizhevsky, Sutskever, & Hinton, 2012), speech recognition (Graves, Mohamed, & Hinton, 2013), and natural language processing (Mesnil, He, Deng, & Bengio, 2013). Despite the overwhelming advantages achieved by deep neural networks over classical machine learning models, the theoretical and systematic understanding of deep neural networks remains limited and unsatisfactory. Consequently, deep models themselves are typically regarded as “black boxes” (Alain & Bengio, 2016).
This is an unfortunate terminology that the second author has disputed since the late 1990s (Principe, Euliano, & Lefebvre, 2000). In fact, most neural architectures are homogeneous in terms of processing elements (PEs), e.g., sigmoid nonlinearities. Therefore, no matter whether they are used in the first layer, a middle layer, or the output layer, they always perform the same function: they create ridge functions (Light, 1992) in the space spanned by the previous layer’s outputs, i.e., training only controls the steering of the ridge, while the bias controls the aggregation of the different partitions (Minsky & Papert, 2017; Principe & Chen, 2015). Moreover, it is also possible to provide geometric interpretations of the projections, extending the well-known work of Kolmogorov on optimal filtering in linear spaces (Kolmogorov, 1939). What has been missing is a framework that can assess the quality of the different projections learned during training, beyond the quantification of the “external” error.
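To make the ridge-function view concrete, the following sketch (illustrative code of our own, not taken from the paper) shows that a single sigmoid PE computes a ridge function: its output varies only along the weight direction w and is constant along any direction orthogonal to w, which is why training effectively steers the ridge.

```python
import numpy as np

def sigmoid_pe(x, w, b):
    """A single processing element: the ridge function sigma(w . x + b)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w = np.array([1.0, 2.0])   # weight vector: the ridge's steering direction
b = 0.5                    # bias: shifts the ridge along w
x = np.array([0.3, -0.7])
v = np.array([2.0, -1.0])  # orthogonal to w (w . v = 0)

# The PE output is unchanged when x moves orthogonally to w:
print(np.isclose(sigmoid_pe(x, w, b), sigmoid_pe(x + 3.0 * v, w, b)))  # True
```

Because every sigmoid PE in the network has this form, each layer partitions its input space with ridges built on the previous layer’s outputs.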
More recently, there has been a growing interest in understanding deep neural networks using information theory. Information theoretic learning (ITL) (Principe, 2010) has been successfully applied to various machine learning applications by providing more robust cost or objective functions, but its role can be extended to create a framework that helps optimally design deep learning architectures, as explained in this paper. Recently, Tishby proposed the Information Plane (IP) as an alternative way to understand the role of learning in deep architectures (Shwartz-Ziv & Tishby, 2017). The use of information theoretic ideas is an excellent addition because information theory is essentially a theory of bounds (MacKay, 2003). Entropy and mutual information quantify properties of data, and of the functional transformations applied to data, at a sufficiently abstract level that can lead to optimal performance, as illustrated by Stratonovich’s three variational problems (Stratonovich, 1965). These recent works demonstrate the potential that various information theory concepts hold to open the “black box” of DNNs.
As an application we will concentrate on the design of stacked autoencoders (SAE), a fully unsupervised deep architecture. Autoencoders have a remarkable similarity with a transmission channel (Yu, Emigh, Santana, & Príncipe, 2017) and so they are a good choice to evaluate the appropriateness of using ITL in understanding the architectures and the dynamics of learning in DNNs. We are interested in unveiling the role of the layer-wise mutual information during the autoencoder training phase, and investigating how its dynamics through learning relate to different information theoretic concepts (e.g., different data processing inequalities). We propose to do this for arbitrary topologies using empirical estimators of Renyi’s mutual information, as explained in Giraldo, Rao, and Principe (2015). Moreover, we are also interested in how to use our observations to benefit the design and implementation of DNNs, such as optimizing a neural network topology or training a neural network in a feedforward greedy-layer manner, as an alternative to the standard backpropagation.
The rest of this paper is organized as follows. In Section 2, we briefly introduce background and related works, including a review of the geometric projection view of multilayer systems, elements of Renyi’s entropy and their matrix-based functional as well as previous works on understanding DNNs. Following this, we suggest three fundamental properties associated with the layer-wise mutual information and also give our reasoning in Section 3. We then carry out experiments on three real-world datasets to validate these properties in Section 4. An experimental interpretation is also presented. We conclude this paper and present our insights in Section 5.
To summarize, our main contributions are threefold:
- Instead of using the basic Shannon or Renyi definitions of mutual information, which require precise PDF estimation in high-dimensional spaces, we suggest using the recently proposed matrix-based Renyi’s α-entropy functional (Giraldo et al., 2015) to estimate information quantities in DNNs. We demonstrate that this class of estimators can be used in high dimensions and preserves the theoretical expectations of the data processing inequality. The new estimators compute entropy and mutual information in reproducing kernel Hilbert spaces (RKHS) and avoid explicit PDF estimation, making information flow estimation simple and practical, as required to analyze the learning dynamics of DNNs.
- Benefiting from this simple yet precise estimator, we suggest three fundamental properties associated with information flow in SAEs, which provide insights into the dynamics of DNNs from an information theory perspective.
- The new estimators and our observations also have practical implications for architecture design/selection, generalizability, and other critical issues in the deep learning community. Moreover, the proposed methodologies can be extended to DNN architectures much more complex than MLPs or SAEs, such as CNNs (Yu, Wickstrøm, Jenssen, & Principe, 2018).
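As a hedged sketch of how such a matrix-based estimator can be implemented (the function names, the Gaussian kernel choice, and the bandwidth below are our own illustrative assumptions; see Giraldo et al., 2015 for the formal definitions): entropy is computed from the eigenvalues of a trace-normalized Gram matrix, joint entropy from the Hadamard product of two Gram matrices, and mutual information from their combination.

```python
import numpy as np

def _normalized_gram(X, sigma):
    """Gaussian-kernel Gram matrix of the samples, normalized to unit trace."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K / np.trace(K)

def matrix_renyi_entropy(X, alpha=1.01, sigma=2.0):
    """Matrix-based Renyi's alpha-entropy: log2(sum_i lambda_i^alpha) / (1 - alpha)."""
    lam = np.clip(np.linalg.eigvalsh(_normalized_gram(X, sigma)), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def matrix_renyi_joint_entropy(X, Y, alpha=1.01, sigma=2.0):
    """Joint entropy via the Hadamard product of the two normalized Gram matrices."""
    AB = _normalized_gram(X, sigma) * _normalized_gram(Y, sigma)
    lam = np.clip(np.linalg.eigvalsh(AB / np.trace(AB)), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def matrix_renyi_mi(X, Y, alpha=1.01, sigma=2.0):
    """Matrix-based mutual information I(X; Y) = S(X) + S(Y) - S(X, Y)."""
    return (matrix_renyi_entropy(X, alpha, sigma)
            + matrix_renyi_entropy(Y, alpha, sigma)
            - matrix_renyi_joint_entropy(X, Y, alpha, sigma))
```

Because only the eigenvalues of n × n Gram matrices are needed, the cost depends on the number of samples n rather than on the layer dimensionality, which is what makes this estimator practical for wide hidden layers.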
The abbreviations and variables mentioned in this paper are summarized in Table 1.
Background and related works
In this section, we start with a review of the geometric interpretation of multilayer system mappings as well as the basic autoencoder that provides a geometric underpinning for the IP quantification. The geometric interpretation describes the basic operation of pairwise layer projections, which agrees with the pairwise mutual information in IP, hinting that we are effectively quantifying the role of projections using information theoretic quantities and opening the “black box”. After that, we
The data processing inequality (DPI) and its extensions to stacked autoencoders (SAEs)
Before systematically interpreting SAE operation and learning using information theoretic concepts, let us recall the basic learning mechanism (i.e., backpropagation) in any feedforward DNN (including SAEs, MLPs, etc.): the input signals are propagated from the input layer to the output layer, and the errors are back-propagated in the reverse direction, from the output layer to the input layer, through the adjoint or dual network (Kuroe, Nakai, & Mori, 1993; Krawczak, 2013, Chapter 7). Both
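The first kind of DPI the paper refers to can be sketched as follows (the layer notation here is our own assumption, not taken verbatim from the text): since the forward pass of an SAE processes each representation only from the previous layer, the hidden layers form a Markov chain in the input-to-code direction, and the data processing inequality then bounds the layer-wise mutual information monotonically.

```latex
% Sketch under assumed notation: X is the input and T_1, ..., T_k are the
% successive encoder layers of the SAE, so the forward pass forms the chain
%   X \to T_1 \to T_2 \to \cdots \to T_k .
% The data processing inequality then gives
\[
  I(X; T_1) \;\ge\; I(X; T_2) \;\ge\; \cdots \;\ge\; I(X; T_k),
\]
% i.e., successive layers cannot create new information about the input X.
```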
Experiments
This section presents two sets of experiments to corroborate the fundamental properties of Section 3 directly from data, using the nonparametric statistical estimators put forth in this work. Specifically, Section 4.1 validates the first type of DPI and also demonstrates the two IPs defined in Section 3.2, illustrating the existence of a bifurcation point that is controlled by the given data, whereas Section 4.2 validates the second type of DPI raised in Section 3.1. Note that we also give a
Conclusions
In this paper, we analyzed DNN learning from a joint geometric and information theoretic perspective, emphasizing the role that pairwise mutual information plays in understanding DNNs. As an application of this idea, three fundamental properties were presented, concentrating on stacked autoencoders (SAEs). The experiments on three real-world datasets validated the data processing inequality associated with layer-wise mutual information and the existence of a bifurcation point associated with
Acknowledgments
The authors would like to express their sincere gratitude to Dr. Luis Gonzalo Sánchez Giraldo from the University of Miami and Dr. Robert Jenssen from the UiT — The Arctic University of Norway for their careful reading of our manuscript and many insightful comments and suggestions. The authors also thank the anonymous reviewers for their very helpful suggestions, which led to substantial improvements of the paper. This work is supported in part by the U.S. Office of Naval Research under Grant
References (96)
- et al.
Kernel-based dimensionality reduction using Renyi’s α-entropy measures of similarity
Neurocomputing
(2017) - et al.
Neural networks and principal component analysis: Learning from examples without local minima
Neural Networks
(1989) - et al.
Intrinsic dimension estimation: Advances and open problems
Information Sciences
(2016) - et al.
DANCo: An intrinsic dimensionality estimator exploiting angle and norm concentration
Pattern Recognition
(2014) - et al.
Flow of Renyi information in deep neural networks
- et al.
Explaining nonlinear classification decisions with deep taylor decomposition
Pattern Recognition
(2017) - Achille, A., & Soatto, S. Emergence of invariance and disentangling in deep representations, arXiv preprint...
An information-theoretic route from generalization in expectation to generalization in probability
- Alain, G., & Bengio, Y. Understanding intermediate layers using linear classifier probes, arXiv preprint...
- et al.
Modeling stylized character expressions via deep learning
Why regularized auto-encoders learn sparse representation?
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
PloS One
Better mixing via deep representations
Infinitely divisible matrices
American Mathematical Monthly
Why deep learning works: A manifold disentanglement perspective
IEEE Transactions on Neural Networks and Learning Systems
System parameter identification: information criteria and algorithms
Elements of information theory
A class of measures of informativity of observation channels
Periodica Mathematica Hungarica
The cascade-correlation learning architecture
Efficient estimation of mutual information for strongly dependent variables
Measures of entropy from data using infinitely divisible kernels
IEEE Transactions on Information Theory
Deep sparse rectifier neural networks
Speech recognition with deep recurrent neural networks
Neural networks: a comprehensive foundation
Neural networks and learning machines, vol. 3
Adaptive filter theory (5th edition)
Deep residual learning for image recognition
beta-VAE: Learning basic visual concepts with a constrained variational framework
Reducing the dimensionality of data with neural networks
Science
Batch normalization: Accelerating deep network training by reducing internal covariate shift
On the notion (s) of duality for Markov processes
Probability Surveys
Trace optimization and eigenproblems in dimension reduction methods
Numerical Linear Algebra with Applications
Estimating mixture entropy with pairwise distances
Entropy
Sur l’interpolation et extrapolation des suites stationnaires
CR Academy of Science
Estimating mutual information
Physical Review E
Multilayer neural networks: a generalized net perspective, vol. 478
Imagenet classification with deep convolutional neural networks
A learning method of nonlinear mappings by neural networks with considering their derivatives
Gradient-based learning applied to document recognition
Proceedings of the IEEE
Maximum likelihood estimation of intrinsic dimension
Interacting particle systems, vol. 276
Ridge functions, sigmoidal functions and neural networks
Approximation theory VII
Why does deep and cheap learning work so well?
Journal of Statistical Physics
Self-organization in a perceptual network
Computer