Abstract

Deep learning has achieved remarkable success in challenging tasks such as generating images from natural language or engaging in lengthy conversations with humans. This success in practice stems from the ability to train massive neural networks on massive datasets. This thesis studies the theoretical foundations of the simplest architecture, the deep (feedforward) neural network, with a particular emphasis on the role of width.

We first focus on a simple model of finite-width neural networks to study generalization, a central question in machine learning and statistical learning theory. We study the expected generalization error of a Gaussian random features model in terms of the number of features, the number of data points, the kernel it approximates, and the input distribution. Our formulas closely match numerical experiments.

Next, we explore another simplification of finite-width neural networks to study their training dynamics: we assume a linear activation function, which results in a linear predictor. The training dynamics nevertheless remain non-trivial. In particular, the loss function is non-convex: the orthogonal symmetry gives rise to manifolds of saddle points at various loss levels. These saddle points exhibit a particular arrangement, wherein the escape direction of one saddle channels the trajectory towards the next. By gluing the local trajectories between saddles, we describe a so-called saddle-to-saddle dynamics that provably arises for very small initializations.

To study finite-width neural networks without resorting to a simplified model, we shift our focus to the structure of the network parameterization and the permutation symmetry among hidden neurons. We identify a neuron-splitting technique that maps a critical point of a network to a manifold of symmetry-induced critical points of a wider network. By considering all possible neuron partitions and their permutations, we establish the precise scaling law for the number of critical manifolds. This scaling law behaves as $e^{c(\alpha)} m^m$ for large $m$, where $m$ is the width of the wider network and $\alpha$ is the shrinkage factor, i.e., the ratio of the number of distinct neurons to $m$. Notably, the maximum of $c(\alpha)$ is attained at $\alpha^* = \frac{1}{2 \log(2)}$; hence $\alpha^*$ is the shrinkage factor that induces the most numerous symmetry-induced critical manifolds.

We then apply this scaling law to overparameterized networks. The key question is: can we give a rule of thumb for how much overparameterization is needed to ensure reliable convergence to a zero-loss solution? Our approach is based on studying the geometry and topology of the zero-loss solutions of overparameterized neural networks. We prove that \textit{all} zero-loss solution manifolds are identical up to neuron splitting, zero-neuron addition, and permutation for input distributions with full support. Additionally, we give the scaling law for the number of zero-loss manifolds. The ratio between the two scaling laws yields a measure of landscape complexity that decays with overparameterization. We observe that the complexity decreases rapidly until the overparameterization factor reaches approximately $2\log(2)$, beyond which the complexity becomes smaller than one. Overall, we recommend using an overparameterization factor of at least $2$ to $4$ to ensure reliable convergence to a zero-loss solution.
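To make the neuron-splitting map concrete, consider an illustration that is not spelled out in the abstract itself (the notation here is ours): a one-hidden-layer network $f(x) = \sum_{i=1}^{m^*} a_i \, \sigma(w_i^\top x)$ with $m^*$ distinct hidden neurons. Splitting the first neuron into two copies that share the incoming weight $w_1$ and divide the outgoing weight $a_1$ as $\lambda a_1$ and $(1-\lambda) a_1$ gives, for every $\lambda$,

$$\lambda a_1 \, \sigma(w_1^\top x) + (1-\lambda) a_1 \, \sigma(w_1^\top x) + \sum_{i=2}^{m^*} a_i \, \sigma(w_i^\top x) \;=\; f(x),$$

so the wider network of width $m^* + 1$ computes exactly the same function for every choice of $\lambda$. Combined with permutations of the hidden neurons, this is the mechanism by which a single critical point of a narrow network gives rise to whole manifolds of symmetry-induced critical points of a wider network.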
