Abstract

Deep learning has achieved remarkable success in challenging tasks such as generating images from natural language or engaging in lengthy conversations with humans. This success in practice stems from the ability to train massive neural networks on massive datasets. This thesis studies the theoretical foundations of the simplest architecture, the deep (feedforward) neural network, with a particular emphasis on the role of width.

We first focus on a simple model of finite-width neural networks to study generalization, a central question in machine learning and statistical learning theory. We study the expected generalization error of a Gaussian random features model in terms of the number of features, the number of data points, the kernel it approximates, and the input distribution. Our formulas closely match numerical experiments.

Next, we explore another simplification of finite-width neural networks to study their training dynamics: we assume a linear activation function, which results in a linear predictor. The training dynamics nevertheless remain non-trivial. In particular, the loss function is non-convex: the orthogonal symmetry gives rise to manifolds of saddle points at various loss levels. These saddle points exhibit a particular arrangement, wherein the escape direction of one saddle channels the trajectory towards the next. By gluing the local trajectories between saddles, we describe a so-called saddle-to-saddle dynamics that provably arises for very small initializations.

To study finite-width neural networks without resorting to a simplified model, we shift our focus to the structure of the network parameterization and the permutation symmetry among hidden neurons. We identify a neuron-splitting technique that maps a critical point of a network to a manifold of symmetry-induced critical points of a wider network. By considering all possible neuron partitions and their permutations, we establish the precise scaling law for the number of critical manifolds. This scaling law behaves as $e^{c(\alpha)} m^m$ for large $m$, where $m$ is the width of the wider network and $\alpha$ is the shrinkage factor, i.e., the ratio of the number of distinct neurons to $m$. Notably, the maximum of $c(\alpha)$ is attained at $\alpha^* = \frac{1}{2 \log(2)}$; hence $\alpha^*$ is the shrinkage factor that induces the most numerous symmetry-induced critical manifolds.

We then apply this scaling law to overparameterized networks. The key question is: can we give a rule of thumb for how much overparameterization is needed to ensure reliable convergence to a zero-loss solution? Our approach is based on studying the geometry and topology of the zero-loss solutions of overparameterized neural networks. We prove that \textit{all} zero-loss solution manifolds are identical up to neuron splitting, zero-neuron addition, and permutation for input distributions with full support. Additionally, we give the scaling law for the number of zero-loss manifolds. The ratio between the two scaling laws yields a measure of landscape complexity that decays with overparameterization. We observe that the complexity decreases rapidly until the overparameterization factor reaches approximately $2\log(2)$, beyond which the complexity becomes smaller than one. Overall, we recommend using an overparameterization factor of at least $2$ to $4$ to ensure reliable convergence to a zero-loss solution.
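To make the neuron-splitting map concrete, consider an illustration that is not spelled out in the abstract itself (the notation here is ours): a one-hidden-layer network $f(x) = \sum_{i=1}^{m^*} a_i \, \sigma(w_i^\top x)$ with $m^*$ distinct hidden neurons. Splitting the first neuron into two copies that share the incoming weight $w_1$ and divide the outgoing weight $a_1$ as $\lambda a_1$ and $(1-\lambda) a_1$ gives, for every $\lambda$,

$$\lambda a_1 \, \sigma(w_1^\top x) + (1-\lambda) a_1 \, \sigma(w_1^\top x) + \sum_{i=2}^{m^*} a_i \, \sigma(w_i^\top x) \;=\; f(x),$$

so the wider network of width $m^* + 1$ computes exactly the same function for every choice of $\lambda$. Combined with permutations of the hidden neurons, this is the mechanism by which a single critical point of a narrow network gives rise to whole manifolds of symmetry-induced critical points of a wider network.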
