Neural Networks

Volume 142, October 2021, Pages 138-147

Probabilistic robustness estimates for feed-forward neural networks

https://doi.org/10.1016/j.neunet.2021.04.037

Abstract

Robustness of deep neural networks is a critical issue in practical applications. For general feed-forward neural networks (including convolutional architectures) under random noise attacks, we propose to study the probability that the output of the network deviates from its nominal value by more than a given threshold. We derive a simple concentration inequality for the propagation of the input uncertainty through the network using the Cramér–Chernoff method and estimates of the local variation of the neural network mapping computed at the training points. We further discuss and exploit the resulting condition on the network to regularize the loss function during training. Finally, we assess the proposed tail probability estimates empirically on various public datasets and show that the observed robustness is very well estimated by the proposed method.

Introduction

Deep neural networks have proven very effective in practice at performing highly complex learning tasks (Goodfellow, Bengio, & Courville, 2016). Owing to this success, they have gained a great deal of attention in the past few years and have been applied widely. However, they have also been found to be very sensitive to data uncertainties (Fawzi et al., 2017, Szegedy et al., 2014), to the point that a whole research community is now addressing so-called network attacks, studying and designing input noise that can fool the network decision. Attacks can be random, when data are corrupted by some random noise, or adversarial, when the noise is specifically designed to alter the network output (Szegedy et al., 2014). Even though the two types of attacks are related, since both concern the robustness of the network, in this article we focus only on the random case. Most data are uncertain, either because they arise from naturally noisy phenomena of which we only have access to some statistics, or because measurement devices do not have sufficient accuracy to record the data precisely. In this study, we therefore assume that the network input data are corrupted by some additive bounded random noise.

Robustness to bounded input perturbations has been analyzed in the past few years. Most approaches address the problem through regularization techniques (Finlay et al., 2018, Gouk et al., 2018, Oberman and Calder, 2018, Virmaux and Scaman, 2018). The main idea is to consider the neural network as a Lipschitz map between the input and output data. The Lipschitz constant of the network is then estimated or upper bounded by the product of the layer weight matrix norms. This quantifies the expansion or contraction capability of the network and is then used to regularize the loss during training. Often, there is a price to pay: the expressiveness of the network may be reduced, especially if the weights are too constrained or constrained layer by layer instead of across layers (Couellan, 2021). Such strategies enforce robustness but do not provide guarantees or estimates of the level of robustness that has been achieved. In the case of adversarial perturbations, some authors have proposed methods for certifying robustness (Boopathy et al., 2018, Kolter and Wong, 2017). Recently, a probabilistic approach has also been proposed in the case of random noise for convolutional neural networks (Weng, Chen, Nguyen, Squillante, Boopathy, Oseledets, & Daniel, 2019). Pointing out that the threat of random noise may have been overlooked by the research community in favor of adversarial attacks, the authors proposed probabilistic bounds based on the idea that the output of the network can be lower and upper bounded by two linear functions.

The work proposed here is along the same lines but differs in several aspects. It combines upper bounds on tail probabilities, obtained by deriving a specific Cramér–Chernoff concentration inequality for the propagation of uncertainty through the network, with a network sensitivity estimate based on the gradient of the network with respect to its inputs. The network gradient is computed by automatic differentiation and estimates the local variation of the output with respect to the input of the network; the estimation is carried out and averaged over the complete training set. A maximum component-wise gradient variation is also calculated in order to give probabilistic certificates rather than estimates. The certificates can be used in place of estimates whenever guaranteed upper bounds are needed; however, they are often not as accurate, since they are based on variation bounds rather than averages. For the specific case of piece-wise linear activation functions, we also propose an alternative bound based on the calculation of an average activation operator matrix computed at each layer, also using the training samples. We then discuss the use of the derived bounds and estimates to regularize the neural network during training in order to reach regions of the weight space with stronger robustness properties. Finally, we design experiments to assess the probabilistic robustness estimates under various regularization strategies.
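To make the sensitivity estimate concrete, here is a minimal sketch of the gradient computation described above: the input gradient of the network is obtained by automatic differentiation and aggregated over the training set, once as an average (for estimates) and once as a component-wise maximum (for certificates). The architecture, data, and all names are illustrative assumptions, not the configuration used in the paper.

    import numpy as np
    import tensorflow as tf

    # Illustrative network and training data (assumptions for this sketch).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(1),
    ])
    x_train = tf.constant(np.random.rand(256, 16), dtype=tf.float32)

    with tf.GradientTape() as tape:
        tape.watch(x_train)
        y = model(x_train)                 # one scalar output per training point
    grads = tape.gradient(y, x_train)      # input gradient at each training point

    # Averaged local variation (estimate) vs. component-wise maximum (certificate).
    avg_sensitivity = tf.reduce_mean(tf.norm(grads, axis=1))
    max_sensitivity = tf.reduce_max(tf.abs(grads), axis=0)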

The article is organized as follows: Section 2 derives the specific neural network concentration inequality using the Cramér–Chernoff method; Section 3 presents the calculation of the network gradient estimate and of the average activation operator for the case of piece-wise linear activations. Section 4 deals with the training of the neural network and its regularization to increase robustness. Section 5 provides the results of an empirical evaluation of the neural network robustness on various public datasets. Finally, Section 6 concludes the article.


Probabilistic certificates of robustness

Consider feed-forward neural networks that we represent as a successive composition of weighted linear combinations and activation functions such that $x^l = f^l\big((W^l)^\top x^{l-1} + b^l\big)$ for $l = 1, \ldots, L$, where $x^{l-1} \in \mathbb{R}^{n_{l-1}}$ is the input of the $l$th layer, the function $f^l$ is the $L_{f^l}$-Lipschitz continuous activation function at layer $l$, and $W^l \in \mathbb{R}^{n_{l-1} \times n_l}$ and $b^l \in \mathbb{R}^{n_l}$ are the weight matrix and bias vector between layers $l-1$ and $l$ that define our model parameter $\theta = \{W^l, b^l\}_{l=1}^{L}$, which we want to estimate during training. The network can be seen …
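As a minimal illustration of this recursion (with hypothetical dimensions, and ReLU activations, which are 1-Lipschitz), the following sketch evaluates the layers one by one:

    import numpy as np

    def forward(x, weights, biases, activations):
        """Evaluate x^l = f^l((W^l)^T x^{l-1} + b^l) for l = 1, ..., L."""
        for W, b, f in zip(weights, biases, activations):
            x = f(W.T @ x + b)
        return x

    relu = lambda z: np.maximum(z, 0.0)   # 1-Lipschitz activation
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((16, 32)), rng.standard_normal((32, 1))]  # W^l in R^{n_{l-1} x n_l}
    biases = [np.zeros(32), np.zeros(1)]
    y = forward(rng.standard_normal(16), weights, biases, [relu, relu])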

General neural network activations

Remember that the bound derived above relies on the fact that we have considered the linear upper bound of the neural network response. Therefore, the inequality (2) applied to the multi-layer case gives
$$P_{\varepsilon \sim D}\left(\left\|\tilde{x}^L - x^L\right\| \geq \bar{\varepsilon}_L\right) \;\leq\; \Gamma\!\left(\frac{\bar{\varepsilon}_L}{L_f \prod_{l=1}^{L} \left\|W^l\right\|}\right),$$
where $\tilde{x}^L$ is the network output for the perturbed input and $L_f = \prod_{l=1}^{L} L_{f^l}$. Even if $L_{f^l}$ is known for all levels (e.g., $L_{f^l} = 1$ for all $l$ if all network activations are ReLU), their product $L_f$ may be a loose bound for the network Lipschitz constant. This means that the Chernoff bound proposed above may be tight with respect to …
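The quantity $L_f \prod_{l=1}^{L} \|W^l\|$ appearing in this bound can be computed directly from the trained weights. A minimal sketch, assuming the spectral norm is the operator norm in use (the paper's choice of norm may differ):

    import numpy as np

    def lipschitz_product_bound(weights, activation_lipschitz):
        """Layer-by-layer upper bound L_f * prod_l ||W^l|| on the Lipschitz constant."""
        L_f = np.prod(activation_lipschitz)                   # product of the L_{f^l}
        norms = [np.linalg.norm(W, ord=2) for W in weights]   # largest singular value per layer
        return L_f * np.prod(norms)

For ReLU activations, $L_{f^l} = 1$ for every layer, so the bound reduces to the product of the weight norms alone.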

Controlling the bound during training

In this section, we are interested in exploiting the bounds derived above during the training of the neural network. The main idea is to ensure that the optimal weights after training satisfy the bound constraint (11) or (13). Naturally, this could be formulated as a constrained optimization training problem, and stochastic projected gradient techniques (Lacoste-Julien et al., 2012, Nedic and Lee, 2014) could be used to solve it. However, in the general case, the …
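As a hedged sketch of the penalty alternative to such constrained formulations, the weight-norm product from the bound can instead be added to the loss as a soft penalty; the coefficient lam, the Frobenius norm, and all names below are assumptions for illustration, not the authors' exact formulation:

    import tensorflow as tf

    lam = 1e-3  # hypothetical regularization strength

    def regularized_loss(model, y_true, y_pred):
        data_loss = tf.reduce_mean(tf.keras.losses.mse(y_true, y_pred))
        # Penalize the log of the product of layer weight norms (kernels only),
        # i.e. the sum of log-norms, to keep the robustness bound small.
        log_prod = tf.add_n([tf.math.log(tf.norm(W) + 1e-12)
                             for W in model.trainable_weights if W.shape.rank == 2])
        return data_loss + lam * log_prod

Working with the log of the product keeps the penalty numerically stable as the number of layers grows.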

Experiments

In order to assess the quality of the estimated probability bounds, experiments are conducted on two types of datasets (regression and classification). The neural network, its training, and its testing are implemented in the Python (Team, 2015) environment using the Keras (Chollet et al., 2015) library with a TensorFlow (Abadi et al., 2015) backend. The general network gradient strategy and the activation operator strategy presented in Section 3 are both tested, and the results are presented next.
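To clarify what the activation operator strategy refers to, here is a minimal sketch for ReLU networks of an average activation pattern computed at each layer over the training samples; the function and variable names are hypothetical, and the exact operator averaged in the paper may differ:

    import numpy as np

    def average_activation_patterns(weights, biases, x_train):
        """Per layer, average the 0/1 ReLU activation pattern over the training set."""
        patterns = []
        X = x_train                                # shape (n_samples, n_0)
        for W, b in zip(weights, biases):
            Z = X @ W + b                          # pre-activations, shape (n_samples, n_l)
            patterns.append((Z > 0).mean(axis=0))  # mean diagonal activation operator
            X = np.maximum(Z, 0.0)                 # propagate through the ReLU
        return patterns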

Conclusions

In this study, we have proposed analytical probabilistic robustness estimates (and certificates) for feed-forward neural networks. The idea combines a tail probability bound calculated using the Cramér–Chernoff scheme with an estimate of the local variation of the network. The network gradient computation uses the automatic differentiation procedure available in many neural network training packages and is carried out only at the training samples, which incurs little extra computational cost. In the …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Our work has benefited from the AI Interdisciplinary Institute ANITI. ANITI is funded by the French “Investing for the Future - PIA3” program under the Grant agreement # ANR-19-PI3A-0004.

References (30)

  • Harrison, D., et al. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management (1978).
  • Zhou, D.-X. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis (2020).
  • Abadi, M., et al. TensorFlow: Large-scale machine learning on heterogeneous systems (2015).
  • Allaire, G. Numerical analysis and optimization (2007).
  • Bishop, C. M.
  • Boopathy, A., et al. CNN-Cert: An efficient framework for certifying robustness of convolutional neural networks (2018).
  • Boucheron, S., et al. Concentration inequalities: A nonasymptotic theory of independence (2013).
  • Chollet, F. Keras (2015).
  • Couellan, N. The coupling effect of Lipschitz regularization in deep neural networks. SN Computer Science (2021).
  • Dembo, A. Bounds on the extreme eigenvalues of positive-definite Toeplitz matrices. IEEE Transactions on Information Theory (1988).
  • Fawzi, A., et al. The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine (2017).
  • Finlay, C., et al. Improved robustness to adversarial examples using Lipschitz regularization of the loss (2018).
  • Goodfellow, I., et al. Deep learning (2016).
  • Gouk, H., et al. Regularisation of neural networks by enforcing Lipschitz continuity (2018).
  • Kingma, D. P., et al. Adam: A method for stochastic optimization (2015).