CONTRIBUTED ARTICLE
Some numerical aspects of the training problem for feed-forward neural nets
Introduction
It is now well established that feed-forward neural networks can be used as universal approximators. The training of such networks in order to approximate functions is often formulated as a nonlinear least-squares problem, numerically equivalent to nonlinear regression. A major body of optimisation techniques is therefore available to analyse and solve it. An important factor in the performance of such methods, as well as the confidence which can be placed in the solutions they obtain, is the numerical conditioning of the error function. Ill-conditioning may conveniently be thought of as the situation in which the value of a function of several variables is locally very much more sensitive to changes in the values of certain combinations of its variables than it is to other variations. Although some authors have discussed aspects of this problem, the implications have perhaps not been widely realised. As evidence of this, examples can be found in the recent literature of problems formulated in such a way that extreme ill-conditioning is guaranteed by the inadequate amount of training data used. Even when this is not the case, ill-conditioning is definitely a common feature of the training problem and cannot be ignored (Dixon and Mills, 1991; Saarinen et al., 1991; Ellacott, 1993). The paper by Saarinen et al., in particular, covers similar ground to the present one in considerable detail but without the use of a geometrical interpretation of ill-conditioning; it does not seem to have been as widely appreciated as it might have been. It is hoped that the present discussion will contribute to the debate by clarifying some of these issues, particularly by a detailed analysis of an instance of a functional approximation problem.
Summary of the nonlinear least-squares problem
The supervised training problem for feed-forward artificial neural networks can be formulated as one of minimising, as a function of the weights, the sum of squares of the differences between the predicted and training outputs. The number of variables (n) is usually equal to the sum of the number of connections and neurons in the network, while the number (m) of differences, or residual errors, is equal to the product of the number of outputs from the network (o) and the number of training sets (p).
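In Python this formulation can be sketched as follows; the function names and array layout are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sse(f, w, X, T):
    """Sum of squared residuals over all p training patterns.

    f(w, x) stands in for the network's forward pass; X holds the p input
    patterns and T the corresponding (p, o) target outputs, so the residual
    vector has m = o * p components while the weight vector w has n.
    """
    R = np.array([f(w, x) for x in X]) - np.asarray(T)  # (p, o) residual matrix
    return float(np.sum(R ** 2))
```

The problem is overdetermined in the usual regression sense only when m is at least n; as noted below, training sets that fail to satisfy this comfortably invite severe ill-conditioning.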
Feed-forward training problems are ill-conditioned
It has been observed that the feed-forward training problem is almost invariably ill-conditioned or singular. In this respect it is far from being unique among nonlinear regression problems. There are several distinct sources of ill-conditioning in the case of neural networks, the following being perhaps the main ones.
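One common source, nearly duplicate hidden units, can be made visible numerically: the corresponding columns of the residual Jacobian J become almost parallel, and the ratio of the extreme singular values of J (which governs the conditioning of the Gauss-Newton Hessian approximation 2 JᵀJ) explodes. The finite-difference step and the toy residual function below are illustrative choices, not taken from the paper.

```python
import numpy as np

def jacobian_fd(r, w, h=1e-6):
    """Forward-difference Jacobian of the residual vector r at weights w."""
    r0 = r(w)
    J = np.empty((r0.size, w.size))
    for j in range(w.size):
        wp = w.copy()
        wp[j] += h
        J[:, j] = (r(wp) - r0) / h
    return J

# Two nearly duplicate "hidden units": the columns of J are almost parallel.
def residuals(w):
    x = np.linspace(0.1, 1.0, 20)
    return w[0] * np.tanh(x) + w[1] * np.tanh(1.001 * x) - x

J = jacobian_fd(residuals, np.array([0.4, 0.7]))
s = np.linalg.svd(J, compute_uv=False)
print("condition number of J:", s[0] / s[-1])  # very large ratio
```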
Example
Consider the simple problem of constructing a feed-forward network to approximate the function y=tan(x) (Fig. 1). A training set consisting of 18 patterns was constructed by computing y at 20 points uniformly distributed in the interval [0,π] and then discarding the two points on either side of the discontinuity at π/2 in order to keep the values of y within a reasonable range.
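The construction of this training set can be sketched as follows; the tolerance used to identify the two samples flanking the discontinuity is an illustrative choice:

```python
import numpy as np

x = np.linspace(0.0, np.pi, 20)          # 20 uniformly spaced sample points
spacing = np.pi / 19
keep = np.abs(x - np.pi / 2) > spacing   # drop the sample on each side of pi/2
x_train, y_train = x[keep], np.tan(x[keep])
print(len(x_train))                      # 18 patterns remain
```

Discarding those two samples keeps the largest target magnitude below about 4, which is what keeps the values of y within a reasonable range.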
Fig. 2 shows the 1/4/1 network chosen for the approximation. Let w_{ijk} denote the weight on the link from node i in layer (k-1) to node j in layer k.
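A forward pass through a 1/4/1 network of this kind can be sketched as follows; the tanh hidden units, linear output and parameter layout are illustrative assumptions:

```python
import numpy as np

def forward(params, x):
    """Forward pass of a 1/4/1 network for a scalar input x."""
    W1, b1, W2, b2 = params                  # shapes (4,1), (4,), (1,4), (1,)
    h = np.tanh(W1 @ np.atleast_1d(x) + b1)  # four hidden activations
    return (W2 @ h + b2)[0]                  # single linear output

# 8 connection weights + 5 neuron biases = 13 variables, matching
# n = connections + neurons in the formulation above.
rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 1)), rng.normal(size=4),
          rng.normal(size=(1, 4)), rng.normal(size=1))
y = forward(params, 0.3)
```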
Implications of ill-conditioning for the choice of training algorithm
The effect of ill-conditioning on the minimisation process is well understood. If first-order methods such as steepest descent are used, the iterations will tend to favour directions corresponding to large eigenvalues of the Hessian matrix. This results in extremely slow convergence along the flat valleys associated with ill-conditioning. Back-propagation is of course equivalent to steepest descent when used in batch mode, i.e. when all training sets are taken into account at each step.
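The behaviour is easy to reproduce on a model problem (an illustration, not from the paper): steepest descent with a fixed step on a quadratic whose Hessian eigenvalues differ by a factor of 1000 converges immediately in the stiff direction but crawls along the flat one.

```python
import numpy as np

a, b = 1.0, 1000.0                 # Hessian eigenvalues; condition number 1000
w = np.array([1.0, 1.0])
lr = 1.0 / b                       # largest step stable in the stiff direction
for _ in range(100):
    grad = np.array([a * w[0], b * w[1]])
    w = w - lr * grad
print(w)                           # roughly [0.905, 0.0]
```

After 100 iterations the component along the large eigenvalue has converged exactly, while the component along the small eigenvalue has decayed by only about 10 per cent.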
Implications of ill-conditioning for global optimisation
The methods discussed above are intended to locate local minima, that is, points at each of which the error function has the lowest value in its immediate vicinity. Second-order methods, general or specialised, are in principle no more likely to find a global minimum than are first-order methods.
The problem of finding a global minimum, even over a set of distinct, well-defined local minima, is a difficult one because of the fundamental impossibility of recognising such a global minimum using only local information about the function.
Conclusions
The training problem for neural networks is usually an ill-conditioned one. This property is inherent in the form of the networks and their excitation functions, but can be exacerbated by the use of insufficient numbers of training sets, or training sets deficient in independent information, or over-complex networks. The presence of ill-conditioning makes first-order minimisation methods unlikely to be efficient, but also affects second-order methods. It can make second-order methods such as Gauss-Newton behave unreliably unless they are safeguarded against rank deficiency in the Jacobian.
Acknowledgements
The work described in this paper was carried out with the help of funding from the European Community through the Human Capital and Mobility Project 'Models, algorithms and systems for decision making', contract no. CHRX-CT93-0087.
References (13)
- Dixon, L. C. W. and Szego, G. P. (1975). Towards global optimisation. Amsterdam/New York: North-Holland/American...
- Dixon, L. C. W. and Szego, G. P. (1978). Towards global optimisation 2. Amsterdam/New York: North-Holland/American...
- Dixon, L. C. W. and Mills, D. J. (1991). Neural networks and nonlinear optimization I: the representation of continuous...
- Ellacott, S. W. (1993). The numerical analysis approach. In J. Taylor (Ed.), Mathematical approaches to neural...
- Gorse, D., Shepherd, A. and Taylor, J. G. (1994). A classical algorithm for avoiding local minima. 1994 International...
- McKeown, J. J. (1975). Specialised versus general-purpose algorithms for minimizing functions that are sums of squared...