Abstract
This paper investigates the efficacy of the cross-entropy and squared-error objective functions used to train feed-forward neural networks that estimate posterior probabilities. Previous research has found no appreciable difference between neural network classifiers trained with cross-entropy and those trained with squared-error. The approach employed here, however, shows that cross-entropy has significant practical advantages over squared-error.
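For concreteness, with network outputs \(y_{k}\) and binary targets \(t_{k}\) for \(k = 1, \ldots, K\) classes, the two objective functions take their standard per-pattern forms (the notation below is ours, not taken from the paper):
\[
E_{\mathrm{SE}} = \sum_{k=1}^{K} \left(t_{k} - y_{k}\right)^{2}, \qquad
E_{\mathrm{CE}} = -\sum_{k=1}^{K} \left[\, t_{k} \ln y_{k} + \left(1 - t_{k}\right) \ln\left(1 - y_{k}\right) \right].
\]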





Appendix
This appendix gives the parameters used to generate the simulated distributions for the illustration problems.
1.1 Trivariate normal (z1)
Let \(\mu_{ij}\) and \(\sigma^{2}_{ij}\) be the mean and variance of normal variable i for group j. The mean and variance parameters for Group 1 are \((\mu_{11}, \sigma^{2}_{11}) = (\mu_{21}, \sigma^{2}_{21}) = (\mu_{31}, \sigma^{2}_{31}) = (10.0, 25.0)\). For Group 2, the parameters are \((\mu_{12}, \sigma^{2}_{12}) = (\mu_{22}, \sigma^{2}_{22}) = (\mu_{32}, \sigma^{2}_{32}) = (5.5, 25.0)\), and for Group 3, \((\mu_{13}, \sigma^{2}_{13}) = (\mu_{23}, \sigma^{2}_{23}) = (\mu_{33}, \sigma^{2}_{33}) = (7.5, 25.0)\). Let \(\Sigma_{1}\), \(\Sigma_{2}\), and \(\Sigma_{3}\) be the variance-covariance matrices for Groups 1, 2, and 3, respectively. For this example, the three groups share a common variance-covariance matrix, \(\Sigma_{1} = \Sigma_{2} = \Sigma_{3} = \Sigma\).
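A minimal sketch of drawing this trivariate normal data with NumPy is given below. The off-diagonal entries of the shared matrix \(\Sigma\) are not reproduced in this appendix, so the value cov_offdiag used here is an illustrative placeholder, not the paper's; the group means and the common variance of 25.0 follow the parameters above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Group means for the three normal variables; each variable has variance 25.0.
means = {1: [10.0, 10.0, 10.0], 2: [5.5, 5.5, 5.5], 3: [7.5, 7.5, 7.5]}

# The shared covariance matrix Sigma from the article is not reproduced here;
# cov_offdiag = 5.0 is a placeholder assumption, not the paper's value.
cov_offdiag = 5.0
sigma = np.full((3, 3), cov_offdiag)
np.fill_diagonal(sigma, 25.0)

# Draw 500 trivariate normal observations for each group.
samples = {g: rng.multivariate_normal(means[g], sigma, size=500) for g in (1, 2, 3)}
```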
1.2 Bivariate Bernoulli (z2)
Let \(z_{2} = (Z_{1j}, Z_{2j})\) be the bivariate Bernoulli variable for group j, where \(P(Z_{1j} = 1) = p_{1j}\), \(P(Z_{2j} = 1) = p_{2j}\), and \(\rho_{j}\) is the correlation coefficient between the two components. For Group 1, \(p_{11} = 0.8\), \(p_{21} = 0.7\), and \(\rho_{1} = 0.2\). For Group 2, \(p_{12} = 0.5\), \(p_{22} = 0.55\), and \(\rho_{2} = 0.4\). For Group 3, \(p_{13} = 0.425\), \(p_{23} = 0.4\), and \(\rho_{3} = 0.6\).
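One way to draw correlated Bernoulli pairs with the marginals and correlations above is to build the implied 2x2 joint distribution from the identity \(P(Z_{1}=1, Z_{2}=1) = p_{1}p_{2} + \rho\sqrt{p_{1}(1-p_{1})p_{2}(1-p_{2})}\). The sketch below assumes this construction (the paper does not describe its sampler); function names are ours.

```python
import numpy as np

def sample_bivariate_bernoulli(p1, p2, rho, size, rng):
    # Joint probability P(Z1=1, Z2=1) implied by the marginals and correlation.
    p11 = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    p10 = p1 - p11              # P(Z1=1, Z2=0)
    p01 = p2 - p11              # P(Z1=0, Z2=1)
    p00 = 1.0 - p11 - p10 - p01 # P(Z1=0, Z2=0)
    # All four cells are nonnegative for the group parameters listed above.
    cells = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    idx = rng.choice(4, size=size, p=[p00, p01, p10, p11])
    return cells[idx]

rng = np.random.default_rng(0)
group1 = sample_bivariate_bernoulli(0.8, 0.7, 0.2, 1000, rng)
group2 = sample_bivariate_bernoulli(0.5, 0.55, 0.4, 1000, rng)
group3 = sample_bivariate_bernoulli(0.425, 0.4, 0.6, 1000, rng)
```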
1.3 Weibull (z3, z4)
For Weibull variable i, let \(\alpha_{ij}\) be the shape parameter and \(\beta_{ij}\) be the scale parameter for group j. For this example, the parameters for the first Weibull variable z3 are \(\alpha_{11} = 4.0\), \(\beta_{11} = 1.0\); \(\alpha_{12} = 1.5\), \(\beta_{12} = 1.0\); and \(\alpha_{13} = 2.0\), \(\beta_{13} = 1.0\). For the second Weibull variable z4, \(\alpha_{21} = 0.35\), \(\beta_{21} = 1.0\); \(\alpha_{22} = 0.55\), \(\beta_{22} = 1.0\); and \(\alpha_{23} = 0.6\), \(\beta_{23} = 1.0\). Because the shape parameters of z3 exceed 1 while those of z4 are below 1, z3 has a concave (unimodal) density and z4 a convex, monotonically decreasing one.
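A short sketch of generating these Weibull variates with NumPy follows; NumPy's generator draws from a unit-scale Weibull, so samples are rescaled by \(\beta\). The dictionary and function names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape (alpha) parameters per group for the two Weibull variables; the scale
# (beta) is 1.0 for every group, as listed above.
z3_shapes = {1: 4.0, 2: 1.5, 3: 2.0}    # first Weibull variable (z3)
z4_shapes = {1: 0.35, 2: 0.55, 3: 0.6}  # second Weibull variable (z4)
beta = 1.0

def weibull_sample(shape, scale, size, rng):
    # rng.weibull draws with scale 1; multiply by beta to set the scale.
    return scale * rng.weibull(shape, size=size)

z3_group1 = weibull_sample(z3_shapes[1], beta, 500, rng)
z4_group1 = weibull_sample(z4_shapes[1], beta, 500, rng)
```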
1.4 Binomial (z5)
Let T be the number of Bernoulli random variables, indexed \(t = 1, 2, \ldots, T\), composing the binomial random variate, and let \(p_{j}\) be the probability that each Bernoulli variable for group j equals 1. Then \(\mu_{j} = Tp_{j}\) and \(\sigma^{2}_{j} = Tp_{j}(1 - p_{j})\). For this example, \(p_{1} = 0.5\), \(p_{2} = 0.3\), \(p_{3} = 0.7\), and T = 10.
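Generating z5 is a direct binomial draw; the brief NumPy sketch below uses the parameters above (variable names are ours) and checks the sample mean against the theoretical mean \(Tp_{j}\).

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10                          # number of Bernoulli trials per variate
p = {1: 0.5, 2: 0.3, 3: 0.7}    # success probability for each group

# z5 for group j is Binomial(T, p_j): mean T*p_j, variance T*p_j*(1 - p_j).
z5_group1 = rng.binomial(T, p[1], size=500)
print(z5_group1.mean(), T * p[1])  # sample mean vs. theoretical mean 5.0
```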
Cite this article
Kline, D.M., Berardi, V.L. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput & Applic 14, 310–318 (2005). https://doi.org/10.1007/s00521-005-0467-y