
Dealing with the evaluation of supervised classification algorithms

Artificial Intelligence Review

Abstract

Assessing the performance of a learning method in terms of its prediction ability on independent data is extremely important in supervised classification. This process provides the information needed to evaluate the quality of a classification model and to choose the most appropriate technique for the specific supervised classification problem at hand. This paper reviews the most important aspects of the evaluation process of supervised classification algorithms. The overall evaluation process is put in perspective to lead the reader to a deep understanding of it. Additionally, recommendations about the use and limitations of the reviewed methods, as well as a critical view of them, are presented according to the specific characteristics of the supervised classification problem scenario.

Notes

  1. The output of the classifier for a given instance is just a class assignment.

  2. A continuous classifier is one that yields a numeric value representing the degree to which an instance is a member of a class. Therefore, a discrimination threshold is needed to obtain a discrete classifier and thus to decide the class assignment for the instance (see the first sketch after these notes).

  3. Certain types of misclassification are often more serious than others (e.g. the relative severity of misclassifications when screening for a disease that can be easily treated if detected early enough, but which is otherwise fatal).

  4. This is an estimation scheme similar to leave-one-out, but leaving j data samples in the test set (see the leave-j-out sketch after these notes).

  5. An extreme example of a classifier with strong overfitting is \( kNN \) with \(k=1\) (illustrated in a sketch after these notes).

  6. Conservative tests have a real type I error lower than the nominal \(\alpha \) value. By contrast, liberal tests have a real type I error higher than the nominal \(\alpha \) value (a simulation sketch after these notes shows how the real type I error can be estimated).

  7. The same data partition is used for both classification algorithms.

  8. Note that we abuse notation in letting N denote the size of the dataset (when dealing with only one dataset) and the number of datasets (when dealing with several datasets).

  9. Homoscedasticity, in the case of comparing several classification algorithms in one dataset, implies equal variability of the classification behavior for every algorithm considered in the study.

  10. \(\bar{S}^j=\frac{1}{K}\sum _{k=1}^K S_k^j, \text{ with } j=1,\cdots ,N\); see the corresponding sketch after these notes.

  11. What counts as the best difference depends on the selected score: if the best-performing algorithm is the one with the maximum score, then the higher the difference the better, and vice versa.

  12. http://sci2s.ugr.es/sicidm/.

  13. http://labs.isa.us.es:8080/apps/statservice.

  14. https://github.com/b0rxa/scmamp.

  15. Note that confidence intervals, similarly to McNemar’s test, can be used to evaluate the differences between two specific classifiers but not the differences between the classification algorithms (a sketch of McNemar’s test is given after these notes).

  16. It is also possible to use WEKA methods from the command line and to program new methods within the WEKA framework. This requires a deeper knowledge of the tool and of programming languages, but it allows the current functionality to be extended.

  17. Several datasets can be chosen for the study but classification algorithms are compared separately for each dataset.
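
The following sketches illustrate some of the notes above. They are minimal Python examples under stated assumptions (all data, scores, and parameter values are made up for illustration), not implementations taken from the paper. This first sketch, for note 2, shows how the output of a continuous classifier is discretized with a discrimination threshold.

```python
import numpy as np

# Hypothetical scores of a continuous classifier: degree of membership
# of each instance in the positive class (e.g. estimated probabilities).
scores = np.array([0.91, 0.42, 0.67, 0.05, 0.73])

def discretize(scores, threshold=0.5):
    """Apply a discrimination threshold to obtain a discrete classifier."""
    return (scores >= threshold).astype(int)   # 1 = positive class, 0 = negative class

print(discretize(scores))        # threshold 0.5     -> [1 0 1 0 1]
print(discretize(scores, 0.7))   # stricter threshold -> [1 0 0 0 1]
```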
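
For note 4, a sketch of the leave-j-out scheme, assuming scikit-learn is available and using its LeavePOut splitter; the classifier (naive Bayes) and j = 2 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeavePOut
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]              # 30-instance subset: leave-j-out enumerates all C(N, j) splits

j = 2                              # size of each test set (j = 1 recovers leave-one-out)
errors = []
for train_idx, test_idx in LeavePOut(p=j).split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

print("estimated classification error:", np.mean(errors))
```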
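
For note 5, a sketch showing why kNN with k = 1 is an extreme example of overfitting: on purely random data it reproduces the training labels perfectly, while its cross-validated accuracy stays around chance level.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # pure-noise features
y = rng.integers(0, 2, size=200)       # labels unrelated to X

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("resubstitution accuracy:", knn.score(X, y))                        # 1.0 by construction
print("10-fold CV accuracy:", cross_val_score(knn, X, y, cv=10).mean())   # close to 0.5
```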
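
For note 6, the real type I error of a test can be estimated by simulating many repetitions under the null hypothesis and counting rejections at the nominal α. The test and the data-generating process below are generic placeholders (a paired t-test on synthetic fold scores), chosen only to show the procedure.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
alpha, n_repetitions, n_folds = 0.05, 2000, 10
rejections = 0

for _ in range(n_repetitions):
    # Paired fold scores of two algorithms with identical true performance (H0 holds).
    scores_a = rng.normal(loc=0.80, scale=0.05, size=n_folds)
    scores_b = scores_a + rng.normal(loc=0.0, scale=0.02, size=n_folds)
    if ttest_rel(scores_a, scores_b).pvalue < alpha:
        rejections += 1

real_type_I_error = rejections / n_repetitions
print(f"nominal alpha = {alpha}, estimated real type I error = {real_type_I_error:.3f}")
# A value below alpha would indicate a conservative test, above alpha a liberal one.
```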
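
For note 10, assuming the K × N fold scores are stored in a matrix S with S[k, j] the score of fold k on dataset j (a hypothetical layout), \(\bar{S}^j\) is simply a column-wise mean.

```python
import numpy as np

K, N = 10, 5                               # K folds, N datasets (hypothetical sizes)
rng = np.random.default_rng(2)
S = rng.uniform(0.6, 0.9, size=(K, N))     # S[k, j] = score of fold k on dataset j

S_bar = S.mean(axis=0)                     # S_bar[j] = (1/K) * sum_k S[k, j]
print(S_bar)                               # one averaged score per dataset
```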
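
For note 15, a sketch of McNemar's test in its usual continuity-corrected chi-square form, applied to made-up discordant counts from a common test set; a confidence interval on the difference of the two paired error proportions would play a similar role.

```python
from scipy.stats import chi2

# Discordant counts on a common test set (made-up values):
# b = instances misclassified by classifier A but not by classifier B,
# c = instances misclassified by B but not by A.
b, c = 18, 7

# McNemar's test with continuity correction.
statistic = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(statistic, df=1)
print(f"McNemar statistic = {statistic:.3f}, p-value = {p_value:.3f}")
```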

Acknowledgments

This work has been partially supported by the Saiotek and Research Groups (IT609-13) programs (Basque Government), TIN2013-41272-P (Spanish Ministry of Economy and Competitiveness), and Provincial Council of Gipuzkoa (2011-CIEN-000060-01).

Author information

Corresponding author

Correspondence to Guzman Santafe.

About this article

Cite this article

Santafe, G., Inza, I. & Lozano, J.A. Dealing with the evaluation of supervised classification algorithms. Artif Intell Rev 44, 467–508 (2015). https://doi.org/10.1007/s10462-015-9433-y

