Abstract
After discussing the construction of machine learning (ML) algorithms in the previous chapter, this chapter is dedicated to their assessment and performance estimation (with an emphasis on classification assessment), a topic that is equally important, especially in the context of cyberphysical security design. The literature is full of nonparametric methods for estimating a statistic from just one available dataset through resampling techniques, e.g., the jackknife, the bootstrap, and cross validation (CV). Statistics of special interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge of the probability distribution of the data or of the construction details of the ML algorithm. This chapter provides a concise review of this literature and establishes a coherent theoretical framework for these methods, which can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on retraining and retesting an ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to traditional ML algorithms rather than the very computationally demanding recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence the relevance of this chapter to this field.
References
Barndorff-Nielsen OE, Cox DR (1989) Asymptotic techniques for use in statistics. Chapman and Hall, New York
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International Group, Belmont
Chen W, Gallas BD, Yousef WA (2012) Classifier variability: accounting for training and testing. Pattern Recogn 45(7):2661–2671
Efron B (1979) Bootstrap methods: another look at the Jackknife. Ann Stat 7(1):1–26
Efron B (1981) Nonparametric estimates of standard error: the Jackknife, the bootstrap and other methods. Biometrika 68(3):589–599
Efron B (1982) The Jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia
Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331
Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
Efron B, Stein C (1981) The Jackknife estimate of variance. Ann Stat 9(3):586–596
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York
Efron B, Tibshirani R (1995) Cross validation and the bootstrap: estimating the error rate of a prediction rule. Technical report 176, Stanford University, Department of Statistics
Efron B, Tibshirani R (1997) Improvements on cross-validation: the \(.632+\) Bootstrap method. J Am Stat Assoc 92(438):548–560
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
Hájek J, Šidák Z, Sen PK (1999) Theory of rank tests, 2nd edn. Academic Press, San Diego
Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393
Hampel FR (1986) Robust statistics : the approach based on influence functions. Wiley, New York
Hanley JA (1989) Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 29(3):307–335
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
Huber PJ (1996) Robust statistical procedures, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia
Jaeckel L (1972) The infinitesimal jackknife. Memorandum, MM 72-1215-11, Bell Lab Murray Hill
Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K (1999) Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 6(1):22–33
Mallows C (1974) On some topics in robustness. Memorandum, MM 72-1215-11, Bell Lab Murray Hill, NJ
Randles RH, Wolfe DA (1979) Introduction to the theory of nonparametric statistics. Wiley, New York
Sahiner B, Chan HP, Petrick N, Hadjiiski L, Paquerault S, Gurcan MN (2001) Resampling schemes for estimating the accuracy of a classifier designed with a limited data set. In: Medical image perception conference IX, airlie conference Center, Warrenton VA, 20–23
Sahiner B, Chan HP, Hadjiiski L (2008) Classifier performance prediction for computer-aided diagnosis using a limited dataset. Med Phys 35(4):1559
Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J Roy Stat Soc: Ser B (Methodol) 36(2):111–147
Swets JA (1986) Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychol Bull 99:100–117
Yousef WA (2019) A leisurely look at versions and variants of the cross validation estimator. arXiv preprint arXiv:1907.13413
Yousef WA (2021) Estimating the standard error of cross-validation-based estimators of classifier performance. Pattern Recogn Lett 146:115–145
Yousef WA, Wagner RF, Loew MH (2004) Comparison of non-parametric methods for assessing classifier performance in terms of ROC parameters. In: Proceedings of 33rd applied imagery pattern recognition workshop, 2004. IEEE Computer Society, pp 190–195
Yousef WA, Wagner RF, Loew MH (2005) Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recogn Lett 26(16):2600–2610
Yousef WA, Wagner RF, Loew MH (2006) Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach. IEEE Trans Pattern Anal Mach Intell 28(11):1809–1817
Zhang P (1995) Assessing prediction error in nonparametric regression. Scand J Stat 22(1):83–94
Acknowledgements
The author is grateful to the U.S. Food and Drug Administration (FDA) for funding a very early stage of this chapter, and to Dr. Kyle Myers for her support. In his memory, special thanks and gratitude go to Dr. Robert F. Wagner, the supervisor and teacher, or Bob Wagner, the big brother and friend, who reviewed a very early version of this chapter before he passed away.
7. Appendix
7.1 Proofs
Lemma 1
The maximum likelihood estimate (MLE) of the probability mass function under a nonparametric distribution, given a sample of n observations, is \(\hat{p}_{i}=1/n,\ i=1,\ldots ,n\).
Proof
The proof is carried out by maximizing the likelihood function \(l(f)=\prod \limits _{i=1}^{n}{p_{i}}\), which can be rewritten, under the constraint \(\sum _{i}p_{i}=1\) and using a Lagrange multiplier, as:
The likelihood (68) is maximized by taking the first derivative and setting it to zero to obtain:
These n equations along with the constraint \(\sum _{i}p_{i}=1\) can be solved straightforwardly to give \(\hat{p}_{i}=\frac{1}{n},\ i=1,\ldots ,n\), which completes the proof. \(\square \)
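Lemma 1 admits a simple numerical sanity check: sample random probability vectors from the simplex and confirm that none attains a higher likelihood than the uniform weights \(p_{i}=1/n\). The sketch below is illustrative code, not part of the chapter.

```python
import numpy as np

# Sanity check for Lemma 1: over probability vectors (p_1, ..., p_n) on the
# simplex, the nonparametric likelihood l(f) = prod_i p_i is maximized at
# p_i = 1/n.  Log-likelihoods are compared to avoid numerical underflow.
rng = np.random.default_rng(0)
n = 8
uniform_loglik = n * np.log(1.0 / n)

for _ in range(10_000):
    p = rng.dirichlet(np.ones(n))          # a random point on the simplex
    assert np.sum(np.log(p)) <= uniform_loglik

print("p_i = 1/n attains the maximum likelihood")
```

By concavity of the log, any deviation from uniform weights strictly lowers \(\sum _{i}\log p_{i}\), which is what the random search confirms.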
Lemma 2
The no-information \({\textrm{AUC}}\) is given by \(\gamma _{{\textrm{AUC}}} = 0.5\).
Proof
\(\gamma _{{\textrm{AUC}}}\), the analogue of the no-information error rate \(\gamma \), is given by (2a) but with TPF and FPF taken under the no-information distribution \(\text {E}_{0F}\) (see Sect. 3.3.4). Assume that there are \(n_{1}\) and \(n_2\) observations from classes \(\omega _{1}\) and \(\omega _{2}\), respectively, and that at a fixed threshold th the classifier achieves the values \({\textrm{TPF}}\) and \({\textrm{FPF}}\). Assume also that the sample observations are tested by the classifier and each has been assigned a decision value (score). Under the no-information distribution, consider the following construction. For every decision value \(h_{\textbf{t}}(x_{i})\) assigned to the observation \(t_{i}=(x_{i},y_{i})\), create \(n_{1}+n_{2}-1\) new observations, all with the same decision value \(h_{\textbf{t}}(x_{i})\) but with responses equal to those of the remaining \(n_{1}+n_{2}-1\) observations \(t_{j},\ j \ne i\). For this new sample of \((n_{1}+n_{2})^{2}\) observations, it is easy to see that the new TPF and FPF at the same threshold th are given by \({\textrm{FPF}}_{0\widehat{F},th}={\textrm{TPF}}_{0\widehat{F},th}= ({\textrm{TPF}}\cdot n_{1}+{\textrm{FPF}}\cdot n_{2})/(n_{1}+n_{2})\). Hence the ROC curve under the no-information distribution is the straight line with unit slope (the chance line), which directly gives \(\gamma _{{\textrm{AUC}}} = 0.5\). \(\square \)
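The construction in the proof can also be checked numerically. The sketch below (illustrative code, not from the chapter) builds the no-information sample by pairing every score with every label, so both classes end up with the same score distribution, and then evaluates the empirical AUC in its Mann-Whitney form with the usual half credit for ties:

```python
import numpy as np

def empirical_auc(s_neg, s_pos):
    """Mann-Whitney form of the empirical AUC with 1/2 credit for ties."""
    s_neg = np.asarray(s_neg, float)[:, None]
    s_pos = np.asarray(s_pos, float)[None, :]
    return np.mean((s_pos > s_neg) + 0.5 * (s_pos == s_neg))

rng = np.random.default_rng(1)
scores = rng.normal(size=12)          # decision values h(x_i) for all cases
labels = np.array([0] * 5 + [1] * 7)  # n1 = 5 from class w1, n2 = 7 from w2

# No-information sample: pair every score with every label, so each score is
# repeated n1 times in the class-w1 subsample and n2 times in the class-w2 one.
neg = np.repeat(scores, (labels == 0).sum())
pos = np.repeat(scores, (labels == 1).sum())
print(empirical_auc(neg, pos))        # -> 0.5
```

Because the two subsamples share the same score distribution, every threshold gives equal TPF and FPF, and the area under the resulting chance line is exactly 0.5.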
7.2 More on Influence Function (IF)
Assume that there is a distribution G near the distribution F; then, under some regularity conditions (see, e.g., [21], Chap. 2), a functional s can be approximated as:
The residual error can be neglected since it is of small order in probability. Some properties of (70) are:
and the asymptotic variance of s(F) under F, following from (71), is given by:
which can be considered an approximation to the variance under a distribution G near F. Now, assume that the functional s is a functional statistic of the dataset \(\textbf{x}=\{x_{i}:x_{i}\sim F,\ i=1,\ldots ,n\}\). In that case, the influence curve (23) is defined for each sample case \(x_{i}\), under the true distribution F, as:
where \(F_{\varepsilon ,i}\) is the distribution under the perturbation at observation \(x_{i}\). Equation (73) is called the IF. If the distribution F is not known, its MLE \(\hat{F}\) is given by (3), and \(\hat{F}\) may be substituted for F in (73) as an approximation. The result is then called the empirical IF [24], or the infinitesimal jackknife [22]. Under this approximation, the perturbation defined in (22) can be rewritten as:
This kind of perturbation is illustrated in Fig. 4.
It will often be useful to write the probability mass function of (74) as:
A very interesting case arises from (75) if \(-1/(n-1)\) is substituted for \(\varepsilon \): the new probability mass assigned to the point \(x_{j=i}\) in (75) becomes zero. This value of \(\varepsilon \) generates the jackknife estimate discussed in Sect. 2.2, where the whole observation is removed from the dataset.
Substituting \(\hat{F}\) for G in (70) and combining the result with (73) gives the IF approximation for any functional statistic under the empirical distribution \(\hat{F}\). The result is:
The term \(O_{p}(n^{-1})\) reads “big-O of order 1/n in probability”. In general, \(U_{n}=O_p(d_{n})\) if \(U_{n}/d_{n}\) is bounded in probability, i.e., for every \(\varepsilon >0\) there exists \(k_{\varepsilon }\) such that \(\Pr \{|U_{n}|/d_{n}<k_{\varepsilon }\}>1-\varepsilon \) for all n. This concept can be found in [1, Chap. 2]. The asymptotic variance expressed in (72) can then be given for s(F) by:
which can be approximated under the empirical distribution \(\hat{F}\) to give the nonparametric estimate of the variance for a statistic s by:
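For the simple case of the sample mean, the empirical influence values are just \(x_{i}-\bar{x}\), and the nonparametric variance estimate (the sum of squared empirical influence values divided by \(n^{2}\)) can be written out directly. The sketch below is illustrative code, not from the chapter, and assumes the mixture form \(F_{\varepsilon }=(1-\varepsilon )\hat{F}+\varepsilon \delta _{x_{i}}\) of the perturbation when recovering the same values by a finite difference:

```python
import numpy as np

# Empirical (infinitesimal-jackknife) variance estimate for the sample mean.
# For s(F) = E_F[X], the influence value at x_i is IF_i = x_i - mean(x), and
# the nonparametric variance estimate is sum(IF_i^2) / n^2.
rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=50)
n = len(x)

IF = x - x.mean()
var_if = np.sum(IF**2) / n**2

# The same influence values follow numerically from the mixture perturbation
# F_eps = (1 - eps) * F_hat + eps * delta_{x_i}, via a finite difference:
# s(F_eps) = (1 - eps) * mean(x) + eps * x_i for the mean functional.
eps = 1e-6
IF_num = np.array([((1 - eps) * x.mean() + eps * xi - x.mean()) / eps
                   for xi in x])
assert np.allclose(IF, IF_num, atol=1e-6)

print(var_if)   # close to the usual sigma^2 / n estimate of Var(mean)
```

For the mean the finite difference is exact, since the functional is linear in the perturbation; for nonlinear statistics the finite difference only approximates the IF.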
7.3 ML in Other Fields
In this section, we provide brief miscellanea from other fields so the reader can see the bigger picture of this chapter. As already mentioned, ML is crucial to many applications. For example, in the medical imaging field, a tumor on a mammogram must be classified as malignant or benign. This is a prediction task, regardless of whether it is carried out by a radiologist or by computer-aided detection (CAD) software; in either case, the prediction is based on learning from previous mammograms. The features, i.e., predictors, in this case may be the size of the tumor, its density, various shape parameters, etc. The output, i.e., the response, is categorical and belongs to the set \(\mathcal {G} =\{benign,\ malignant\}\). There are so many such examples in biology and medicine that they almost form a field unto themselves, i.e., biostatistics. The task may be diagnostic, as in the mammographic example, or prognostic, where, for example, one estimates the probability of a second heart attack for a patient who has had a previous one. All of these examples involve a prediction step based on previous learning. A wide range of commercial and military applications arises in satellite imaging, where predictors can be measures from the image spectrum, and the response can be the type of land, crop, or vegetation imaged.
Some expressions and terminology of ML belong to some fields and applications more than to others. E.g., it is conventional in medical imaging to refer to \(e_{1}\) as the false negative fraction (FNF) and to \(e_{2}\) as the false positive fraction (FPF), because diseased patients typically produce a higher test output than non-diseased patients. For example, a patient belonging to class 1 whose test output is below the threshold will be called “test negative”, although the patient is in fact diseased. This is a false negative decision; hence the name FNF. The situation is reversed for the other error component.
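The naming convention can be made concrete with a small simulation; the scores, class separations, and threshold below are purely illustrative, not taken from the chapter:

```python
import numpy as np

# Diseased (class-1) cases tend to score higher, so a class-1 case scoring
# below the threshold is a "false negative".
rng = np.random.default_rng(4)
nondiseased = rng.normal(0.0, 1.0, size=1000)   # class-2 test outputs
diseased = rng.normal(1.5, 1.0, size=1000)      # class-1 test outputs
th = 0.75                                       # decision threshold

fnf = np.mean(diseased < th)      # e1: diseased cases called "test negative"
fpf = np.mean(nondiseased >= th)  # e2: non-diseased called "test positive"
print(f"FNF={fnf:.3f}, FPF={fpf:.3f}")
```

Raising the threshold trades one error component for the other: FNF grows while FPF shrinks, which is exactly the trade-off the ROC curve traces.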
The importance of the AUC is more natural and unquestioned in some applications than in others. The equivalence of the area under the empirical ROC curve and the Mann-Whitney-Wilcoxon statistic is the basis of its use in the assessment of diagnostic tests; see Hanley and McNeil [19]. Swets [29] recommended it as a natural summary measure of detection accuracy on the basis of signal-detection theory. Applications of this measure are widespread in the literature on both human diagnosis and computer-aided diagnosis in medical imaging [23]. In the field of machine learning, Bradley [2] recommended it as the preferred summary measure of accuracy when a single number is desired. These references also provide general background and access to the large literature on the subject.
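The equivalence of the area under the empirical ROC curve and the Mann-Whitney-Wilcoxon statistic is easy to verify numerically. The sketch below (illustrative code, with arbitrary score distributions) computes the area both ways, by the trapezoidal rule over the empirical ROC points and by the MWW average over all case pairs:

```python
import numpy as np

def mww_auc(s_neg, s_pos):
    # Mann-Whitney-Wilcoxon statistic: fraction of (positive, negative)
    # pairs correctly ordered, with half credit for ties.
    diff = np.subtract.outer(np.asarray(s_pos), np.asarray(s_neg))
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def trapezoid_auc(s_neg, s_pos):
    # Area under the empirical ROC curve obtained by sweeping the threshold
    # over all observed scores; points are (FPF, TPF) pairs plus (0, 0).
    ths = np.unique(np.concatenate([s_neg, s_pos]))
    pts = sorted([(np.mean(s_neg >= t), np.mean(s_pos >= t)) for t in ths]
                 + [(0.0, 0.0)])
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

rng = np.random.default_rng(3)
neg = rng.normal(0.0, 1.0, size=40)   # actually-negative scores
pos = rng.normal(1.0, 1.0, size=30)   # actually-positive scores
print(np.isclose(mww_auc(neg, pos), trapezoid_auc(neg, pos)))   # -> True
```

With continuous scores the empirical ROC is a staircase whose corner area equals the MWW statistic exactly, which is what the check confirms.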
Even the mistakes committed by practitioners are more obvious in some fields than in others. E.g., in DNA microarrays such mistakes are fatal and produce very fragile results, because of the very high dimensionality of the problem relative to the amount of available data. In such ill-posed applications, a more elaborate assessment phase should follow the design and construction phase.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Yousef, W.A. (2023). Machine Learning Assessment: Implications to Cybersecurity. In: Traore, I., Woungang, I., Saad, S. (eds) Artificial Intelligence for Cyber-Physical Systems Hardening. Engineering Cyber-Physical Systems and Critical Infrastructures, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-031-16237-4_3
Print ISBN: 978-3-031-16236-7
Online ISBN: 978-3-031-16237-4