
Machine Learning Assessment: Implications to Cybersecurity

Chapter in: Artificial Intelligence for Cyber-Physical Systems Hardening

Part of the book series: Engineering Cyber-Physical Systems and Critical Infrastructures (ECPSCI, volume 2)


Abstract

After discussing the construction of machine learning (ML) algorithms in the previous chapter, this chapter is dedicated to their assessment and performance estimation (with an emphasis on classification assessment), a topic that is equally important, especially in the context of cyber-physical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., the jackknife, the bootstrap, and cross validation (CV). Statistics of particular interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge of the probability distribution of the data or of the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods, which can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive because they rely on repeating the training and testing of an ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to traditional ML algorithms rather than the very computationally demanding approaches of recent deep neural networks (DNN). In the field of cyber-physical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence the relevance of this chapter to this field.
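As a minimal illustration of the resampling idea described above (retraining and retesting the classifier on each resampling iteration), the following sketch estimates an error rate by K-fold cross validation. It is a generic illustration, not the chapter's specific estimator: the nearest-class-mean classifier and the synthetic data are placeholder assumptions.

```python
import numpy as np

def kfold_error_rate(X, y, train_fn, predict_fn, K=10, seed=0):
    """K-fold CV estimate of the classification error rate:
    retrain on each fold's training portion, test on the held-out portion."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_fn(X[train], y[train])
        errors.append(np.mean(predict_fn(model, X[test]) != y[test]))
    return float(np.mean(errors))

# Placeholder classifier: nearest class mean (stands in for any ML algorithm).
def train_nearest_mean(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(model, X):
    classes = np.array(list(model))
    dists = np.stack([np.linalg.norm(X - m, axis=1) for m in model.values()])
    return classes[np.argmin(dists, axis=0)]

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.repeat([0, 1], 100)
print(kfold_error_rate(X, y, train_nearest_mean, predict_nearest_mean))
```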


References

  1. Barndorff-Nielsen OE, Cox DR (1989) Asymptotic techniques for use in statistics. Chapman and Hall, New York
  2. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145
  3. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International Group, Belmont
  4. Chen W, Gallas BD, Yousef WA (2012) Classifier variability: accounting for training and testing. Pattern Recogn 45(7):2661–2671
  5. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
  6. Efron B (1981) Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68(3):589–599
  7. Efron B (1982) The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia
  8. Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331
  9. Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
  10. Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9(3):586–596
  11. Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York
  12. Efron B, Tibshirani R (1995) Cross validation and the bootstrap: estimating the error rate of a prediction rule. Technical report 176, Department of Statistics, Stanford University
  13. Efron B, Tibshirani R (1997) Improvements on cross-validation: the \(.632+\) bootstrap method. J Am Stat Assoc 92(438):548–560
  14. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
  15. Hájek J, Šidák Z, Sen PK (1999) Theory of rank tests, 2nd edn. Academic Press, San Diego
  16. Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393
  17. Hampel FR (1986) Robust statistics: the approach based on influence functions. Wiley, New York
  18. Hanley JA (1989) Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 29(3):307–335
  19. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
  20. Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
  21. Huber PJ (1996) Robust statistical procedures, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia
  22. Jaeckel L (1972) The infinitesimal jackknife. Memorandum MM 72-1215-11, Bell Laboratories, Murray Hill, NJ
  23. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K (1999) Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 6(1):22–33
  24. Mallows C (1974) On some topics in robustness. Memorandum, Bell Laboratories, Murray Hill, NJ
  25. Randles RH, Wolfe DA (1979) Introduction to the theory of nonparametric statistics. Wiley, New York
  26. Sahiner B, Chan HP, Petrick N, Hadjiiski L, Paquerault S, Gurcan MN (2001) Resampling schemes for estimating the accuracy of a classifier designed with a limited data set. In: Medical Image Perception Conference IX, Airlie Conference Center, Warrenton, VA, 20–23
  27. Sahiner B, Chan HP, Hadjiiski L (2008) Classifier performance prediction for computer-aided diagnosis using a limited dataset. Med Phys 35(4):1559
  28. Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J Roy Stat Soc Ser B (Methodol) 36(2):111–147
  29. Swets JA (1986) Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychol Bull 99:100–117
  30. Yousef WA (2019) A leisurely look at versions and variants of the cross validation estimator. arXiv preprint arXiv:1907.13413
  31. Yousef WA (2021) Estimating the standard error of cross-validation-based estimators of classifier performance. Pattern Recogn Lett 146:115–145
  32. Yousef WA, Wagner RF, Loew MH (2004) Comparison of non-parametric methods for assessing classifier performance in terms of ROC parameters. In: Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop. IEEE Computer Society, pp 190–195
  33. Yousef WA, Wagner RF, Loew MH (2005) Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recogn Lett 26(16):2600–2610
  34. Yousef WA, Wagner RF, Loew MH (2006) Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach. IEEE Trans Pattern Anal Mach Intell 28(11):1809–1817
  35. Zhang P (1995) Assessing prediction error in nonparametric regression. Scand J Stat 22(1):83–94


Acknowledgements

The author is grateful to the U.S. Food and Drug Administration (FDA) for funding a very early stage of this chapter, and to Dr. Kyle Myers for her support. In his memory, special thanks and gratitude go to Dr. Robert F. Wagner, the supervisor and teacher, or Bob Wagner, the big brother and friend, who reviewed a very early version of this chapter before he passed away.

Author information


Correspondence to Waleed A. Yousef.


7. Appendix

7.1 Proofs

Lemma 1

The maximum likelihood estimate (MLE) of the probability mass function under a nonparametric distribution, given a sample of n observations \(t_{1},\ldots ,t_{n}\), is given by:

$$\begin{aligned} \hat{F}:\ \text {mass}~\frac{1}{n}~\text {on}\,~t_i,\quad i=1,\ldots ,n. \end{aligned}$$
(67)

Proof

The proof is carried out by maximizing the likelihood function \(l(f)=\prod \limits _{i=1}^{n}{p_{i}}\), which, under the constraint \(\sum _{i}p_{i}=1\), can be rewritten using a Lagrange multiplier as:

$$\begin{aligned} l(f)=\prod \limits _{i=1}^{n}{p_{i}}+\lambda \left( \sum \limits _{i=1}^{n}{p_{i}}-1\right) .\end{aligned}$$
(68)

The likelihood (68) is maximized by taking the first derivative and setting it to zero to obtain:

$$\begin{aligned} \frac{\partial l(f)}{\partial p_{j}}=\prod \limits _{i\ne j}{p_{i}}+\lambda \overset{set}{=}0,\quad j=1,\ldots ,n.\end{aligned}$$
(69)

These n equations along with the constraint \(\sum _{i}p_{i}=1\) can be solved straightforwardly to give \(\hat{p}_{i}=\frac{1}{n},\ i=1,\ldots ,n\), which completes the proof.      \(\square \)
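For readers who want a quick numerical sanity check of this result, the short sketch below (an illustrative aside, not part of the original proof) maximizes the log-likelihood \(\sum _{i}\log p_{i}\) subject to \(\sum _{i}p_{i}=1\) with a generic constrained optimizer and recovers \(\hat{p}_{i}=1/n\).

```python
import numpy as np
from scipy.optimize import minimize

n = 5  # any sample size illustrates the same point

# Maximize sum(log p_i) (equivalent to maximizing prod p_i)
# subject to sum(p_i) = 1 and p_i > 0.
res = minimize(
    fun=lambda p: -np.sum(np.log(p)),        # negative log-likelihood
    x0=np.linspace(0.1, 0.3, n),             # arbitrary starting point
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)
print(res.x)  # approximately [1/n, ..., 1/n] = [0.2, 0.2, 0.2, 0.2, 0.2]
```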

Lemma 2

The no-information \({\textrm{AUC}}\) is given by \(\gamma _{{\textrm{AUC}}} = 0.5\).

Proof

\(\gamma _{{\textrm{AUC}}}\), an analogue to the no-information error rate \(\gamma \), is given by (2a) but with TPF and FPF taken under the no-information distribution \(\text {E}_{0F}\) (see Sect. 3.3.4). Assume that there are \(n_{1}\) and \(n_2\) observations from classes \(\omega _{1}\) and \(\omega _{2}\), respectively, and that, for a fixed threshold th, the two quantities that define the error rate are \({\textrm{TPF}}\) and \({\textrm{FPF}}\). Assume also that the observations have been tested by the classifier, so each one has been assigned a decision value (score). Under the no-information distribution, consider the following construction. For every decision value \(h_{\textbf{t}}(x_{i})\) assigned to the observation \(t_{i}=(x_{i},y_{i})\), create \(n_{1}+n_{2}-1\) new observations, all with the same decision value \(h_{\textbf{t}}(x_{i})\) but with responses equal to the responses of the remaining \(n_{1}+n_{2}-1\) observations \(t_{j},\ j \ne i\). For this new sample, which consists of \((n_{1}+n_{2})^{2}\) observations, it is easy to see that the new TPF and FPF at the same threshold th are given by \({\textrm{FPF}}_{0\widehat{F},th}={\textrm{TPF}}_{0\widehat{F},th}= ({\textrm{TPF}}\cdot n_{1}+{\textrm{FPF}}\cdot n_{2})/(n_{1}+n_{2})\). This means that the ROC curve under the no-information distribution is the straight line with unit slope (the chance diagonal), which directly gives \(\gamma _{{\textrm{AUC}}} = 0.5\).
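The construction in this proof can be mimicked numerically: pair every decision value with every response, so both classes share the same score distribution, and compute the AUC in its Mann-Whitney form (ties counted as one half). A minimal sketch, with arbitrary placeholder scores and labels, is given below; it returns exactly 0.5.

```python
import numpy as np

def auc_mann_whitney(scores_neg, scores_pos):
    """Empirical AUC: Pr(positive score > negative score) + 0.5 * Pr(tie)."""
    s1 = np.asarray(scores_neg)[:, None]
    s2 = np.asarray(scores_pos)[None, :]
    return float(np.mean((s2 > s1) + 0.5 * (s2 == s1)))

rng = np.random.default_rng(0)
scores = rng.normal(size=30)              # arbitrary decision values
labels = np.array([1] * 10 + [2] * 20)    # n1 = 10, n2 = 20

# No-information sample: every score paired with every response,
# giving (n1 + n2)^2 observations with identical score distributions per class.
all_scores = np.repeat(scores, len(labels))
all_labels = np.tile(labels, len(scores))
print(auc_mann_whitney(all_scores[all_labels == 1],
                       all_scores[all_labels == 2]))   # 0.5
```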

7.2 More on Influence Function (IF)

Assume that there is a distribution G near the distribution F; then, under some regularity conditions (see, e.g., [21], Chap. 2), a functional s can be approximated as:

$$\begin{aligned} s(G)\approx s(F)+\int {IC_{s,F}(x)~dG(x)}.\end{aligned}$$
(70)

The residual error can be neglected since it is of small order in probability. Some properties of (70) are:

$$\begin{aligned} \int {IC_{T,F}(x)~dF(x)=0}, \end{aligned}$$
(71)

and the asymptotic variance of the statistic s under F, which follows from (71), is given by:

$$\begin{aligned} \text {Var}_{F}\,s\simeq \int {\left[ {IC_{T,F}(x)}\right] ^{2}~dF(x)}, \end{aligned}$$
(72)

which can be considered as an approximation to the variance under a distribution G near F. Now assume that the functional s is a functional statistic of the dataset \(\textbf{x}=\{x_{i}:x_{i}\sim F,\ i=1,\ldots ,n\}\). In that case the influence curve (23) is defined, for each sample case \(x_{i}\) under the true distribution F, as:

$$\begin{aligned} U_{i}(s,F)=\lim _{\varepsilon \rightarrow 0}\frac{s(F_{\varepsilon ,i})-s(F)}{\varepsilon }=\left. {\frac{\partial s(F_{\varepsilon ,i})}{\partial \varepsilon }}\right| _{\varepsilon =0}, \end{aligned}$$
(73)

where \(F_{\varepsilon ,i}\) is the distribution under the perturbation at observation \(x_{i}\). Equation (73) is called the IF. If the distribution F is not known, the MLE \(\hat{F}\) of F is given by (3), and as an approximation \(\hat{F}\) may be substituted for F in (73). The result is then called the empirical IF [24] or the infinitesimal jackknife [22]. Under this approximation, the perturbation defined in (22) can be rewritten as:

$$\begin{aligned} \hat{F}_{\varepsilon ,i}=(1-\varepsilon )\hat{F}+\varepsilon \delta _{x_{i}},~~x_{i}\in \textbf{x},~~i=1,\ldots ,n. \end{aligned}$$
(74)

This kind of perturbation is illustrated in Fig. 4.

Fig. 4 The new probability masses for the dataset \(\textbf{x}\) under a perturbation at sample case \(x_{i}\), obtained by letting the new probability at \(x_{i}\) exceed the new probability at any other case \(x_{j}\) by \(\varepsilon \)

It will often be useful to write the probability mass function of (74) as:

$$\begin{aligned} \hat{f}_{\varepsilon ,i}(x_{j})=\begin{cases} \frac{1-\varepsilon }{n}+\varepsilon , & j=i\\ \frac{1-\varepsilon }{n}, & j\ne i. \end{cases} \end{aligned}$$
(75)

A very interesting case arises from (75) if \(-1/(n-1)\) is substituted for \(\varepsilon \). In this case the new probability mass assigned to the point \(x_{j=i}\) in (75) becomes zero. This value of \(\varepsilon \) simply generates the jackknife estimate discussed in Sect. 2.2, where the whole observation is removed from the dataset.

Substituting \(\hat{F}\) for G in (70) and combining the result with (73) gives the IF approximation for any functional statistic under the empirical distribution \(\hat{F}\). The result is:

$$\begin{aligned} s(\hat{F})&=s(F)+\frac{1}{n}\sum \limits _{i=1}^{n}{U_{i}}(s,F)+O_{p}(n^{-1})\end{aligned}$$
(76a)
$$\begin{aligned}&\approx s(F)+\frac{1}{n}\sum \limits _{i=1}^{n}{U_{i}(s,F)}. \end{aligned}$$
(76b)

The term \(O_{p}(n^{-1})\) reads “big-O of order 1/n in probability”. In general, \(U_{n}=O_p(d_{n})\) if \(U_{n}/d_{n}\) is bounded in probability, i.e., for every \(\varepsilon >0\) there exists a constant \(k_{\varepsilon }\) such that \(\Pr \{|U_{n}|/d_{n}<k_{\varepsilon }\}>1-\varepsilon \). This concept can be found in [1, Chap. 2]. Then the asymptotic variance expressed in (72) can be written for the statistic s as:

$$\begin{aligned} \text {Var}_{F} s =\frac{1}{n}\text {E}_{F} U^{2}(x_{i},F), \end{aligned}$$
(77)

which can be approximated under the empirical distribution \(\hat{F}\) to give the nonparametric estimate of the variance for a statistic s by:

$$\begin{aligned} \widehat{\text {Var}}_{\hat{F}} s=\frac{1}{n^{2}}\sum \limits _{i=1}^{n}{U_{i}^{2}(x_{i},\hat{F})}. \end{aligned}$$
(78)
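To make (73)–(78) concrete, the sketch below numerically approximates the empirical influence values \(U_{i}\) of a simple statistic (the sample mean) by applying the perturbation (74)/(75) with a small \(\varepsilon \), then plugs them into the variance estimate (78). The helper names are illustrative assumptions; for the mean, the influence values reduce to \(x_{i}-\bar{x}\) and (78) reduces to the familiar \(\hat{\sigma }^{2}/n\).

```python
import numpy as np

def empirical_influence(statistic, x, eps=1e-6):
    """Finite-difference version of (73) under the perturbation (74)/(75):
    mass (1 - eps)/n on every point, plus eps on the perturbed point."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    base = statistic(x, np.full(n, 1.0 / n))
    U = np.empty(n)
    for i in range(n):
        w = np.full(n, (1.0 - eps) / n)
        w[i] += eps
        U[i] = (statistic(x, w) - base) / eps
    return U

def weighted_mean(x, w):
    return np.sum(w * x)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
U = empirical_influence(weighted_mean, x)

var_hat = np.sum(U**2) / len(x)**2               # nonparametric variance estimate (78)
print(np.allclose(U, x - x.mean(), atol=1e-4))   # IF of the mean is x_i - xbar
print(var_hat, x.var() / len(x))                 # agrees with sigma_hat^2 / n
```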

7.3 ML in Other Fields

In this section we provide brief miscellanea from other fields so the reader can see the bigger picture of this chapter. As already mentioned, ML is crucial to many applications. For example, in the medical imaging field, a tumor on a mammogram must be classified as malignant or benign. This is an example of prediction, regardless of whether it is done by a radiologist or by computer-aided detection (CAD) software. In either case, the prediction is based on learning from previous mammograms. The features, i.e., predictors, in this case may be the size of the tumor, its density, various shape parameters, etc. The output, i.e., response, is categorical and belongs to the set \(\mathcal {G} =\{benign,\ malignant\}\). There are so many such examples in biology and medicine that they almost constitute a field unto themselves, i.e., biostatistics. The task may be diagnostic, as in the mammographic example, or prognostic, where, for example, one estimates the probability of a second heart attack for a patient who has had a previous one. All of these examples involve a prediction step based on previous learning. A wide range of commercial and military applications arises in the field of satellite imaging. Predictors in this case can be measures from the image spectrum, while the response can be the type of land, crop, or vegetation of which the image was taken.

Some expressions and terminology of ML belong to some fields and applications more than others. For example, it is conventional in medical imaging to refer to \(e_{1}\) as the false negative fraction (FNF) and to \(e_{2}\) as the false positive fraction (FPF). This is because diseased patients typically have a higher test output value than non-diseased patients. A patient belonging to class 1 whose test output value is less than the threshold setting of the test will be called “test negative”, while the patient is in fact in the diseased class. This is a false negative decision; hence the name FNF. The situation is reversed for the other error component.
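A minimal sketch of this convention, assuming class 1 is the diseased class and that “test positive” means the score meets or exceeds the threshold (the scores and labels below are placeholders), is:

```python
import numpy as np

def fnf_fpf(scores, labels, threshold, diseased_class=1):
    """FNF (e1): fraction of diseased cases called 'test negative';
    FPF (e2): fraction of non-diseased cases called 'test positive'."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    diseased = labels == diseased_class
    fnf = np.mean(scores[diseased] < threshold)     # diseased, below threshold
    fpf = np.mean(scores[~diseased] >= threshold)   # non-diseased, above threshold
    return float(fnf), float(fpf)

# Illustrative scores; diseased cases (label 1) tend to score higher.
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6, 0.1])
labels = np.array([1,   1,   1,   1,   0,   0,   0,   0])
print(fnf_fpf(scores, labels, threshold=0.5))       # (0.25, 0.25)
```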

The importance of the AUC is more natural and unquestionable in some applications than in others. The equivalence of the area under the empirical ROC curve and the Mann-Whitney-Wilcoxon statistic is the basis of its use in the assessment of diagnostic tests; see Hanley and McNeil [19]. Swets [29] has recommended it as a natural summary measure of detection accuracy on the basis of signal-detection theory. Applications of this measure are widespread in the literature on both human diagnosis and computer-aided diagnosis in medical imaging [23]. In the field of machine learning, Bradley [2] has recommended it as the preferred summary measure of accuracy when a single number is desired. These references also provide general background and access to the large literature on the subject.
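The equivalence mentioned above can be checked numerically: the trapezoidal area under the empirical ROC curve coincides with the Mann-Whitney-Wilcoxon statistic computed from the same scores. The following is a minimal sketch on synthetic scores, an illustration rather than a reproduction of the cited references' experiments.

```python
import numpy as np

def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve, sweeping the threshold over all scores."""
    thresholds = np.unique(scores)[::-1]
    pos, neg = labels == 1, labels == 0
    tpf = np.array([0.0] + [np.mean(scores[pos] >= t) for t in thresholds])
    fpf = np.array([0.0] + [np.mean(scores[neg] >= t) for t in thresholds])
    return float(np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2))  # trapezoidal rule

def auc_mann_whitney(scores, labels):
    """Mann-Whitney-Wilcoxon form: Pr(pos > neg) + 0.5 * Pr(tie)."""
    s_pos = scores[labels == 1][:, None]
    s_neg = scores[labels == 0][None, :]
    return float(np.mean((s_pos > s_neg) + 0.5 * (s_pos == s_neg)))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
scores = rng.normal(loc=labels, scale=1.0)   # positives score higher on average
print(auc_trapezoid(scores, labels), auc_mann_whitney(scores, labels))  # equal
```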

Even the mistakes committed by some practitioners are more obvious in some fields than in others. For example, in DNA microarray applications such mistakes are fatal and produce very fragile results, because of the very high dimensionality of the problem relative to the amount of available data. A more elaborate assessment phase should follow the design and construction phase in such ill-posed applications.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Yousef, W.A. (2023). Machine Learning Assessment: Implications to Cybersecurity. In: Traore, I., Woungang, I., Saad, S. (eds) Artificial Intelligence for Cyber-Physical Systems Hardening. Engineering Cyber-Physical Systems and Critical Infrastructures, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-031-16237-4_3


  • DOI: https://doi.org/10.1007/978-3-031-16237-4_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16236-7

  • Online ISBN: 978-3-031-16237-4

