
Machine Learning Assessment: Implications to Cybersecurity

Chapter in: Artificial Intelligence for Cyber-Physical Systems Hardening

Part of the book series: Engineering Cyber-Physical Systems and Critical Infrastructures (ECPSCI, volume 2)


Abstract

After discussing the construction of machine learning (ML) algorithms in the previous chapter, this chapter is dedicated to their assessment and performance estimation (with an emphasis on classification assessment), a topic that is equally important, especially in the context of cyber-physical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., the jackknife, the bootstrap, and cross validation (CV). Statistics of particular interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge of the probability distribution of the data or of the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods, which can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive because they rely on repeating the training and testing of an ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to traditional ML algorithms rather than the very computationally demanding approaches of recent deep neural networks (DNN). In the field of cyber-physical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence the relevance of this chapter to this field.
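As a minimal illustration of the resampling idea described above (retraining and retesting the classifier on each resampling iteration), the following sketch estimates an error rate by K-fold cross validation. It is a generic illustration, not the chapter's specific estimator: the nearest-class-mean classifier and the synthetic data are placeholder assumptions.

```python
import numpy as np

def kfold_error_rate(X, y, train_fn, predict_fn, K=10, seed=0):
    """K-fold CV estimate of the classification error rate:
    retrain on each fold's training portion, test on the held-out portion."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_fn(X[train], y[train])
        errors.append(np.mean(predict_fn(model, X[test]) != y[test]))
    return float(np.mean(errors))

# Placeholder classifier: nearest class mean (stands in for any ML algorithm).
def train_nearest_mean(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(model, X):
    classes = np.array(list(model))
    dists = np.stack([np.linalg.norm(X - m, axis=1) for m in model.values()])
    return classes[np.argmin(dists, axis=0)]

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.repeat([0, 1], 100)
print(kfold_error_rate(X, y, train_nearest_mean, predict_nearest_mean))
```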


References

  1. Barndorff-Nielsen OE, Cox DR (1989) Asymptotic techniques for use in statistics. Chapman and Hall, New York
  2. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145
  3. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International Group, Belmont
  4. Chen W, Gallas BD, Yousef WA (2012) Classifier variability: accounting for training and testing. Pattern Recogn 45(7):2661–2671
  5. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
  6. Efron B (1981) Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68(3):589–599
  7. Efron B (1982) The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia
  8. Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331
  9. Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
  10. Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9(3):586–596
  11. Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York
  12. Efron B, Tibshirani R (1995) Cross validation and the bootstrap: estimating the error rate of a prediction rule. Technical report 176, Department of Statistics, Stanford University
  13. Efron B, Tibshirani R (1997) Improvements on cross-validation: the \(.632+\) bootstrap method. J Am Stat Assoc 92(438):548–560
  14. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
  15. Hájek J, Šidák Z, Sen PK (1999) Theory of rank tests, 2nd edn. Academic Press, San Diego
  16. Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393
  17. Hampel FR (1986) Robust statistics: the approach based on influence functions. Wiley, New York
  18. Hanley JA (1989) Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 29(3):307–335
  19. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
  20. Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
  21. Huber PJ (1996) Robust statistical procedures, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia
  22. Jaeckel L (1972) The infinitesimal jackknife. Memorandum MM 72-1215-11, Bell Laboratories, Murray Hill, NJ
  23. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K (1999) Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 6(1):22–33
  24. Mallows C (1974) On some topics in robustness. Memorandum, Bell Laboratories, Murray Hill, NJ
  25. Randles RH, Wolfe DA (1979) Introduction to the theory of nonparametric statistics. Wiley, New York
  26. Sahiner B, Chan HP, Petrick N, Hadjiiski L, Paquerault S, Gurcan MN (2001) Resampling schemes for estimating the accuracy of a classifier designed with a limited data set. In: Medical Image Perception Conference IX, Airlie Conference Center, Warrenton, VA, 20–23
  27. Sahiner B, Chan HP, Hadjiiski L (2008) Classifier performance prediction for computer-aided diagnosis using a limited dataset. Med Phys 35(4):1559
  28. Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J Roy Stat Soc Ser B (Methodol) 36(2):111–147
  29. Swets JA (1986) Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychol Bull 99:100–117
  30. Yousef WA (2019) A leisurely look at versions and variants of the cross validation estimator. arXiv preprint arXiv:1907.13413
  31. Yousef WA (2021) Estimating the standard error of cross-validation-based estimators of classifier performance. Pattern Recogn Lett 146:115–145
  32. Yousef WA, Wagner RF, Loew MH (2004) Comparison of non-parametric methods for assessing classifier performance in terms of ROC parameters. In: Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop. IEEE Computer Society, pp 190–195
  33. Yousef WA, Wagner RF, Loew MH (2005) Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recogn Lett 26(16):2600–2610
  34. Yousef WA, Wagner RF, Loew MH (2006) Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach. IEEE Trans Pattern Anal Mach Intell 28(11):1809–1817
  35. Zhang P (1995) Assessing prediction error in nonparametric regression. Scand J Stat 22(1):83–94


Acknowledgements

The author is grateful to the U.S. Food and Drug Administration (FDA) for funding a very early stage of this chapter, and to Dr. Kyle Myers for her support. In his memory, special thanks and gratitude go to Dr. Robert F. Wagner, the supervisor and teacher, or Bob Wagner, the big brother and friend, who reviewed a very early version of this chapter before he passed away.

Author information


Correspondence to Waleed A. Yousef.


7. Appendix

7.1 Proofs

Lemma 1

The maximum likelihood estimate (MLE) of the probability mass function under a nonparametric distribution, given a sample of n observations \(t_{1},\ldots ,t_{n}\), is given by:

$$\begin{aligned} \hat{F}:\ \text {mass}~\frac{1}{n}~\text {on}\,~t_i,\quad i=1,\ldots ,n. \end{aligned}$$
(67)

Proof

The proof is carried out by maximizing the likelihood function \(l(f)=\prod \limits _{i=1}^{n}{p_{i}}\), which, under the constraint \(\sum _{i}p_{i}=1\), can be rewritten using a Lagrange multiplier as:

$$\begin{aligned} l(f)=\prod \limits _{i=1}^{n}{p_{i}}+\lambda \left( \sum \limits _{i=1}^{n}{p_{i}}-1\right) .\end{aligned}$$
(68)

The likelihood (68) is maximized by taking the first derivative and setting it to zero to obtain:

$$\begin{aligned} \frac{\partial l(f)}{\partial p_{j}}=\prod \limits _{i\ne j}{p_{i}}+\lambda \overset{set}{=}0,\quad j=1,\ldots ,n.\end{aligned}$$
(69)

These n equations along with the constraint \(\sum _{i}p_{i}=1\) can be solved straightforwardly to give \(\hat{p}_{i}=\frac{1}{n},\ i=1,\ldots ,n\), which completes the proof.      \(\square \)
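For readers who want a quick numerical sanity check of this result, the short sketch below (an illustrative aside, not part of the original proof) maximizes the log-likelihood \(\sum _{i}\log p_{i}\) subject to \(\sum _{i}p_{i}=1\) with a generic constrained optimizer and recovers \(\hat{p}_{i}=1/n\).

```python
import numpy as np
from scipy.optimize import minimize

n = 5  # any sample size illustrates the same point

# Maximize sum(log p_i) (equivalent to maximizing prod p_i)
# subject to sum(p_i) = 1 and p_i > 0.
res = minimize(
    fun=lambda p: -np.sum(np.log(p)),        # negative log-likelihood
    x0=np.linspace(0.1, 0.3, n),             # arbitrary starting point
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)
print(res.x)  # approximately [1/n, ..., 1/n] = [0.2, 0.2, 0.2, 0.2, 0.2]
```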

Lemma 2

The no-information \({\textrm{AUC}}\) is given by \(\gamma _{{\textrm{AUC}}} = 0.5\).

Proof

\(\gamma _{{\textrm{AUC}}}\), an analogue to the no-information error rate \(\gamma \), is given by (2a) but with TPF and FPF taken under the no-information distribution \(\text {E}_{0F}\) (see Sect. 3.3.4). Assume that there are \(n_{1}\) and \(n_2\) observations from classes \(\omega _{1}\) and \(\omega _{2}\), respectively, and that, for a fixed threshold th, the two quantities that define the error rate are \({\textrm{TPF}}\) and \({\textrm{FPF}}\). Assume also that the observations have been tested by the classifier, so each one has been assigned a decision value (score). Under the no-information distribution, consider the following construction. For every decision value \(h_{\textbf{t}}(x_{i})\) assigned to the observation \(t_{i}=(x_{i},y_{i})\), create \(n_{1}+n_{2}-1\) new observations, all with the same decision value \(h_{\textbf{t}}(x_{i})\) but with responses equal to the responses of the remaining \(n_{1}+n_{2}-1\) observations \(t_{j},\ j \ne i\). For this new sample, which consists of \((n_{1}+n_{2})^{2}\) observations, it is easy to see that the new TPF and FPF at the same threshold th are given by \({\textrm{FPF}}_{0\widehat{F},th}={\textrm{TPF}}_{0\widehat{F},th}= ({\textrm{TPF}}\cdot n_{1}+{\textrm{FPF}}\cdot n_{2})/(n_{1}+n_{2})\). This means that the ROC curve under the no-information distribution is the straight line with unit slope (the chance diagonal), which directly gives \(\gamma _{{\textrm{AUC}}} = 0.5\).
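The construction in this proof can be mimicked numerically: pair every decision value with every response, so both classes share the same score distribution, and compute the AUC in its Mann-Whitney form (ties counted as one half). A minimal sketch, with arbitrary placeholder scores and labels, is given below; it returns exactly 0.5.

```python
import numpy as np

def auc_mann_whitney(scores_neg, scores_pos):
    """Empirical AUC: Pr(positive score > negative score) + 0.5 * Pr(tie)."""
    s1 = np.asarray(scores_neg)[:, None]
    s2 = np.asarray(scores_pos)[None, :]
    return float(np.mean((s2 > s1) + 0.5 * (s2 == s1)))

rng = np.random.default_rng(0)
scores = rng.normal(size=30)              # arbitrary decision values
labels = np.array([1] * 10 + [2] * 20)    # n1 = 10, n2 = 20

# No-information sample: every score paired with every response,
# giving (n1 + n2)^2 observations with identical score distributions per class.
all_scores = np.repeat(scores, len(labels))
all_labels = np.tile(labels, len(scores))
print(auc_mann_whitney(all_scores[all_labels == 1],
                       all_scores[all_labels == 2]))   # 0.5
```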

7.2 More on Influence Function (IF)

Assume that there is a distribution G near the distribution F; then, under some regularity conditions (see, e.g., [21], Chap. 2), a functional s can be approximated as:

$$\begin{aligned} s(G)\approx s(F)+\int {IC_{s,F}(x)~dG(x)}.\end{aligned}$$
(70)

The residual error can be neglected since it is of small order in probability. Some properties of (70) are:

$$\begin{aligned} \int {IC_{T,F}(x)~dF(x)=0}, \end{aligned}$$
(71)

and the asymptotic variance of the statistic s under F, which follows from (71), is given by:

$$\begin{aligned} \text {Var}_{F}\,s\simeq \int {\left[ {IC_{T,F}(x)}\right] ^{2}~dF(x)}, \end{aligned}$$
(72)

which can be considered as an approximation to the variance under a distribution G near F. Now assume that the functional s is a functional statistic of the dataset \(\textbf{x}=\{x_{i}:x_{i}\sim F,\ i=1,\ldots ,n\}\). In that case the influence curve (23) is defined, for each sample case \(x_{i}\) under the true distribution F, as:

$$\begin{aligned} U_{i}(s,F)=\lim _{\varepsilon \rightarrow 0}\frac{s(F_{\varepsilon ,i})-s(F)}{\varepsilon }=\left. {\frac{\partial s(F_{\varepsilon ,i})}{\partial \varepsilon }}\right| _{\varepsilon =0}, \end{aligned}$$
(73)

where \(F_{\varepsilon ,i}\) is the distribution under the perturbation at observation \(x_{i}\). Equation (73) is called the IF. If the distribution F is not known, the MLE \(\hat{F}\) of F is given by (3), and as an approximation \(\hat{F}\) may be substituted for F in (73). The result is then called the empirical IF [24] or the infinitesimal jackknife [22]. Under this approximation, the perturbation defined in (22) can be rewritten as:

$$\begin{aligned} \hat{F}_{\varepsilon ,i}=(1-\varepsilon )\hat{F}+\varepsilon \delta _{x_{i}},~~x_{i}\in \textbf{x},~~i=1,\ldots ,n. \end{aligned}$$
(74)

This kind of perturbation is illustrated in Fig. 4.

Fig. 4 The new probability masses for the dataset \(\textbf{x}\) under a perturbation at sample case \(x_{i}\), obtained by letting the new probability at \(x_{i}\) exceed the new probability at any other case \(x_{j}\) by \(\varepsilon \)

It will often be useful to write the probability mass function of (74) as:

$$\begin{aligned} \hat{f}_{\varepsilon ,i}(x_{j})=\begin{cases} \frac{1-\varepsilon }{n}+\varepsilon , & j=i\\ \frac{1-\varepsilon }{n}, & j\ne i. \end{cases} \end{aligned}$$
(75)

A very interesting case arises from (75) if \(-1/(n-1)\) is substituted for \(\varepsilon \). In this case the new probability mass assigned to the point \(x_{j=i}\) in (75) becomes zero. This value of \(\varepsilon \) simply generates the jackknife estimate discussed in Sect. 2.2, where the whole observation is removed from the dataset.

Substituting \(\hat{F}\) for G in (70) and combining the result with (73) gives the IF approximation for any functional statistic under the empirical distribution \(\hat{F}\). The result is:

$$\begin{aligned} s(\hat{F})&=s(F)+\frac{1}{n}\sum \limits _{i=1}^{n}{U_{i}}(s,F)+O_{p}(n^{-1})\end{aligned}$$
(76a)
$$\begin{aligned}&\approx s(F)+\frac{1}{n}\sum \limits _{i=1}^{n}{U_{i}(s,F)}. \end{aligned}$$
(76b)

The term \(O_{p}(n^{-1})\) reads “big-O of order 1/n in probability”. In general, \(U_{n}=O_p(d_{n})\) if \(U_{n}/d_{n}\) is bounded in probability, i.e., for every \(\varepsilon >0\) there exists a constant \(k_{\varepsilon }\) such that \(\Pr \{|U_{n}|/d_{n}<k_{\varepsilon }\}>1-\varepsilon \). This concept can be found in [1, Chap. 2]. Then the asymptotic variance expressed in (72) can be written for the statistic s as:

$$\begin{aligned} \text {Var}_{F} s =\frac{1}{n}\text {E}_{F} U^{2}(x_{i},F), \end{aligned}$$
(77)

which can be approximated under the empirical distribution \(\hat{F}\) to give the nonparametric estimate of the variance for a statistic s by:

$$\begin{aligned} \widehat{\text {Var}}_{\hat{F}} s=\frac{1}{n^{2}}\sum \limits _{i=1}^{n}{U_{i}^{2}(x_{i},\hat{F})}. \end{aligned}$$
(78)
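To make (73)–(78) concrete, the sketch below numerically approximates the empirical influence values \(U_{i}\) of a simple statistic (the sample mean) by applying the perturbation (74)/(75) with a small \(\varepsilon \), then plugs them into the variance estimate (78). The helper names are illustrative assumptions; for the mean, the influence values reduce to \(x_{i}-\bar{x}\) and (78) reduces to the familiar \(\hat{\sigma }^{2}/n\).

```python
import numpy as np

def empirical_influence(statistic, x, eps=1e-6):
    """Finite-difference version of (73) under the perturbation (74)/(75):
    mass (1 - eps)/n on every point, plus eps on the perturbed point."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    base = statistic(x, np.full(n, 1.0 / n))
    U = np.empty(n)
    for i in range(n):
        w = np.full(n, (1.0 - eps) / n)
        w[i] += eps
        U[i] = (statistic(x, w) - base) / eps
    return U

def weighted_mean(x, w):
    return np.sum(w * x)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
U = empirical_influence(weighted_mean, x)

var_hat = np.sum(U**2) / len(x)**2               # nonparametric variance estimate (78)
print(np.allclose(U, x - x.mean(), atol=1e-4))   # IF of the mean is x_i - xbar
print(var_hat, x.var() / len(x))                 # agrees with sigma_hat^2 / n
```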

7.3 ML in Other Fields

In this section we provide brief miscellanea from other fields so the reader can see the bigger picture of this chapter. As already mentioned, ML is crucial to many applications. For example, in the medical imaging field, a tumor on a mammogram must be classified as malignant or benign. This is an example of prediction, regardless of whether it is done by a radiologist or by computer-aided detection (CAD) software. In either case, the prediction is based on learning from previous mammograms. The features, i.e., predictors, in this case may be the size of the tumor, its density, various shape parameters, etc. The output, i.e., response, is categorical and belongs to the set \(\mathcal {G} =\{benign,\ malignant\}\). There are so many such examples in biology and medicine that they almost constitute a field unto themselves, i.e., biostatistics. The task may be diagnostic, as in the mammographic example, or prognostic, where, for example, one estimates the probability of a second heart attack for a patient who has had a previous one. All of these examples involve a prediction step based on previous learning. A wide range of commercial and military applications arises in the field of satellite imaging. Predictors in this case can be measures from the image spectrum, while the response can be the type of land, crop, or vegetation of which the image was taken.

Some expressions and terminology of ML belong to some fields and applications more than others. For example, it is conventional in medical imaging to refer to \(e_{1}\) as the false negative fraction (FNF) and to \(e_{2}\) as the false positive fraction (FPF). This is because diseased patients typically have a higher test output value than non-diseased patients. A patient belonging to class 1 whose test output value is less than the threshold setting of the test will be called “test negative”, while the patient is in fact in the diseased class. This is a false negative decision; hence the name FNF. The situation is reversed for the other error component.
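A minimal sketch of this convention, assuming class 1 is the diseased class and that “test positive” means the score meets or exceeds the threshold (the scores and labels below are placeholders), is:

```python
import numpy as np

def fnf_fpf(scores, labels, threshold, diseased_class=1):
    """FNF (e1): fraction of diseased cases called 'test negative';
    FPF (e2): fraction of non-diseased cases called 'test positive'."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    diseased = labels == diseased_class
    fnf = np.mean(scores[diseased] < threshold)     # diseased, below threshold
    fpf = np.mean(scores[~diseased] >= threshold)   # non-diseased, above threshold
    return float(fnf), float(fpf)

# Illustrative scores; diseased cases (label 1) tend to score higher.
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6, 0.1])
labels = np.array([1,   1,   1,   1,   0,   0,   0,   0])
print(fnf_fpf(scores, labels, threshold=0.5))       # (0.25, 0.25)
```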

The importance of the AUC is more natural and unquestionable in some applications than in others. The equivalence of the area under the empirical ROC curve and the Mann-Whitney-Wilcoxon statistic is the basis of its use in the assessment of diagnostic tests; see Hanley and McNeil [19]. Swets [29] has recommended it as a natural summary measure of detection accuracy on the basis of signal-detection theory. Applications of this measure are widespread in the literature on both human diagnosis and computer-aided diagnosis in medical imaging [23]. In the field of machine learning, Bradley [2] has recommended it as the preferred summary measure of accuracy when a single number is desired. These references also provide general background and access to the large literature on the subject.
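The equivalence mentioned above can be checked numerically: the trapezoidal area under the empirical ROC curve coincides with the Mann-Whitney-Wilcoxon statistic computed from the same scores. The following is a minimal sketch on synthetic scores, an illustration rather than a reproduction of the cited references' experiments.

```python
import numpy as np

def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve, sweeping the threshold over all scores."""
    thresholds = np.unique(scores)[::-1]
    pos, neg = labels == 1, labels == 0
    tpf = np.array([0.0] + [np.mean(scores[pos] >= t) for t in thresholds])
    fpf = np.array([0.0] + [np.mean(scores[neg] >= t) for t in thresholds])
    return float(np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2))  # trapezoidal rule

def auc_mann_whitney(scores, labels):
    """Mann-Whitney-Wilcoxon form: Pr(pos > neg) + 0.5 * Pr(tie)."""
    s_pos = scores[labels == 1][:, None]
    s_neg = scores[labels == 0][None, :]
    return float(np.mean((s_pos > s_neg) + 0.5 * (s_pos == s_neg)))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
scores = rng.normal(loc=labels, scale=1.0)   # positives score higher on average
print(auc_trapezoid(scores, labels), auc_mann_whitney(scores, labels))  # equal
```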

Even the mistakes committed by some practitioners are more obvious in some fields than in others. For example, in DNA microarray applications such mistakes are fatal and produce very fragile results, because of the very high dimensionality of the problem relative to the amount of available data. A more elaborate assessment phase should follow the design and construction phase in such ill-posed applications.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Yousef, W.A. (2023). Machine Learning Assessment: Implications to Cybersecurity. In: Traore, I., Woungang, I., Saad, S. (eds) Artificial Intelligence for Cyber-Physical Systems Hardening. Engineering Cyber-Physical Systems and Critical Infrastructures, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-031-16237-4_3


  • DOI: https://doi.org/10.1007/978-3-031-16237-4_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16236-7

  • Online ISBN: 978-3-031-16237-4

