1 Introduction

As the world becomes more and more digitalized, powerful security precautions are required to make public and private infrastructures more resilient to a broad range of cyber-threats (e.g., network intrusions and malware). During the last decade, the cybersecurity literature has assigned a prominent role to deep learning as a powerful learning paradigm to detect ever-evolving cyber-threats in modern security systems. In particular, deep learning methods have made classical cybersecurity approaches under-perform, since trainable multi-layer networks achieve higher feature representation capabilities than the sophisticated hand-engineered features or rules constructed by classical cybersecurity approaches.

Recent cybersecurity studies have shown that deep learning performance can be further strengthened with ensemble learning systems (Tama & Lim, 2021). Ensemble learning combines several individual models to obtain better generalization, by reducing the dispersion of the predictions of single models and gaining model accuracy. Ensembles are particularly beneficial in deep learning, since the selection of the weights of a deep neural network can be seen as an optimization problem with several local minima (Ganaie et al., 2022). As different models converge to different minima, individual neural models may commit generalization errors on different subsets of the input space. On the other hand, selecting the ensemble members based only on the local model accuracy may produce an excessively large ensemble, since the performance of the ensemble system may not be significantly improved by some of the selected models (Lv et al., 2022). In particular, the larger the number of ensemble members, the higher the risk of local maxima (Lv et al., 2022). Hence, several researchers support increasing the diversity among the individual models of deep ensembles, in addition to the accuracy of the individual models, in order to learn diverse characteristics of training data (Dong et al., 2020). In deep ensemble learning, training independent neural models with different training samples or features may generate diversity (Guo et al., 2018), while ensemble pruning can be helpful to search for a good subset of ensemble members (Zhang et al., 2006).

In any case, neural models (even when enhanced with ensemble learning) cannot be satisfactorily used in cybersecurity as long as attackers know how to take advantage of deep learning to enter victims’ systems that have already been secured with neural models. In addition, neural models are commonly trained as complex opaque models, so that the explanation of neural model decisions is often difficult due to the complexity of the neural network architecture. On the other hand, decision explanations can provide insights into which features are more relevant for the detection of cyber-threats, thus providing useful information to improve the accuracy of cybersecurity systems against evolving cyber-attacks (Andresini et al., 2022).

The present study is mainly inspired by the recent interest in bridging the gap between adversarial learning and eXplainable AI (XAI) in cybersecurity. We formulate a new cyber-threat detection method, called PANACEA (exPlAinability-based eNsemble Adversarial training for Cyber-thrEAt detection), which fuses together information synthesized through ensemble learning by accounting for both adversarial samples and explanations in the training stage. The proposed ensemble system is trained to address cyber-threat detection problems.

A contribution of this study is the evaluation of the effectiveness of a neural model ensemble system that fuses together base neural models trained through the adversarial training strategy. We resort to the adversarial training strategy, in order to account for different adversarial samples produced with the correct class targets in the training stage of each base neural model. For this purpose, we first use the state-of-the-art FGSM algorithm (Goodfellow et al., 2015) to generate adversarial samples for an initial neural model by perturbing its training samples. Subsequently, we learn each contributing neural model from the initial training set augmented with a random subset of the produced training adversarial samples. This introduces some randomness to make diverse ensemble members. Notice that this method differs from applying the adversarial ensemble training approach (Tramer et al., 2018), where different adversarial samples are generated from various neural models.

Another contribution of this study is the use of the explanations produced for base neural model decisions to search for a good subset of ensemble members. We adopt an ensemble pruning strategy that aims to increase the diversity among the base models combined in the ensemble system. In fact, despite the use of adversarial training with random subsets of adversarial samples, the diversity of the base models remains a crucial factor for improving the classification performance of the ensemble system (Dong et al., 2020). In the majority of the state-of-the-art ensemble learning literature, the diversity of the member models of an ensemble system has been commonly measured by the inconsistency of prediction results (Bolón-Canedo & Alonso-Betanzos, 2019; Tsymbal et al., 2005). Instead, we investigate how XAI may be used to select base neural models to combine through an ensemble system. In particular, we present an XAI-based formulation of the concept of model diversity, which measures the diversity of ensemble members in terms of the different effects of input features on the accuracy of decisions yielded by ensemble base models. Specifically, we apply the DALEX (Biecek, 2018) framework to explain the global feature importance in neural models. In addition, we adopt a combination of XAI and clustering to select ensemble base models that achieve high explanation diversity. The idea of using clustering is grounded in various studies (e.g., Giacinto et al. (2000) and Bakker and Heskes (2003)) that use a clustering algorithm on the predictions of individual classifiers to group classifiers with low prediction diversity into a number of clusters. In our study, we perform a clustering step in combination with the proposed XAI-based formulation of the concept of model diversity.

An additional contribution is the improvement of the performance of the ensemble system by using a multi-headed neural network architecture. This architecture simultaneously fine-tunes the base neural models selected through clustering, taking advantage of a back-propagation strategy to share knowledge among the multiple base models incorporated as sub-network blocks in the ensemble system. In this fusion schema, the model parameters are shared, reducing the risk of overfitting (Ganaie et al., 2022).

In short, this paper provides the following contributions:

  • The definition of a neural method for cyber-threat detection applications, which integrates adversarial training and ensemble learning.

  • The formulation of an XAI-based approach to increase the ensemble diversity measured with respect to the effect of the input features on the accuracy of the base neural models combined in the ensemble system.

  • The definition of a multi-headed fusion neural network that embeds base neural models as sub-networks and learns how to best combine the predictions from each input sub-neural model with the individual model parameters shared during the fusion.

  • The illustration of an experimental evaluation that explores the performance of the proposed method in several cyber-threat detection problems, and demonstrates the ability of our method to recognize cyber-threats better than the initial model, as well as to achieve better accuracy than several competitive approaches in the recent cybersecurity literature, in which model diversity is generally measured by the inconsistency of prediction results.

This paper is organized as follows. Motivations of the proposed method are illustrated in Sect. 2. Related works are presented in Sect. 3. The proposed method is described in Sect. 4. Section 5 reports the analysis of the time complexity of the proposed method. The experimental setup and the results of the evaluation of the proposed method are discussed in Sect. 6. Finally, Sect. 7 refocuses on the purpose of our research, draws conclusions and illustrates possible future developments.

2 Motivations

The proposed XAI-based approach is mainly founded on the idea that different sub-areas of the input feature space can be equally relevant to achieve a correct decision for multiple samples produced in the same situation. Hence, an accurate ensemble system may be produced through the fusion of base models whose decisions give more importance to different sub-areas of the input feature space. For this purpose, we use the XAI framework DALEX, which measures the global relevance of each feature by comparing the classification error before and after permuting the values of that feature.

Our idea is inspired by the feature selection ensemble methodology, which tries to improve the classification performance of ensembles by combining diverse, individually generated feature sub-spaces (Bolón-Canedo & Alonso-Betanzos, 2019). However, creating a feature selection ensemble poses the problem of deciding which feature selector to apply and how many features to select for each base model. In this study, we overcome this problem by considering all the features (without any feature selection) for learning various classification model candidates with adversarial training. The neural models trained through adversarial training are expected to be more robust than the original model used to generate the adversarial samples (Szegedy et al., 2014). In addition, we use explanation information to select ensemble members that give more importance to different sub-areas of the input feature space. This study also differs from traditional ensemble learning approaches, such as random forests, which use a random feature sub-spacing mechanism (randomisation) (Ho, 1998) to select different random sets of features when generating candidate splitting nodes. The use of randomisation increases the diversity among the trees of a random forest without a step to prune tree candidates. Instead, the approach described in this study trains an ensemble of deep neural models with inner layers that learn an embedding representation of the input data. As the original input features are no longer available for being randomised at the inner layers, we resort to a candidate pruning mechanism to enhance the ensemble diversity. For the pruning mechanism, we leverage the explanation information produced by DALEX.

With regard to the accuracy performance of the neural models selected for the fusion in the ensemble, we are aware that several researchers agree that fusing accurate ensemble members is a mandatory requirement, in order to be able to gain accuracy with the ensemble system. In various studies, ensemble members are selected by accounting for the accuracy of model candidates (Puuronen & Tsymbal, 2001; Tsymbal et al., 2005). In this study, we forgo an explicit step to verify and ensure the accuracy of individual ensemble members. However, we implicitly account for an accuracy performance evaluation, as DALEX performs an accuracy-based measurement of the feature importance. Thereby, our pruning-based strategy accounts for the effect of input features on the accuracy of the model decisions, in place of the accuracy of the model decisions in itself.

Motivations for adopting this strategy are mainly founded on the peculiarities of network intrusion detection problems, which are the cyber-threat detection problems that we mainly address in this study. In network intrusion detection problems, samples of different attack families commonly have signatures involving different features. For example, as illustrated by Andresini et al. (2022), “the time between the SYN ACK and the ACK response” is relevant for detecting shellcode intrusions, while it becomes less important when detecting other types of attacks. Shellcode, in fact, is an exploiting attack in which the attacker injects a piece of code from a shell to control a target machine using the standard TCP/IP socket connections. As reported by Wang et al. (2020), “the number of wrong fragments/packets sent between hosts” is a relevant characteristic for recognizing Pod (Ping of Death) attacks (a subcategory of DoS intrusions), while it becomes less important when detecting other types of attacks. In fact, in Pod attacks, the target is flooded with a high number of malformed or oversized packets, which are used to crash the target. On the other hand, “the count of the number of packets from source to destination” conveys relevant information for detecting worm attacks, which are self-replicating computer programs that spread automatically and can flood the Internet in a very short time (Chen et al., 2003).

Based upon these studies, our point of view is that being able to fuse deep neural models that give relevance to different network traffic feature signatures (and, consequently, input feature sub-spaces) may help in improving the accuracy of a multi-class deep neural ensemble trained to recognize different cyber-attack patterns, such as various categories of network traffic intrusions. According to this view, a model that appropriately maps the feature signature of a specific network intrusion class may not necessarily achieve the highest individual overall accuracy on all intrusion classes; in any case, its inclusion in the fusion schema may foster the final ensemble performance. So, based on these peculiar characteristics of network traffic data, we propose to use an ensemble system that combines diverse model candidates that potentially map the feature signatures of multiple attack classes. In addition, we prune ensemble model candidates according to the information enclosed in the explanation of how each feature affects the individual model accuracy. This is to improve the possibility of selecting ensemble models that give importance to diverse cyber-threat patterns and to boost the possibility of learning an ensemble system that is able to gain accuracy on different categories of cyber-attacks. Our argument is mainly supported by experiments performed with three benchmark network intrusion detection datasets, namely NSL-KDD (Tavallaee et al., 2009), UNSW-NB15 (Moustafa & Slay, 2015) and CICIDS17 (Engelen et al., 2021). These datasets comprise multiple real categories of network traffic intrusions (including rare attacks). In addition, to explore the adaptability of the proposed method to other cyber-threat detection problems, we also evaluate the effectiveness of the proposed method on a benchmark malware detection problem, namely CICMalDroid20 (Mahdavifar et al., 2022), since we expect that, similarly to network traffic intrusions, different malware categories may have diverse feature signatures. With reference to these four cybersecurity problems, we empirically explore the effectiveness of our idea of accounting for candidate decision explanations to prune the neural model candidates that have been learned to capture the expected variety and multiplicity of cyber-threat signatures. To this aim, we show empirically that, in the considered cybersecurity problems, the proposed DALEX-based ensemble pruning strategy outperforms traditional ensemble pruning strategies, such as the strategy described by Puuronen and Tsymbal (2001), which selects the candidates with the highest accuracy, and the strategy described by Tsymbal et al. (2005), which selects candidates accounting for the inconsistency of predictions. In addition, we show that even using accuracy in addition to explainability to select ensemble candidates does not necessarily lead to better final ensemble models.

3 Related work

We focus this literature overview on recent studies applying deep learning with adversarial learning, ensemble learning and XAI in cybersecurity.

3.1 Adversarial learning

Adversarial learning has recently attracted great attention in cybersecurity, where the majority of studies have investigated the offensive perspective. Offensive studies explore how to inject slight perturbations (adversarial samples) into input samples, to evade classification models by increasing their mis-classification rate. Various adversarial sample generators have been formulated in the last decade [see (Liang et al., 2022) for a survey]. In particular, the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) is one of the most popular adversarial sample generators, although it is prone to catastrophic overfitting (Andriushchenko & Flammarion, 2020). This is a white-box gradient-based method that uses the gradient of the loss (e.g., the cross-entropy) with respect to an input sample to craft a perturbation that makes the decisions of neural models less robust for a specific class. Projected gradient descent (PGD) (Madry et al., 2018) extends FGSM by performing an iterative version of it. A different approach, named DeepFool, performs an iterative procedure to find the minimum adversarial perturbations on both an affine binary classifier and a general binary differentiable classifier. It integrates the one-versus-all strategy to be applied to multi-class problems. LowProFool (Ballet et al., 2019) has been recently defined to generate imperceptible adversarial samples on tabular data. It uses the gradient data to guide the perturbation towards a target class in an iterative manner. At the same time, it penalizes the perturbation proportionally to the feature importance associated with the features. PGD, DeepFool and LowProFool all spend more training time than FGSM, since they perform multiple trials to generate perturbations. Although FGSM, PGD and DeepFool were originally formulated for imagery data, where they are state-of-the-art baselines for gradient-based methods, they are also used for tabular data. For example, Khamis and Matrawy (2020) describe several studies that use FGSM, PGD and DeepFool on tabular data extracted in network intrusion detection problems. Recently, Xu et al. (2023) have explored the problem of generating adversarial samples of categorical data. Specifically, they transfer the categorical adversarial attacks in the discrete space to an optimization problem in a continuous probabilistic space. Thus, they are able to apply gradient-based methods to find adversarial categorical samples.

Meanwhile, defensive studies formulate various defense strategies. Adversarial training is a well-known defence strategy (Szegedy et al., 2014) that accounts for adversarial samples with the correct class in the training stage, in order to train a new classification model that is more robust than the attacked model based on given metrics. Andriushchenko and Flammarion (2020) show that adversarial training methods are commonly appreciated by practitioners, since they increase the empirical robustness of classification models by scaling to deep neural networks and performing equally well for different attack models. This preference motivates the focus of this study on adversarial training. Under the umbrella of adversarial training, Wang et al. (2021) test adversarial training with generative adversarial networks, in order to improve the performance of classification models trained for malware detection. Andresini et al. (2021a) generate adversarial samples to expand and balance the training set of a CNN-based model and increase the proportion of malicious samples processed in a binary setting. On the other hand, Andresini et al. (2021) describe a different approach that improves the performance of a cyber-threat detection model by changing the class of the training samples closest to the decision boundary during the training stage.

In this study, we also explore the achievements of adversarial training in cybersecurity applications. While previous studies mainly generated adversarial samples to train a single cyber-threat detection model, we process training data augmented with adversarial samples to train an ensemble of cyber-threat detection models.

3.2 Ensemble learning

Weighted ensemble learning methods have been formulated in the machine learning literature to balance the diversity and the individual accuracy of ensemble members (Mao et al., 2019; Sesmero et al., 2021). Alternatively, various studies on ensemble selection focus on selecting the subset of base models that may perform better than the whole ensemble system. These studies commonly formulate ensemble selection as an optimization problem that can be solved by heuristic optimization or mathematical programming (Dong et al., 2020). Jan and Verma (2020) eliminate base models from the ensemble system in multiple rounds whenever they do not contribute to the overall ensemble accuracy. Puuronen and Tsymbal (2001) describe an accuracy-based approach to select ensemble base models. Bian and Chen (2021) compare two pruning methods to select a subset of the original ensemble, according to a measurement of diversity based on the error decomposition. Tsymbal et al. (2005) describe several diversity metrics that mainly compare the labels predicted by base models. Mauri et al. (2023) have recently described an ensemble system that uses clustering to separate training data into clusters based on a risk index. This risk index is computed as the proximity of the training samples to class separation surfaces. The method then trains an ensemble system by fusing classification models learned from the separate clusters.

Ensemble learning has been recently studied in several research areas. For example, deep ensemble learning architectures are widely explored in recommender systems for feature combination learning. For this purpose, Guo et al. (2017) propose a deep factorisation approach to handle high-dimensional sparse features commonly generated from categorical data (e.g., gender, organization) through the one-hot-encoding strategy. They describe a deep parallel architecture that fuses a Factorization Machine component and a Deep Neural Network component. This approach is generalised by Lian et al. (2018) who learn linear regression weights for FM layers instead of linking units of FM layers to the output unit without any coefficients.

Ensemble learning techniques have been recently explored also in computer vision both for generating adversarial images and for performing adversarial defense in image classification problems. A survey of recent advancements in this area is described by Lu et al. (2022).

Ensemble learning also attracts wide interest in cybersecurity as a means to improve detection accuracy in the presence of possible changes in cyber-threat behaviours (Jing et al., 2022). A survey of recent studies exploring ensemble learning in intrusion detection has been described by Tama and Lim (2021). Ensemble learning is coupled with multi-view learning for Android malware detection by Appice et al. (2020). In the latter study, diversity is achieved by leveraging independent feature sets to learn separate base models. On the other hand, Jing et al. (2022) train an ensemble system of diverse base models selected by comparing 24 state-of-the-art sequence-based classifiers.

In this study, we propose a new ensemble method for cybersecurity problems. Our proposal differs from previous work on ensemble learning in cybersecurity, where the diversity of individual base models was mainly achieved by using different base algorithms or independent feature sets. In our study, we take advantage of adversarial samples to build diverse base models that can gain accuracy under adversarial attacks. Another novelty is that our study exploits global explanations of the decisions produced by base models to enhance the diversity of ensemble models.

3.3 XAI

XAI techniques are mainly investigated in cybersecurity to produce explanations for the decisions of neural models trained for various cyber-threat detection problems. XAI techniques are used by Wang et al. (2020) and Andresini et al. (2021) to identify the most relevant input features for recognising each category of intrusion. Andresini et al. (2022) use the attention mechanism to improve the accuracy of neural models, by focusing the attention of the deep neural network on factors which are more relevant for recognising attacks in network traffic data.

A few recent studies have explored XAI coupled with adversarial learning in cybersecurity. Marino et al. (2018) explore the effectiveness of an adversarial learning approach adopted to explain why some network traffic intrusions are mis-classified by a deep neural network. Kuppa and Le-Khac (2021) illustrate a black-box attack approach that uses XAI techniques to compromise confidentiality and privacy properties of underlying intrusion detection classifiers. Al-Essa et al. (2022) explore how to couple XAI with adversarial training. They use local explanations of decisions to guide the fine-tuning of a neural model trained through adversarial training in both network intrusion detection problems and malware detection problems. Vardhan et al. (2021) describe an approach to recognize adversarial samples using an ensemble of explanation techniques.

Our study continues to explore an XAI-based approach coupled with adversarial training, in a new attempt to exploit XAI to increase the separability of the base neural models trained for an ensemble system through adversarial training. To the best of our knowledge, this is the first study that uses XAI to improve the diversity of the neural models selected for the ensemble fusion.

4 Proposed method

Let us consider a dataset \(\mathcal {D} = \{ \left( \textbf{x}_i, y_i \right) \}_{i=1}^N\) of N training samples, where \(\textbf{x} \in \mathbb {R}^d\) is a d-dimensional vector of input features that describe cyber-data samples,Footnote 1 and \(y \in \{ 1, \ldots , K \}\) is the label variable with K classes (benign class and several categories of cyber-threats), according to labels of samples historically collected. The proposed cyber-threat detection method, illustrated in Fig. 1, is based on five steps:

  • The training of an initial neural model \(M_\theta :\mathbb {R}^d \mapsto Y\) with parameter \(\theta\) learned from \(\mathcal {D}\).

  • The generation of an adversarial set \(\mathcal {A}\) produced by \(\mathcal {D}\) with data perturbation threshold \(\epsilon\) by using \(M_\theta\).

  • The training of \(\eta\) neural model candidates learned from \(\mathcal {D}\), augmented with subsets of \(\sigma\) adversarial samples randomly selected from \(\mathcal {A}\).

  • The use of both a post-hoc global XAI technique to explain the decisions of neural model candidates and a clustering stage to group the neural model candidates with high similarity in decision explanations.

  • A multi-headed neural network that fuses together base neural models selected through clustering.

    Algorithm 1: PANACEA algorithm

The input parameters of the proposed method are: (1) \(\epsilon\), which represents the amount of data perturbation considered to generate adversarial samples; (2) \(\sigma\), which defines the number of adversarial samples randomly selected for learning each neural model candidate with the adversarial training strategy; (3) \(\eta\), which is the number of distinct neural model candidates learned with the adversarial training strategy. The pseudo code of PANACEA is described in Algorithm 1; an illustrative reconstruction of its flow is sketched below.
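
Since Algorithm 1 is rendered as a figure, the following Python-style sketch reconstructs its flow from the line numbers cited in the text. All helper names (train, fgsm_set, dalex_explanation, kmedoids_elbow, finetune_fusion) are hypothetical placeholders, not the authors' code.

    def panacea(D, epsilon, sigma, eta):
        """Illustrative reconstruction of Algorithm 1 (helpers are hypothetical)."""
        M = train(D)                                  # line 2: initial model M_theta
        A = fgsm_set(M, D, epsilon)                   # line 3: adversarial set, Eq. 2
        candidates, X = [], []
        for i in range(eta):
            D_i = union(D, sample(A, sigma))          # Eq. 3: augmented training set
            M_i = train(D_i)                          # line 6: candidate M_theta_i
            candidates.append(M_i)
            X.append(dalex_explanation(M_i, D_i))     # line 7: explanation vector X[i]
        base_models = kmedoids_elbow(X, candidates)   # line 9: XAI-based pruning
        return finetune_fusion(base_models, D)        # line 10: multi-headed ensemble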

Fig. 1: Schema of PANACEA

First, the neural model \(M_\theta\) is learned from \(\mathcal {D}\) (line 2, Algorithm 1). Subsequently, the FGSM algorithm is used to build \(\mathcal {A}\) (line 3, Algorithm 1). FGSM is founded on the gradient formula:

$$\begin{aligned} g(\textbf{x})=\nabla _{\textbf{x}} J(\theta ,\textbf{x},y), \end{aligned}$$
(1)

where \(\nabla _{\textbf{x}}\) represents the gradient computed with respect to \(\textbf{x}\), and \(J(\theta ,\textbf{x},y)\) is the loss function of \(M_\theta\). Specifically, FGSM applies the perturbation of amount \(\epsilon\) to a training sample \(\textbf{x}\) in the direction that maximizes J(), in order to create an adversarial sample. Therefore, given \(\epsilon\), for each \((\textbf{x},y) \in \mathcal {D}\), a new sample \((\mathbf {x^{adv}},y) \in \mathcal {A}\) can be generated so that:

$$\begin{aligned} \mathbf {x^{adv}}=\textbf{x}+\epsilon \times sign(g(\textbf{x})). \end{aligned}$$
(2)
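
As a concrete illustration of Eqs. 1 and 2, the following TensorFlow sketch computes FGSM adversarial samples for a batch of labelled samples. The experiments actually rely on the Adversarial Robustness Toolbox implementation (see Sect. 6.3), so this function is only a minimal re-implementation for clarity.

    import tensorflow as tf

    def fgsm(model, x, y, epsilon):
        """Minimal FGSM sketch: x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
        x = tf.convert_to_tensor(x, dtype=tf.float32)
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
        with tf.GradientTape() as tape:
            tape.watch(x)
            loss = loss_fn(y, model(x))    # J(theta, x, y), Eq. 1
        g = tape.gradient(loss, x)         # gradient with respect to the input
        return x + epsilon * tf.sign(g)    # Eq. 2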

Both \(\mathcal {D}\) and \(\mathcal {A}\) are processed through an adversarial training strategy, in order to train \(\eta\) neural model candidates (line 6, Algorithm 1). In this step, the parameter \(\theta _i\) of each neural model candidate \(M_{\theta _i}\) is learned from an enhanced training set \(\mathcal {D}_{\oplus \mathcal {A}_i}\). Each \(\mathcal {D}_{\oplus \mathcal {A}_i}\) is generated by augmenting \(\mathcal {D}\) with \(\sigma\) new samples randomly selected from \(\mathcal {A}\), so that:

$$\begin{aligned} \mathcal {D}_{\oplus \mathcal {A}_i}=\mathcal {D} \cup sample(\mathcal {A},\sigma ). \end{aligned}$$
(3)

For each trial i, the function sample() denotes the stratified sampling algorithm run to select \(\sigma\) samples from \(\mathcal {A}\). This guarantees the selection of a number of random samples per class that is proportional to the class frequency in the original training set \(\mathcal {D}\). Note that the proposed approach works similarly to bootstrapping, as we use stratified sampling to generate \(\eta\) different, possibly overlapping, stratified subsets of \(\mathcal {A}\). This prompts the independent adversarial training of \(\eta\) neural model candidates \(M_{\theta _i}\) that potentially show different estimations of \(\theta _i\). Thanks to the use of the adversarial training strategy, each \(M_{\theta _i}\) is expected to be more robust than \(M_\theta\) against possible adversarial attacks.
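
A minimal sketch of one augmentation trial, assuming the adversarial set is held as arrays (A_x, A_y) and using scikit-learn's stratified splitting; the helper below is illustrative rather than the authors' exact sampler.

    import numpy as np
    from sklearn.model_selection import train_test_split

    def augment(D_x, D_y, A_x, A_y, sigma, seed):
        """Build D_plus_A_i = D union sample(A, sigma), class-stratified (Eq. 3)."""
        A_sel_x, _, A_sel_y, _ = train_test_split(
            A_x, A_y, train_size=sigma, stratify=A_y, random_state=seed)
        return np.vstack([D_x, A_sel_x]), np.concatenate([D_y, A_sel_y])

    # one independent draw per candidate i = 1, ..., eta:
    # D_i = augment(D_x, D_y, A_x, A_y, sigma, seed=i)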

Subsequently, the XAI framework DALEX is used to extract global explanations of the behaviour of the neural model candidates (line 7, Algorithm 1). The extracted explanations measure the effect of the observed input features on the training sample classifications. Specifically, PANACEA integrates the global explanation methodology of DALEX, which uses a permutation-based, variable-relevance, black-box algorithm, in order to measure the global effect of input features on the decisions of neural model candidates. For each input feature, its effect is removed by permuting the values of the feature, and a loss function compares the classification performance before and after this re-sampling operation. If an input feature is relevant for a decision, the random permutation of its values will cause an increase in the loss. By inspecting how the relevance of each input feature changes for each neural model candidate, a feature-vector explanation of the neural model can be generated. This feature-vector explanation collects the global relevance, measured by DALEX, of each input feature in the considered neural model. In this way, a new dataset \(\mathcal {X}\) is built with size \(\eta \times d\), where each row i corresponds to the neural model candidate \(M_{\theta _i}\), and each column j corresponds to the input feature \(X_j\). In particular, the cell \(\mathcal {X}[i,j]\) stores the global relevance value of the feature \(X_j\) in the neural model \(M_{\theta _i}\), as it has been calculated on \({\mathcal {D}_{\oplus \mathcal {A}_i}}\) by DALEX. Thus, row \(\mathcal {X}[i]=(\mathcal {X}[i,1],\ldots , \mathcal {X}[i,d])\) denotes the feature explanation vector of the neural model candidate \(M_{\theta _i}\), i.e., the vector of the global relevance values measured with DALEX for each input feature involved in the label decision process of \(M_{\theta _i}\).
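
The following sketch shows how the explanation matrix \(\mathcal {X}\) could be assembled with the dalex package. It is a sketch under stated assumptions: the wrapper details for a multi-class Keras model (e.g., an explicit predict function or loss) may need adjusting, and storing the increase in loss after permutation (dropout loss minus full-model loss) is our reading of the global-relevance measurement, not the authors' verbatim code. The variables eta, d, candidates and feature_names are assumed from the surrounding context.

    import numpy as np
    import dalex as dx

    X = np.zeros((eta, d))                       # eta candidates x d features
    for i, (model_i, (x_aug, y_aug)) in enumerate(candidates):
        expl = dx.Explainer(model_i, x_aug, y_aug, verbose=False)
        vi = expl.model_parts()                  # permutation-based importance
        imp = vi.result.set_index("variable")["dropout_loss"]
        full_loss = imp["_full_model_"]          # loss before any permutation
        # relevance of feature j = increase in loss after permuting feature j
        X[i, :] = [imp[f] - full_loss for f in feature_names]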

The information enclosed in \(\mathcal {X}\) is leveraged to promote the ensemble diversity, by selecting base neural models that achieve label decisions by assigning higher relevance to possibly different sub-spaces of the input feature space. Based upon the ensemble diversity theory described by Tsymbal et al. (2005), the diversity of a neural model candidate \(M_{\theta _i}\) can be measured as the pairwise diversity over all the pairs of produced neural models including \(M_{\theta _i}\). Hence, the total ensemble diversity is the average of the diversity of all the neural model pairs in the ensemble system. In this study, we measure differences in the global relevance of input features for decisions yielded by different neural model candidates. Therefore, the pairwise diversity of two neural model candidates \(M_{\theta _i}\) and \(M_{\theta _j}\) is computed as the square Euclidean distance between their feature explanation vectors \(\mathcal {X}[i]\) and \(\mathcal {X}[j]\), that is:

$$\begin{aligned} div(i,j)= {\displaystyle \sum _{k=1}^d{(\mathcal {X}[i,k]-\mathcal {X}[j,k])^2}}. \end{aligned}$$
(4)

The clustering step is performed to group neural model candidates with similar feature explanation vectors in the same clusters, and neural model candidates with dissimilar feature explanation vectors in separate clusters (line 9, Algorithm 1). According to clustering theory, cluster prototypes are expected to be distant from each other, which, in this case, means performing decisions that give more importance to different sub-spaces of the input feature space. In PANACEA, the clustering step is performed with the k-medoids method (Kaufman & Rousseeuw, 2008) (using PAM as the k-medoids algorithm) executed on the rows of \(\mathcal {X}\). This is a clustering algorithm that divides the rows of \(\mathcal {X}\) into k clusters by selecting k medoids as the base neural models for the ensemble fusion. Each cluster medoid is a neural model candidate that acts as the cluster’s prototype. In particular, a medoid is the neural model candidate in the cluster whose sum of distances to all the other neural model candidates in the cluster is minimal. In this study, the distance is computed as the square Euclidean distance between the explanation feature vectors, according to Eq. 4.Footnote 2 To automate the selection of k, the Elbow method is used. This is an empirical strategy commonly adopted in cluster analysis to automatically determine the number of clusters in a dataset (Thorndike, 1953). This method helps to identify the optimal number of clusters, independently of the level of similarity among the samples to divide into clusters. Specifically, the inertia (one of the most commonly used metrics for clustering algorithms) of the clusters discovered through the k-medoids method is measured. The inertia is computed as the within-cluster sum of the square Euclidean distances between the explanation feature vector of each neural model candidate and the medoid of the cluster it is assigned to. Formally,

$$\begin{aligned} inertia=\displaystyle \sum _{i=1}^{\eta } div(i,medoid(i)), \end{aligned}$$
(5)

where div() is computed according to Eq. 4 and medoid(i) is the medoid neural model of the cluster to which the i-th neural model candidate is assigned during the clustering step. Let us consider the inertia curve, where the inertia (axis Y) is a function of k (axis X), with k varying between 1 and \(\eta\). The value of k at the knee point of the inertia curve is the elbow k used for the clustering (i.e., the value at which a higher k stops adding useful information and makes the clusters harder to separate). The knee point detection algorithm (Satopaa et al., 2011) is used to detect the elbow k on the inertia curve.
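
A sketch of the pruning step under stated assumptions: scikit-learn-extra's KMedoids (run in PAM mode) clusters the rows of \(\mathcal {X}\) with the squared Euclidean distance, and the kneed package (an implementation of the knee point detection algorithm of Satopaa et al., 2011) picks the elbow k. The experiments actually use the Yellowbrick elbow implementation with k ranging between 2 and \(\eta\) (see Sect. 6.3).

    from sklearn_extra.cluster import KMedoids
    from kneed import KneeLocator

    fits, inertias = {}, {}
    for k in range(2, eta + 1):                       # candidate numbers of clusters
        km = KMedoids(n_clusters=k, metric="sqeuclidean", method="pam",
                      init="k-medoids++", random_state=0).fit(X)
        fits[k], inertias[k] = km, km.inertia_        # within-cluster sum, Eq. 5

    ks = sorted(inertias)
    elbow_k = KneeLocator(ks, [inertias[k] for k in ks],
                          curve="convex", direction="decreasing").knee
    base_idx = fits[elbow_k].medoid_indices_          # selected base neural models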

Finally, let \(M_{\theta _{i_1}}, M_{\theta _{i_2}}, \ldots , M_{\theta _{i_k}}\) be the base neural models selected through the clustering step. These models are fused together as sub-networks of a multi-headed neural network architecture. This fusion architecture also comprises a concatenation layer followed by layers for the final classification. The layers added to the multi-headed neural network after the concatenation are called the fusion sub-network. The fusion architecture takes the base deep neural networks, with the parameters pre-trained on different enhanced training sets, as sub-networks and fine-tunes them on the original training set \(\mathcal {D}\) by using the back-propagation strategy (line 10, Algorithm 1). This strategy allows us to fine-tune the parameters of each base neural model simultaneously with the concatenation layer of the ensemble system. The advantage of fine-tuning is that the base sub-networks of the ensemble architecture are initialised with the previously learned information, which is then adapted to obtain better results. In addition, fine-tuning coupled with back-propagation allows us to share information among the pre-trained sub-networks during their ensemble fusion, thus reducing the overfitting of each single neural model and producing classification decisions that are more accurate on unseen data.
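
A minimal Keras sketch of the multi-headed fusion architecture, assuming the k selected base models share the same input shape and have distinct internal names; the width of the fusion FC layer is an assumption (in the experiments it is chosen by hyper-parameter optimization, Sect. 6.3).

    from tensorflow import keras

    def build_fusion_model(base_models, num_classes, fusion_units=64):
        """Fuse pre-trained base models as trainable sub-networks (line 10)."""
        inp = keras.Input(shape=base_models[0].input_shape[1:])
        heads = [m(inp) for m in base_models]      # each base model stays trainable
        z = keras.layers.Concatenate()(heads)      # concatenate base model outputs
        z = keras.layers.Dense(fusion_units, activation="relu")(z)  # fusion sub-network
        out = keras.layers.Dense(num_classes, activation="softmax")(z)
        return keras.Model(inp, out)

    # fine-tuning on the original training set D:
    # model = build_fusion_model(selected_models, K)
    # model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(D_x, D_y, ...)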

Notice that the performance of PANACEA may depend on the input parameters \(\epsilon\), \(\sigma\) and \(\eta\). In general, the perturbation \(\epsilon\) is selected as a small value in the range between 0 and 0.1 (Bai et al., 2021), in order to scale the noise and ensure that perturbations are small enough to remain undetected by the human eye, but large enough to fool the attacked neural model. In this study, an approach for the automatic selection of \(\epsilon\) is proposed, based on the characteristics of the produced adversarial samples. In particular, this approach is founded on the idea that the value at which a lower \(\epsilon\) stops perturbing training samples, by diminishing the number of mis-classified adversarial training samples, may correspond to an adequate value of \(\epsilon\) for gaining accuracy with the adversarial training strategy. Hence, for each \(\epsilon\) in the range [0, 0.1], the adversarial set \(\mathcal {A}_\epsilon\) is considered. This comprises the adversarial samples produced for all training samples of \(\mathcal {D}\), through FGSM, with \(\epsilon\) according to Eq. 2. OA(\(\mathcal {A}_\epsilon\)) is computed as the percentage of correct classification decisions yielded by \(M_\theta\) on the samples of \(\mathcal {A}_\epsilon\). Subsequently, the Elbow method is used to pick the knee of the OA(\(\mathcal {A}_\epsilon\)) curve as the elbow value of \(\epsilon\).

Finally, the elbow \(\epsilon\) is selected to build the adversarial set for the ensemble system of PANACEA. Notably, this procedure for the automatic selection of \(\epsilon\) is independent of both \(\sigma\) and \(\eta\), which remain user-defined parameters. In fact, OA(\(\mathcal {A}\)) can be measured before learning the \(\eta\) neural model candidates from the training set augmented with \(\sigma\) random adversarial samples.
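
A sketch of the automatic \(\epsilon\) selection, under the assumption that the OA(\(\mathcal {A}_\epsilon\)) curve is convex and decreasing in \(\epsilon\); the \(\epsilon\) grid is the one reported in Sect. 6.3, and fgsm() refers to the sketch after Eq. 2.

    from kneed import KneeLocator

    eps_grid = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]       # grid from Sect. 6.3
    oa = []
    for eps in eps_grid:
        x_adv = fgsm(model, x_train, y_train, eps)        # A_eps via Eq. 2
        y_hat = model.predict(x_adv).argmax(axis=1)
        oa.append(float((y_hat == y_train).mean()))       # OA(A_eps)

    # knee of the OA curve (convexity/direction are assumptions about its shape)
    elbow_eps = KneeLocator(eps_grid, oa, curve="convex",
                            direction="decreasing").knee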

5 Time complexity

In this section, we describe the time complexity of the learning stage of PANACEA. Let l be the maximum number of layers in the neural network architectures trained to learn both the initial neural model and the neural model candidates, r be the maximum number of neurons per layer and e be the number of epochs. Let \(w=r^2\) be the maximum number of weights per layer. Let N be the number of samples in the original training set, \(N+\sigma\) be the number of samples in each augmented training set considered to learn a neural model candidate and d be the number of the independent variables in the input space of the learned neural models. Let \(\eta\) be the number of neural model candidates and k be the number of base neural models selected from the candidate collection for the ensemble system through the clustering step. The time cost of PANACEA is the sum of the cost of the following stages.

Training an initial neural model \(M_\theta\) The cost of training \(M_\theta\) mainly depends on the cost of computing the gradient descent (in the back-propagation stage) on a training set of N samples, for e epochs, and across \(l\times w\) weights of the deep neural network architecture. This is \(\textbf{O}(l \times w \times N \times e)\).

Generating the adversarial set \(\mathcal {A}\) The cost of building \(\mathcal {A}\) is the cost that FGSM spends computing the gradient g() on a set of N samples. g() is computed through the \(l \times w\) weights in \(M_\theta\). This cost is \(\textbf{O}(l\times w\times N)\) as described by Wong et al. (2020).

Training neural model candidates \(M_{\theta _i}\) The cost of training a neural model candidate \(M_{\theta _i}\) is the cost of training a deep neural network by processing a training set of \(N+\sigma\) samples. As discussed above, this cost is \(\textbf{O}(l \times w \times (N+\sigma ) \times e)\). Hence, the cost of training \(\eta\) neural model candidates is \(\textbf{O}(\eta \times l\times w \times (N+\sigma )\times e)\).

Generating feature vector explanations \(\mathcal {X}[i]\) The cost of generating the feature vector explanation \(\mathcal {X}[i]\) of \(M_{\theta _i}\) is the cost that DALEX spends computing the difference of the loss for \(M_{\theta _i}\) before and after resampling the set of \(N+\sigma\) samples processed to learn \(M_{\theta _i}\). The cost of computing the loss of \(M_{\theta _i}\) on a set of \(N+\sigma\) samples is \(l \times w \times (N+\sigma )\). The cost of performing the resampling operation is \(N+\sigma\). Hence, the cost of computing the difference of loss is \((N+\sigma )+ 2\times (l \times w \times (N+\sigma ))\), that is, \(\textbf{O}(l \times w \times (N+\sigma ))\). This operation is repeated for each independent variable of the input space. Hence, the cost of computing \(\mathcal {X}[i]\) is \(\textbf{O}(d \times l \times w \times (N+\sigma ))\), while the cost of generating \(\eta\) feature vector explanations \(\mathcal {X}[i]\) for \(\eta\) neural model candidates \(M_{\theta _i}\) is \(\textbf{O}(\eta \times d \times l \times w\times (N+\sigma ))\).

Selecting base neural models for the ensemble system The cost of selecting the k base neural models for the ensemble system is the cost of performing the clustering step on the collection of feature vector explanations produced for the neural model candidates. The cost of running the k-medoids algorithm to perform the clustering step of \(\eta\) feature vector explanations is proportional to \(n_{iter} \times k\times (\eta -k)^2\), where \(n_{iter}\) is the number of iterations performed to complete the clustering step. This cost is described by Reynolds et al. (2006). The cost of computing the inertia is proportional to the number of feature vector explanations \(\eta\), that is, the cost of computing Eq. 5. The automatic estimation of k with the Elbow method requires that the clustering step is repeated with k ranging between 1 and \(\eta\). So the final cost of this step is \(\displaystyle \sum _{k=1}^{\eta }{\left( n_{iter} \times k\times (\eta -k)^2\right) }+\eta ^2\). By considering that \(\eta ^2 < \displaystyle \sum _{k=1}^{\eta }{\left( n_{iter} \times k\times (\eta -k)^2\right) }\), the term \(\eta ^2\) can be dropped, so that the time cost of this stage is \(\textbf{O}(\displaystyle \sum _{k=1}^{\eta }{\left( n_{iter} \times k\times (\eta -k)^2\right) })\).

Generating the ensemble system The cost of training the multi-headed deep neural network of the final ensemble system is \(\textbf{O}((k\times l\times w+l_c\times w_c)\times N\times e)\). This is the cost of computing the gradient descent on a set of N samples for a deep neural network that includes k head networks and a fusion sub-network. Specifically, each head network contains l layers and w weights per layer, while the fusion sub-network contains \(l_c\) layers and \(w_c\) weights per layer.

A summary of the time cost of the several stages of PANACEA is reported in Table 1.

Table 1 Time complexity summary

6 Empirical evaluation and discussion

The performance of PANACEA is evaluated by performing several experiments on four benchmark cybersecurity datasets. These experiments aimed to explore the performance of the ensemble architecture, as well as the achievements of coupling both adversarial training and explainability. The datasets are presented in Sect. 6.1, while the performance metrics are described in Sect. 6.2. The implementation details of the proposed method are reported in Sect. 6.3. The results are illustrated in Sect. 6.4.

6.1 Dataset description

Four multi-class datasets, i.e., NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20 were considered, in order to evaluate the performance of PANACEA. These datasets contain feature-vector data collected in both network security (NSL-KDD, UNSW-NB15 and CICIDS17) and malware security (CICMalDroid20) applications.

Table 2 Data description

NSL-KDD This datasetFootnote 3 comprises normal network flow traces and four categories of network traffic attacks. In particular, the dataset contains two rare attacks: user to root (U2R) and remote to local (R2L). The training set is made up of 21 different attack sub-categories, while the test set is composed of 37 different attack sub-categories (with 16 novel attacks in the test set). This dataset contains 3 categorical features, 38 numerical features and 1 class feature. A detailed description of these features is reported in Tavallaee et al. (2009). While this dataset may not perfectly represent existing real-world networks, recent state-of-the-art studies still use it as an effective benchmark dataset to help researchers compare different multi-class network intrusion detection methods. This study was conducted using the data setting that includes KDDTrain+20Percent as the training set and KDDTest+ as the testing set (Tavallaee et al., 2009).

UNSW-NB15 This datasetFootnote 4 includes realistic normal activities and synthetic attack behaviours extracted from network traffic monitored in 2015. Both the training set and the testing set contain normal network flow traces and nine categories of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. Backdoors, Analysis, Shellcode and Worms are rare classes. This dataset contains 3 categorical features, 40 numerical features and 1 class feature. A detailed description of these features is reported in Moustafa and Slay (2015).

CICIDS17 This dataset collects the data extracted with the CICFlowMeter tool from the Canadian Institute for Cybersecurity. The data concern a network testbed composed of various devices and operating systems, with distinct victim and attacker networks communicating over the Internet. The refined version of CICIDS17Footnote 5 recently released by Engelen et al. (2021) was used in this study. This version addressed flaws in the original dataset by removing meaningless artifacts, dataset errors and mislabeled traces. In particular, the revised dataset retained 71 numeric features from the original input set and 1 class feature. A detailed description of the features is reported in Engelen et al. (2021). For the experiments, stratified sampling without replacement was performed, in order to extract two independent sets of 100,000 samples. These sets were processed as the training set and the testing set, respectively. In accordance with the original distribution of the data, we extracted \(80\%\) benign samples and \(20\%\) attacks for both the training and the testing set. The processed dataset comprised 8 types of attacks and the benign class.

CICMalDroid20 This datasetFootnote 6 includes recent samples of Android apps that have been collected from several sources, such as the VirusTotal service and the Contagio security blog. The dataset collects samples labeled with the benign class and four distinct malware categories. As described by Mahdavifar et al. (2022), each app is described by 40 numeric features, which represent the top-ranked dynamically observed behaviours (ranked using Mutual Information), and 1 class feature. A detailed description of the features is reported in Mahdavifar et al. (2022). For the experiments, a stratified division of the dataset into the training set (70%) and the testing set (30%) was used.

A summary of the characteristics of the datasets is reported in Table 2.

6.2 Performance metrics

Standard multi-class classification metrics, i.e., WeightedF1, MacroF1 and OA, were measured on the decisions yielded on the testing sets. WeightedF1 and MacroF1 are expected to measure close values in balanced domains, while WeightedF1 may yield a misleading evaluation of the performance on minority classes in imbalanced domains, due to the prominence of the majority classes in the metric. As this study includes both balanced and imbalanced datasets, the accuracy performance of the proposed method was explored along all these metrics in the experimentation. In addition, the efficiency performance was analysed through the computation time spent training the ensemble model and the average time spent predicting the class of every testing sample. The time measurements were collected on a Linux machine with an Intel(R) Core(TM) i7-9700F CPU @ 3.00GHz and 32GB RAM. All the experiments were executed on a single GeForce RTX 2080. The total training TIME was measured in minutes, while the average testing TIME was measured in milliseconds. Furthermore, the total training TIME was measured assuming that the classification model candidates were learned in parallel with the ensemble fusion. Finally, the ensemble DIVERSITY of the base neural models fused through the ensemble system was measured. This is the average square Euclidean distance between the explanation feature vectors of all the pairs of base neural models, i.e., \(DIVERSITY=\displaystyle \frac{1}{k(k-1)}\sum _{i=1}^k{\sum _{j\ne i, j=1}^k{div(i,j)}}\), where div() is computed according to Eq. 4.
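
A sketch, assuming scikit-learn metrics, of how the accuracy metrics and the ensemble DIVERSITY (with div() as in Eq. 4) could be computed; X_sel denotes the explanation vectors of the k selected base models.

    import numpy as np
    from sklearn.metrics import f1_score, accuracy_score

    def div(X_sel, i, j):
        """Pairwise explanation diversity, Eq. 4."""
        return float(np.sum((X_sel[i] - X_sel[j]) ** 2))

    def report(y_true, y_pred, X_sel):
        k = len(X_sel)
        diversity = sum(div(X_sel, i, j)
                        for i in range(k)
                        for j in range(k) if j != i) / (k * (k - 1))
        return {"WeightedF1": f1_score(y_true, y_pred, average="weighted"),
                "MacroF1": f1_score(y_true, y_pred, average="macro"),
                "OA": accuracy_score(y_true, y_pred),
                "DIVERSITY": diversity}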

6.3 Implementation details

PANACEA was implemented with Python 3.9 and Keras 2.7 – a high-level neural network API integrated in TensorFlow.Footnote 7 In the pre-processing step, the input categorical features were mapped into numerical features using the one-hot-encoding strategy.Footnote 8 The input numeric features were scaled using min-max normalization.Footnote 9 Due to the large number of features generated through the one-hot-encoding of the categorical features in UNSW-NB15, we used the PCA algorithmFootnote 10 to select the number of components retaining 97% of the variance.
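
A sketch of the pre-processing pipeline, assuming scikit-learn (≥1.2) transformers; the column lists are placeholders, and the PCA step applies only to UNSW-NB15.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False),
         categorical_cols),                       # placeholder column list
        ("num", MinMaxScaler(), numeric_cols),    # min-max normalization
    ])
    # UNSW-NB15 only: retain the components explaining 97% of the variance
    pipeline = Pipeline([("pre", pre), ("pca", PCA(n_components=0.97))])
    x_train_p = pipeline.fit_transform(x_train_raw)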

Table 3 Hyper-parameter search space

For each dataset, the hyper-parameters of the neural models were optimized with the tree-structured Parzen estimator (TPE) algorithm, using \(20\%\) of the entire training set as a validation set, according to the Pareto principle. The Parzen estimator algorithm was implemented in the Hyperopt library. The configuration of the hyper-parameters that achieved the lowest validation loss was selected. The hyper-parameters and their corresponding possible values are reported in Table 3.
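
A sketch of the TPE search with the Hyperopt library; the search space below is illustrative (the actual ranges are in Table 3), and build_base_model is the helper sketched after the architecture description below.

    from hyperopt import fmin, tpe, hp, Trials

    space = {                                     # illustrative space; see Table 3
        "units": hp.choice("units", [64, 128, 256]),
        "dropout": hp.uniform("dropout", 0.1, 0.5),
    }

    def objective(params):
        model = build_base_model(d, K, **params)  # sketched below
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        hist = model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
                         epochs=20, verbose=0)
        return min(hist.history["val_loss"])      # lowest validation loss wins

    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())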

Both the initial neural network considered to generate the adversarial set and each base deep neural network candidate considered for the ensemble system were defined with 3 fully-connected (FC) layers, one dropout layer and one batch-normalization layer, to prevent the overfitting phenomenon. The number of neurons in the FC layers was selected with the hyper-parameter optimization. The output probabilities were obtained using the softmax activation function in the last layer. The Rectified Linear Unit (ReLU) activation function was used in all the other hidden layers. The ensemble neural network was implemented as a multi-headed neural network architecture that combined the base deep neural networks as sub-network blocks with trainable layers. A concatenation layer was added, in order to concatenate the outputs of the base networks and feed the output of the concatenation to an FC layer. This FC layer populates the fusion sub-network. All the deep neural networks were trained with mini-batches using the back-propagation strategy. The gradient-based optimization was completed using the Adam update rule. The weights were initialized according to the Xavier scheme. In addition, the maximum number of epochs was set equal to 150 and an early stopping approach was adopted. The early stopping was based on the lowest loss on the same validation set considered in the hyper-parameter optimization, in order to retain the best models.
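
A sketch of one base candidate network matching the description above (3 FC layers, one dropout layer, one batch-normalization layer, ReLU, softmax, Xavier initialization); the exact placement of the dropout and batch-normalization layers is an assumption where the text leaves it open.

    from tensorflow import keras

    def build_base_model(d, num_classes, units=128, dropout=0.3, name=None):
        """3 FC layers + dropout + batch-normalization, as in Sect. 6.3."""
        return keras.Sequential([
            keras.Input(shape=(d,)),
            keras.layers.Dense(units, activation="relu",
                               kernel_initializer="glorot_uniform"),   # Xavier
            keras.layers.BatchNormalization(),
            keras.layers.Dropout(dropout),
            keras.layers.Dense(units, activation="relu",
                               kernel_initializer="glorot_uniform"),
            keras.layers.Dense(num_classes, activation="softmax"),
        ], name=name)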

The FGSM algorithm (as implemented in the adversarial robustness toolbox libraryFootnote 11) was considered to build the adversarial set. To automate the selection of \(\epsilon\), the Elbow method was run on the set {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}. The DALEX Python package 1.2.0Footnote 12 was integrated to measure the global relevance of input features in base neural model decisions, and the k-medoids algorithmFootnote 13 was used to perform the clustering of base neural models. The k-medoids algorithm was run with k-medoids++ as the medoid initialization method. To automate the selection of k, the Elbow method was used with k ranging between 2 and \(\eta\). The implementation of the Elbow method available in the Yellowbrick Python packageFootnote 14 was integrated in PANACEA.

6.4 Results and discussion

The main goals of the experimental study are:

  • To study the sensitivity of the performance of the proposed ensemble system to \(\sigma\) (i.e., the number of adversarial samples considered to learn separate neural model candidates for the ensemble) and \(\eta\) (i.e., the number of neural model candidates trained with adversarial training) once \(\epsilon\) is automatically selected with the Elbow method (see Sects. 6.4.1 and 6.4.4).

  • To explore the sensitivity of the performance of the proposed ensemble system to various state-of-the-art algorithms (i.e., FGSM, PGD and DeepFool) used for producing the adversarial set for the adversarial training step (see Sect. 6.4.2).

  • To perform an ablation study to assess the effectiveness of considering both the XAI information and the multi-headed deep neural network in the ensemble fusion (see Sects. 6.4.3 and 6.4.5, respectively).

  • To study the sensitivity of the performance of the proposed ensemble system to the set-up of \(\epsilon\), \(\sigma\) and \(\eta\), decided at the same time for the three parameters (see Sect. 6.4.6).

  • To compare against the results reported in the recent cyber-threat detection literature, which have been achieved by processing the same datasets considered in our study (see Sect. 6.4.7).

  • To illustrate an example that shows the accuracy of a cyber-threat detection operation performed by PANACEA (see Sect. 6.4.8).

6.4.1 Analysis of the adversarial sampling amount \(\sigma\)

Table 4 The value of elbow \(\epsilon\) estimated in PANACEA, and the value of WeightedF1(\(\mathcal {A}\)), MacroF1 (\(\mathcal {A}\)), and OA(\(\mathcal {A}\)), measured by using the baseline neural model to classify the entire adversarial set produced with elbow \(\epsilon\)

The performance of PANACEA depends on the user-defined set-up of \(\sigma\) and \(\eta\), whereas \(\epsilon\) can be automatically chosen, based on the characteristics of the produced adversarial set. Table 4 reports the values of elbow \(\epsilon\) that were automatically chosen during the training stage of PANACEA in all the datasets of this study. This table also reports the accuracy metrics (WeightedF1, MacroF1 and OA) computed on the adversarial set produced in correspondence to elbow \(\epsilon\). In all subsequent analyses illustrated in this experimentation, we explore the performance of the proposed method (and of its baselines and competitors) by measuring the accuracy metrics on the testing set, as we intend to assess the accuracy of the ensemble on unseen data (used neither in the training of the final model nor in the selection of the parameters).

The results collected in Table 4 show that the Elbow method allowed us to choose \(\epsilon =0.01\) in the three datasets that collect network traffic data (NSL-KDD, UNSW-NB15 and CICIDS17), while it chose \(\epsilon =0.00001\) in the dataset that collects Android apps (CICMalDroid20). Although the set-up of \(\epsilon\) may depend on the specific cybersecurity problem, \(\epsilon =0.01\) seems to be a reasonable elbow choice for network intrusion detection, while \(\epsilon =0.00001\) seems to be a reasonable choice for Android malware detection. We also note that elbow \(\epsilon\) generates a higher percentage of misclassified adversarial samples in UNSW-NB15 and CICMalDroid20 than in NSL-KDD and CICIDS17. This highlights that the behaviour of the adversarial sample generator (FGSM in this study) depends on \(\epsilon\), as well as on the characteristics of the processed data.

Table 5 WeightedF1, MacroF1 and OA of PANACEA with \(\sigma =5\%\) and \(10\%\) of the training set size and BASELINE. k denotes the number of distinct neural models automatically selected in the clustering step of PANACEA over \(\eta =100\) neural model candidates
Table 6 F1 per class: BASELINE vs PANACEA with \(\sigma =5\%\) and \(\sigma =10\%\) of the training set size and \(\eta =100\)

In the following, we explore the effect of \(\sigma\) on the accuracy of PANACEA. This analysis is based on the results of experiments that ran PANACEA with \(\sigma =5\%\) and \(10\%\) of the training set size, considering the values of elbow \(\epsilon\) reported in Table 4 and fixing \(\eta =100\) for all datasets. Table 5 reports the number of neural models (k) that the clustering step of PANACEA selected for the ensemble fusion in this experimental configuration, as well as the WeightedF1, MacroF1 and OA of PANACEA. All the accuracy metrics were measured on the testing set of each dataset. As BASELINE, we considered the deep neural network that was trained in the first step of PANACEA as the initial neural model for the adversarial sample production. We recall that the number of clusters k was automatically identified during the clustering step of PANACEA. The results show that PANACEA outperforms BASELINE in UNSW-NB15, CICIDS17 and CICMalDroid20, independently of the number \(\sigma\) of adversarial samples processed. In these three datasets, the gain in accuracy is commonly observed consistently across WeightedF1, MacroF1 and OA. The only exception is the MacroF1 of PANACEA with \(\sigma =5\%\) in UNSW-NB15; however, PANACEA still outperforms BASELINE in both WeightedF1 and OA in this configuration. In addition, at least one tested configuration of PANACEA outperforms BASELINE in NSL-KDD, although in NSL-KDD the gain in accuracy is observed along WeightedF1 and OA, but not along MacroF1. This is presumably due to the presence of minority classes in NSL-KDD, as well as in UNSW-NB15. In fact, in both datasets, the ensemble strategy allows us to gain accuracy by better classifying samples of majority classes, while we may lose accuracy in classifying samples of minority classes. This intuition is confirmed by the analysis of the detailed F1 per class, reported in Table 6, for all the datasets.

Both BASELINE and PANACEA achieve higher F1 on the majority classes and lower F1 on the minority classes. In particular, both methods achieve F1 values equal to, or close to, zero on several highly under-represented minority classes. This result is not surprising, since the scarcity of labelled examples for a minority class often limits the ability of a classifier to learn a model capable of correctly disentangling that class from the remaining ones. For example, the deep neural method described by Andresini et al. (2022) also achieves an F1 equal to 0 on class Worms of UNSW-NB15. In any case, the results of the F1 per class show that PANACEA achieves higher F1 than BASELINE in all the classes of both CICIDS17 and CICMalDroid20. The only exception is observed with the minority class DoS Slowhttptest of CICIDS17 when PANACEA was run with \(\sigma =5\%\) of the training set size. Notably, CICIDS17 and CICMalDroid20 are also the datasets where we observe the highest gain in per-class accuracy of PANACEA with respect to BASELINE, although the gain depends on \(\sigma\) also in these datasets. In fact, the gain in accuracy is higher for all the classes of CICIDS17 when PANACEA is run with \(\sigma =10\%\) of the training set size, while it is higher for all the classes of CICMalDroid20 when PANACEA is run with \(\sigma =5\%\) of the training set size. On the other hand, PANACEA loses accuracy in a few classes of NSL-KDD and UNSW-NB15 for which very few examples are available for training the ensemble, e.g., U2R in NSL-KDD and Worms in UNSW-NB15. These results provide empirical evidence that the ensemble strategy of PANACEA may suffer in the presence of minority classes, and the appropriate choice of \(\sigma\) may not be sufficient to fix this limitation in all cases. This paves the way for future investigations aimed at exploring new, ad-hoc set-ups of both the adversarial training strategy and the ensemble pruning strategy, in order to try to reduce the gap between the highly under-represented classes and the remaining ones and gain accuracy on challenging classes.

In conclusion, this experiment indicates that the proposed ensemble strategy allows PANACEA to achieve overall higher accuracy than BASELINE in almost all the tested configurations. The only exception is observed with \(\sigma =5\%\) of the training set size in NSL-KDD. In any case, the set-up of \(\sigma\) may affect the performance of the ensemble strategy in some datasets. In addition, PANACEA may lose accuracy in the presence of minority classes, since the appropriate selection of \(\sigma\) helps the ensemble strategy of PANACEA gain accuracy in a few minority classes only. Finally, this experiment shows that the best performance of PANACEA is achieved using \(\sigma =10\%\) of the training set size in network intrusion detection (NSL-KDD, UNSW-NB15 and CICIDS17) and \(\sigma =5\%\) of the training set size in malware detection (CICMalDroid20). Nevertheless, the identification of specific guidelines to automate the set-up of \(\sigma\), based on the input features of both the training and adversarial samples, requires further investigation in the future. Moreover, ad-hoc selection strategies may be explored to better handle the imbalanced data condition in the adversarial training stage.
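As an illustration of how the \(\sigma\)-driven candidate generation evaluated in this section can be implemented, the sketch below trains \(\eta\) candidates, each on the training set augmented with a random \(\sigma\)-sized subset of the adversarial pool (labelled with the correct classes). The helper train_fn and the array names are hypothetical placeholders for the actual training routine and data.

```python
import numpy as np

def generate_candidates(X, y, X_adv, y_adv, sigma, eta, train_fn, seed=0):
    """Train eta candidate models; each sees the training set plus a random
    subset of adversarial samples sized sigma * len(X) (sigma = 0.05 or 0.10)."""
    rng = np.random.default_rng(seed)
    n_aug = int(sigma * len(X))               # requires n_aug <= len(X_adv)
    candidates = []
    for _ in range(eta):                      # eta in {25, 50, 75, 100}
        idx = rng.choice(len(X_adv), size=n_aug, replace=False)
        X_aug = np.vstack([X, X_adv[idx]])
        y_aug = np.concatenate([y, y_adv[idx]])
        candidates.append(train_fn(X_aug, y_aug))  # hypothetical training routine
    return candidates
```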

6.4.2 Analysis of the adversarial sample generation algorithm

Table 7 WeightedF1, MacroF1 and OA of BASELINE and PANACEA. PANACEA was trained with FGSM, PGD, DeepFool, and LowProFool, \(\eta =100\) and \(\sigma \in \{5\%, 10\%\}\) of the training set size

We analyze the performance of four white-box adversarial sample generation algorithms, that is, FGSM (Goodfellow et al., 2015), which is used in the rest of this experimental study, PGD (Madry et al., 2018), DeepFool (Moosavi-Dezfooli et al., 2016) and LowProFool (Ballet et al., 2019). LowProFool is a targeted approach (i.e., it causes a classifier to choose a specific target class as the wrong class), so we ran it by randomly selecting, for each sample, the target class among the wrong classes. Table 7 reports the WeightedF1, MacroF1 and OA collected for both BASELINE and PANACEA. The results of PANACEA were collected by using FGSM, PGD, DeepFool and LowProFool for adversarial training, setting \(\eta =100\) and \(\sigma =5\%\) or \(\sigma =10\%\) of the training set size.
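For illustration, the three untargeted attacks can be instantiated, for example, with the Adversarial Robustness Toolbox (ART); whether this library was used in our implementation is not implied here, so treat the wrapper and its parameter values as assumptions. LowProFool is omitted from the sketch because, being targeted, it additionally requires the per-sample random wrong-class targets discussed above, and its interface varies across ART versions.

```python
import numpy as np
import torch.nn as nn
import torch.optim as optim
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent, DeepFool

n_features, n_classes = 20, 5                     # illustrative sizes
net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))
clf = PyTorchClassifier(model=net, loss=nn.CrossEntropyLoss(),
                        optimizer=optim.Adam(net.parameters()),
                        input_shape=(n_features,), nb_classes=n_classes)

X_train = np.random.rand(100, n_features).astype(np.float32)  # stand-in data

attacks = {
    "FGSM": FastGradientMethod(estimator=clf, eps=0.01),
    "PGD": ProjectedGradientDescent(estimator=clf, eps=0.01, eps_step=0.002,
                                    max_iter=10),
    "DeepFool": DeepFool(classifier=clf),
}
adv_sets = {name: atk.generate(x=X_train) for name, atk in attacks.items()}
```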

The results show that PANACEA commonly outperforms BASELINE independently of the adversarial sample generation algorithm used, with a few exceptions that concern the MacroF1 of some configurations of PANACEA measured in NSL-KDD, UNSW-NB15 and CICIDS17. In these configurations, the loss of MacroF1 of PANACEA is commonly coupled with a gain in both WeightedF1 and OA (except for PANACEA (FGSM) and PANACEA (DeepFool) in NSL-KDD with \(\sigma =5\%\)). This behaviour shows that the use of adversarial training may lead PANACEA to lose accuracy on minority classes while gaining accuracy on majority classes. This supports the conclusion drawn in Sect. 6.4.1 that the adversarial training step performed by PANACEA may suffer in the presence of minority classes (which occur in NSL-KDD, UNSW-NB15 and CICIDS17). The results also show that adversarial training with FGSM commonly performs better than adversarial training with PGD, DeepFool and LowProFool. The only exception is observed with PANACEA (LowProFool) in UNSW-NB15 with \(\sigma =10\%\). In general, FGSM turns out to be sensitive to the presence of minority classes in fewer configurations than PGD, DeepFool and LowProFool.

Finally, we note that these results support the idea that the amount \(\sigma\) of adversarial samples required to achieve the highest performance of PANACEA may change according to the cybersecurity problem independently of the adversarial sample generation algorithm. In fact, the highest accuracy of PANACEA continues to be achieved with \(\sigma =5\%\) of the training set size in CICMalDroid20 and \(\sigma =10\%\) of the training set size in NSL-KDD, UNSW-NB15 and CICIDS17. Notably, PANACEA (FGSM) outperforms PANACEA (PGD), PANACEA (DeepFool) and PANACEA (LowProFool) in these configurations for NSL-KDD, CICIDS17 and CICMalDroid20.

6.4.3 Analysis of the XAI-based neural model selection strategy

Table 8 WeightedF1, MacroF1 and OA of NO_SEL, D_SEL, A_SEL, \(\mathsf{A+XAI\_SEL}^{(1)}\), \(\mathsf{A+XAI\_SEL}^{(2)}\) and PANACEA trained with \(\eta =100\) and \(\sigma \in \{5\%,10\%\}\) of the training set size in NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20

To assess the effectiveness of the XAI-based approach for selecting diverse base neural models involved in the ensemble fusion, we compare the performance of PANACEA to that of its baseline configuration NO_SEL that performed no selection. Based on this naive strategy, NO_SEL performed the multi-headed deep neural fusion of all the neural model candidates generated. This comparative study allows us to explore how the proposed ensemble system can actually gain accuracy through the use of the XAI-based pruning mechanism.

In addition, we compare the performance of PANACEA to that of traditional ensemble model pruning strategies, which are based on the inconsistency of prediction results or on the accuracy of the neural model candidates. These pruning strategies were formulated in the previous ensemble learning literature to select the base neural models for the ensemble fusion. In particular, this analysis aims at assessing the effectiveness of our intuition that using an XAI method to measure differences in the importance that the features assume for the models can actually allow PANACEA to select diverse base neural models for an accurate ensemble system. For this purpose, we consider the following traditional strategies, commonly used in previous ensemble learning studies, as baselines:

  • D_SEL that performed the fusion of neural model candidates selected by accounting for the diversity in terms of inconsistency of the prediction results. To this aim, we calculated the plain disagreement measure, one of the most commonly used diversity measures in ensembles (Tsymbal et al., 2005). For two neural models \(M_{\theta _i}\) and \(M_{\theta _j}\), the plain disagreement is the proportion of the training samples on which the models make different predictions, i.e., \(div\_plain(M_{\theta _i},M_{\theta _j})=\frac{1}{N}\displaystyle \sum _{n=1}^N{\delta (M_{\theta _i}(\textbf{x}_n),M_{\theta _j}(\textbf{x}_n))}\), where N is the number of training samples and \(\delta (M_{\theta _i}(\textbf{x}_n),M_{\theta _j}(\textbf{x}_n))=0\) iff \(M_{\theta _i}(\textbf{x}_n)=M_{\theta _j}(\textbf{x}_n)\), and 1 otherwise. We used \(div\_plain\) as the distance for the k-medoid step and selected cluster medoids as base neural models for the multi-headed deep neural fusion (a minimal sketch of this measure is reported after this list). This is one of the classical ensemble diversity metrics, recently tested by Sesmero et al. (2021).

  • A_SEL that applied the dynamic accuracy-based selection described by Puuronen and Tsymbal (2001). According to this strategy, the neural models that fall into the upper half of the overall accuracy interval of candidates were chosen to perform the multi-headed deep neural fusion.
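A minimal numpy sketch of the plain disagreement measure used by D_SEL follows; the toy prediction matrix is hypothetical, and the resulting matrix D is the precomputed distance fed to the k-medoid step.

```python
import numpy as np

def div_plain(pred_i, pred_j):
    """Plain disagreement: fraction of samples on which two models differ."""
    return float(np.mean(pred_i != pred_j))

# preds[m] holds model m's predicted classes on the N training samples (toy values)
preds = np.array([[0, 1, 1, 2],
                  [0, 1, 2, 2],
                  [1, 1, 2, 0]])
eta = len(preds)
D = np.zeros((eta, eta))
for i in range(eta):
    for j in range(eta):
        D[i, j] = div_plain(preds[i], preds[j])
# D can now be passed to a k-medoids implementation as a precomputed distance
```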

Finally, in this experiment, we also explore the idea that model diversity and model accuracy must be balanced in ensemble systems (Shiue et al., 2021) also when an XAI method is used to measure model diversity. Hence, we compare the performance of PANACEA to that of the following two variants of PANACEA:

  • \(\mathsf{A+XAI\_SEL}^{(1)}\) that, similarly to PANACEA, performed the k-medoid clustering step of the neural model candidates based on their feature vector explanations computed with DALEX. For each cluster, it selected the neural model candidate with the highest accuracy in the cluster.

  • \(\mathsf{A+XAI\_SEL}^{(2)}\) that performed the clustering step of PANACEA on the neural models selected by A_SEL, i.e., the neural model candidates that fall into the upper half of the overall accuracy interval of candidates.

In both \(\mathsf{A+XAI\_SEL}^{(1)}\) and \(\mathsf{A+XAI\_SEL}^{(2)}\), the selected base models were considered for the multi-headed deep neural fusion. Specifically, this comparative analysis, comprising \(\mathsf{A+XAI\_SEL}^{(1)}\) and \(\mathsf{A+XAI\_SEL}^{(2)}\), aims at exploring how the performance of the XAI-based pruning mechanism adopted in PANACEA can possibly be improved by accounting for the accuracy of the neural model candidates. A minimal sketch of the XAI-based selection and of these two hybrid variants follows.
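The sketch below contrasts the three XAI-based selections under simple assumptions: the explanation vectors, the accuracies and the fixed number of clusters are hypothetical (PANACEA identifies k automatically), and KMedoids comes from the scikit-learn-extra package.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra, assumed installed

rng = np.random.default_rng(0)
expl = rng.random((100, 20))    # DALEX-style importance vector per candidate (toy)
acc = rng.random(100)           # validation accuracy of each candidate (toy)
k = 9                           # fixed here; PANACEA estimates k automatically

km = KMedoids(n_clusters=k, metric="euclidean", method="pam").fit(expl)

# PANACEA: keep the medoid of each explanation cluster
panacea_idx = km.medoid_indices_

# A+XAI_SEL(1): in each cluster, keep the most accurate candidate instead
sel1_idx = [np.flatnonzero(km.labels_ == c)[acc[km.labels_ == c].argmax()]
            for c in range(k)]

# A+XAI_SEL(2): first apply the A_SEL filter (upper half of the accuracy
# interval), then cluster the surviving candidates and keep their medoids
keep = acc >= (acc.min() + acc.max()) / 2
km2 = KMedoids(n_clusters=k, metric="euclidean", method="pam").fit(expl[keep])
sel2_idx = np.flatnonzero(keep)[km2.medoid_indices_]
```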

Table 8 reports the WeightedF1, MacroF1 and OA of NO_SEL, D_SEL, A_SEL, \(\mathsf{A+XAI\_SEL}^{(1)}\), \(\mathsf{A+XAI\_SEL}^{(2)}\) and PANACEA with \(\eta =100\) neural models. The accuracy metrics were computed on the testing set of each dataset. All the experiments were conducted with elbow \(\epsilon\) reported in Table 4 for all the datasets. We also reported the results achieved with \(\sigma =5\%\) and \(\sigma =10\%\) of the training set size for the neural model candidate generation.

These results show that the use of XAI information, computed through DALEX, to select diverse base neural models helps the ensemble system of PANACEA gain accuracy compared to the naive baseline NO_SEL, which omits any neural model selection, as well as to the state-of-the-art strategies D_SEL and A_SEL. The only exception is observed with NO_SEL and D_SEL, which outperform PANACEA along the MacroF1 when all three methods were run with \(\sigma =5\%\) of the training set size of UNSW-NB15. In any case, the experiment also shows that measuring the neural model candidate diversity based on prediction inconsistencies, as done in D_SEL, may aid in gaining accuracy on minority classes in this specific dataset only.

PANACEA also gains accuracy compared to its variants \(\mathsf{A+XAI\_SEL}^{(1)}\) and \(\mathsf{A+XAI\_SEL}^{(2)}\), which adopt a hybrid selection strategy. The only exceptions are observed with UNSW-NB15 and CICMalDroid20. In UNSW-NB15, \(\mathsf{A+XAI\_SEL}^{(1)}\) outperforms PANACEA along the MacroF1 when both methods were run with \(\sigma =5\%\) of the training set size. In CICMalDroid20, both \(\mathsf{A+XAI\_SEL}^{(1)}\) and \(\mathsf{A+XAI\_SEL}^{(2)}\) outperform PANACEA along the WeightedF1, MacroF1 and OA when the methods were run with \(\sigma =10\%\) of the training set size. In any case, the highest accuracy in CICMalDroid20 is still achieved by running PANACEA with \(\sigma =5\%\) of the training set size. Therefore, this experiment shows that accounting for the model accuracy when using XAI information to measure model diversity only occasionally improves the accuracy of the final ensemble. This motivates our decision to neglect model accuracy in the remaining experiments conducted with the XAI-based model pruning mechanism of PANACEA.

Fig. 2 DIVERSITY of NO_SEL, D_SEL, A_SEL, \(\mathsf{A+XAI\_SEL}^{(1)}\), \(\mathsf{A+XAI\_SEL}^{(2)}\) and PANACEA trained with \(\eta =100\) and \(\sigma \in \{5\%,10\%\}\) of the training set size in NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20

To complete this analysis, we explore the explanation DIVERSITY of the ensemble systems trained in this experiment. Figure 2 compares the DIVERSITY of the neural models fused in the ensemble systems trained by NO_SEL, D_SEL, A_SEL, \(\mathsf{A+XAI\_SEL}^{(1)}\), \(\mathsf{A+XAI\_SEL}^{(2)}\) and PANACEA. The results show that the use of XAI information allows PANACEA to select base neural models that give importance to different features, thus increasing the overall explanation diversity of the ensemble system. So, these results provide empirical evidence that explaining models by estimating the effect of features on model accuracy may be a valuable selection criterion in ensemble systems trained for cybersecurity problems, which often comprise attack categories whose signatures may be based on different features. Notably, all the compared methods achieve higher DIVERSITY in CICIDS17 than in NSL-KDD, UNSW-NB15 and CICMalDroid20. At the same time, all the methods achieve higher WeightedF1 and OA in CICIDS17 than in NSL-KDD, UNSW-NB15 and CICMalDroid20. This result supports the idea that fusing base neural models with high diversity in explanations can help the ensemble system gain overall accuracy.

Fig. 3 Top-15 feature ranking map of the base neural models selected through the clustering step of PANACEA

Finally, we perform a visual inspection of the diversity of the XAI information in the base neural models selected through the clustering step of PANACEA. Figures 3a-3d depict the ranked feature relevance, illustrating the effect of the top-15 input features on the global decisions of the base neural models selected in NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20, respectively. We selected the configurations of PANACEA with \(\sigma =10\%\) of the training set size for NSL-KDD, UNSW-NB15 and CICIDS17, and \(\sigma =5\%\) of the training set size for CICMalDroid20, since these were the best configurations in Table 5. The feature ranking of UNSW-NB15 refers to the principal components extracted as described in Sect. 6.3 to handle the large number of features generated in this dataset through the one-hot-encoding strategy. The feature ranking maps show how diverse input features play prominent roles in explaining the decisions of the distinct base neural models selected for the ensemble fusion in PANACEA.
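A column of such a ranking map can be reproduced, for instance, with permutation-based importance, which is also the rationale behind DALEX's variable-importance computation; the estimator and validation arrays below are hypothetical stand-ins for a selected medoid and the held-out data.

```python
import numpy as np
from sklearn.inspection import permutation_importance

# 'medoid' is one selected base model wrapped as a scikit-learn-style estimator;
# X_val, y_val and feature_names are hypothetical stand-ins
result = permutation_importance(medoid, X_val, y_val, scoring="accuracy",
                                n_repeats=5, random_state=0)
top15 = np.argsort(result.importances_mean)[::-1][:15]
print([feature_names[i] for i in top15])   # one column of the ranking map
```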

For example, in CICIDS17, the input feature “Idle Std” is ranked in first place for the neural model medoids of clusters 4, 6 and 7. The same feature is ranked in tenth place for the neural model medoid of cluster 1, and it is not even in the top-15 feature ranking for the neural model medoid of cluster 9. In CICMalDroid20, the input feature “network_access” is ranked in third place for the neural model medoids of clusters 5, 8 and 11, but it is not even in the top-15 for the medoid of cluster 1. Notably, an exception is the input feature “fs_access(create_write)” in CICMalDroid20, which is a top-relevant feature in explaining the decisions produced by all the neural models. This highlights how this feature remains the most relevant for the decisions produced by all the neural model candidates trained for the CICMalDroid20 problem. Similarly, the feature “dst_host_serror_rate” remains the most relevant for the decisions produced by all the neural model candidates trained for the NSL-KDD problem, while the principal component “pc2” remains the most relevant for the decisions produced by all the neural model candidates trained for the UNSW-NB15 problem. On the other hand, “serror_rate”, which is ranked in third place for the neural model medoids of clusters 2, 3 and 7 of NSL-KDD, is not even in the top-15 for the medoid of cluster 6. Similarly, the principal component “pc1”, which is in second place for the neural model medoids of clusters 2, 3, 4, 6, 7, 8, 9, 10 and 11 of UNSW-NB15, is in seventh place for the medoid of cluster 5, and it is not even in the top-15 for the medoid of cluster 1.

6.4.4 Analysis of the number of neural model candidates \(\eta\)

Fig. 4 NSL-KDD: WeightedF1 (4a and 4g), MacroF1 (4b and 4h), OA (4c and 4i), Train TIME spent in minutes (4d and 4j) and Average Test TIME spent in milliseconds (4e and 4k) of both PANACEA and NO_SEL; the number of clusters k (4f and 4l) of PANACEA. The results were collected by varying \(\eta\) among 25, 50, 75 and 100, and \(\sigma\) between 5% (4a-4e) and 10% (4g-4k) of the training set size

Fig. 5 UNSW-NB15: WeightedF1 (5a and 5g), MacroF1 (5b and 5h), OA (5c and 5i), Train TIME spent in minutes (5d and 5j) and Average Test TIME spent in milliseconds (5e and 5k) of both PANACEA and NO_SEL; the number of clusters k (5f and 5l) of PANACEA. The results were collected by varying \(\eta\) among 25, 50, 75 and 100, and \(\sigma\) between 5% (5a-5e) and 10% (5g-5k) of the training set size

Fig. 6 CICIDS17: WeightedF1 (6a and 6g), MacroF1 (6b and 6h), OA (6c and 6i), Train TIME spent in minutes (6d and 6j) and Average Test TIME spent in milliseconds (6e and 6k) of both PANACEA and NO_SEL; the number of clusters k (6f and 6l) of PANACEA. The results were collected by varying \(\eta\) among 25, 50, 75 and 100, and \(\sigma\) between 5% (6a-6e) and 10% (6g-6k) of the training set size

Fig. 7 CICMalDroid20: WeightedF1 (7a and 7g), MacroF1 (7b and 7h), OA (7c and 7i), Train TIME spent in minutes (7d and 7j) and Average Test TIME spent in milliseconds (7e and 7k) of both PANACEA and NO_SEL; the number of clusters k (7f and 7l) of PANACEA. The results were collected by varying \(\eta\) among 25, 50, 75 and 100, and \(\sigma\) between 5% (7a-7e) and 10% (7g-7k) of the training set size

In this Section, we explore the sensitivity of the performance of PANACEA to \(\eta\). For this sensitivity analysis, we consider NO_SEL as the baseline, since this is the configuration of PANACEA that uses all the trained neural model candidates as base models of the ensemble system. Specifically, this sensitivity analysis aims at exploring how the outcome of the model pruning mechanism of PANACEA is affected by the number of neural model candidates originally learned. Figures 4, 5, 6 and 7 compare the accuracy metrics (WeightedF1, MacroF1 and OA) and the efficiency metrics (Total Train TIME spent in minutes completing the training stage and Average Test TIME spent in milliseconds predicting the class of a test sample) measured for both PANACEA and NO_SEL in NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20, respectively. In addition, they report the number k of base neural models that were automatically identified during the clustering step of PANACEA for the ensemble fusion. The accuracy metrics were measured on the testing sets of all the considered datasets. All the metrics were collected by varying \(\eta\) among 25, 50, 75 and 100, with \(\sigma =5\%\) and \(\sigma =10\%\) of the training set size, and considering the elbow values of \(\epsilon\) reported in Table 4 for all the tested datasets.

The results on WeightedF1, MacroF1 and OA show that PANACEA commonly outperforms NO_SEL across the tested configurations, independently of \(\sigma\) and \(\eta\), with a few exceptions. In fact, NO_SEL performs similarly to (or slightly better than) PANACEA in the configurations of NSL-KDD and CICIDS17 with \(\eta =25\) and \(\sigma =5\%\) or \(\sigma =10\%\) of the training set size, as well as in the configurations of UNSW-NB15 with \(\eta =25\), \(\eta =50\) or \(\eta =75\) and \(\sigma =10\%\) of the training set size. In general, PANACEA tends to gain accuracy as \(\eta\) increases in NSL-KDD, UNSW-NB15 and CICIDS17 with \(\sigma =10\%\) of the training set size, as well as in CICMalDroid20 with \(\sigma =5\%\) of the training set size. Notably, these are the top configurations of \(\sigma\) already identified for these datasets in the analysis illustrated in Sect. 6.4.1. This confirms that the selection of \(\sigma\) may have a critical effect on the accuracy of PANACEA. However, after an appropriate selection of \(\sigma\), the higher the value of \(\eta\), the greater, in general, the accuracy of the ensemble system learned with PANACEA.

These results confirm the expectation that a sufficient number of neural model candidates must be available for choosing the diverse base neural models of the ensemble, in order to allow PANACEA to better exploit both XAI and clustering. Furthermore, the results show that \(\eta =100\) is a plausible choice to allow PANACEA to take advantage of the XAI- and clustering-based step in all the tested datasets.

Further considerations help explain the unusual fact that the MacroF1 of PANACEA decreases in UNSW-NB15 with \(\sigma =5\%\) of the training set size as \(\eta\) increases. To this end, we recall the conclusions drawn from the analysis of the per-class F1 scores of PANACEA (see Sect. 6.4.1), which showed that PANACEA may perform poorly in predicting several minority classes of UNSW-NB15. Hence, this sensitivity study confirms that the ability of PANACEA to recognise minority classes may decrease in UNSW-NB15 as \(\eta\) increases. This highlights the need to strengthen the performance of the fusion approach in imbalanced domains.

Further conclusions can be drawn from the analysis of the number k of base neural models that the clustering step of PANACEA decides to fuse together through the ensemble system. This investigation is coupled here with the analysis of the computation TIME spent in the training stage and the testing stage of both PANACEA and NO_SEL. As expected, the higher the input value of \(\eta\), the higher the estimated value of k. However, k is always significantly lower than \(\eta\) in all the datasets, independently of the input value of \(\eta\). This allows PANACEA to complete both the training and testing stages more efficiently than NO_SEL, since it spends less time both fine-tuning the multi-headed ensemble architecture and using the final trained ensemble system to predict the class of a new cyber sample. This trend in the computation time is observed in all the datasets, independently of the tested values of both \(\sigma\) and \(\eta\).

6.4.5 Analysis of the fusion strategy

Table 9 WeightedF1, MacroF1 and OA of PANACEA and MAJORITY RULE tested with \(\eta =100\) neural model candidates and trained with \(\sigma =10\%\) of the training set size in NSL-KDD, UNSW-NB15 and CICIDS17, and \(\sigma =5\%\) of the training set size in CICMalDroid20

We compare the performance of PANACEA to that of MAJORITY RULE. Both configurations produce ensemble systems with neural models selected by running the k-medoids algorithm on the initial bag of \(\eta\) neural model candidates trained with adversarial training. However, MAJORITY RULE leaves out any additional deep training stage of the ensemble system: it simply takes a majority vote over the predictions of the base neural models to be combined. Table 9 reports the WeightedF1, MacroF1 and OA of both PANACEA and MAJORITY RULE run with the elbow \(\epsilon\) reported in Table 4, \(\eta =100\), \(\sigma = 10\%\) of the training set size for NSL-KDD, UNSW-NB15 and CICIDS17, and \(\sigma = 5\%\) of the training set size for CICMalDroid20. The accuracy metrics were computed on the testing set of each dataset. The results show that the use of a multi-headed deep neural architecture to train the final ensemble commonly contributes to outperforming (or performing equally to) MAJORITY RULE in all the datasets. The only exception is observed in the MacroF1 of UNSW-NB15, where training the ensemble seems to introduce some decision artifacts on minority classes, possibly due to the risk of overfitting on majority classes during the neural fusion. However, PANACEA achieves higher WeightedF1 and OA than MAJORITY RULE also in UNSW-NB15.
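A minimal sketch of the two fusion strategies compared here follows, assuming PyTorch base models that output class logits; the fusion layer sizes are illustrative, and in PANACEA the whole multi-headed architecture would then be trained (fine-tuned) on the training set, whereas MAJORITY RULE needs no training at all.

```python
import torch
import torch.nn as nn

def majority_rule(models, x):
    """MAJORITY RULE: no extra training; return the modal class prediction."""
    votes = torch.stack([m(x).argmax(dim=1) for m in models])  # (k, batch)
    return votes.mode(dim=0).values

class MultiHeadedFusion(nn.Module):
    """PANACEA-style fusion sketch: the k selected base models act as heads
    whose logits are concatenated and mapped to a class by trainable layers."""
    def __init__(self, heads, n_classes, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(heads)
        self.fusion = nn.Sequential(nn.Linear(len(heads) * n_classes, hidden),
                                    nn.ReLU(),
                                    nn.Linear(hidden, n_classes))

    def forward(self, x):
        z = torch.cat([h(x) for h in self.heads], dim=1)
        return self.fusion(z)
```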

6.4.6 Joint analysis of \(\epsilon\), \(\sigma\) and \(\eta\)

Table 10 WeightedF1, MacroF1 and OA of PANACEA by varying \(\epsilon\), \(\sigma\) and \(\eta\) in CICIDS17
Table 11 WeightedF1, MacroF1 and OA of PANACEA by varying \(\epsilon\), \(\sigma\) and \(\eta\) in CICMalDroid20

In this Section, we explore the effect of changing \(\epsilon\), \(\sigma\) and \(\eta\) simultaneously on the accuracy of PANACEA. We perform this sensitivity study on CICIDS17 and CICMalDroid20, the most recent datasets considered in this study, which give a good picture of contemporary network intrusion detection and malware detection problems, respectively. We analyze the accuracy of PANACEA collected by varying \(\epsilon\) from 0.000001 to 0.1 (i.e., in the search space explored by PANACEA with the Elbow method), \(\sigma\) between \(5\%\) and \(10\%\) of the training set size, and \(\eta\) among 25, 50, 75 and 100. \(elbow\epsilon\) denotes the elbow \(\epsilon\) estimated automatically with the Elbow method during the training stage of PANACEA; in particular, \(elbow\epsilon =0.01\) in CICIDS17 and \(elbow\epsilon =0.00001\) in CICMalDroid20 (as reported in Table 4). In addition, we explore the performance of PANACEA with \(\epsilon =0.2\), although the study of Bai et al. (2021) suggested generating adversarial samples with small perturbations obtained with \(\epsilon \le 0.1\). Tables 10 and 11 report the WeightedF1, MacroF1 and OA collected in CICIDS17 and CICMalDroid20 by varying \(\epsilon\), \(\sigma\) and \(\eta\) as described above.
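The joint sensitivity analysis amounts to a grid search over the three parameters, as in the sketch below; train_panacea and evaluate are hypothetical wrappers around the full training stage and the computation of the accuracy metrics.

```python
from itertools import product

eps_grid = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.2]  # includes the 0.2 probe
sigma_grid = [0.05, 0.10]
eta_grid = [25, 50, 75, 100]

results = {}
for eps, sigma, eta in product(eps_grid, sigma_grid, eta_grid):
    # train_panacea and evaluate are hypothetical helpers
    model = train_panacea(X_train, y_train, eps=eps, sigma=sigma, eta=eta)
    results[(eps, sigma, eta)] = evaluate(model, X_test, y_test)  # three metrics

best_cfg = max(results, key=lambda cfg: results[cfg]["OA"])
```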

The results show that the accuracy of PANACEA depends on parameters \(\epsilon\), \(\sigma\) and \(\eta\). Notably, the highest accuracy is achieved with elbow \(\epsilon\), \(\sigma =10\%\) and \(\eta =100\) in CICIDS17, and elbow \(\epsilon\), \(\sigma =5\%\) and \(\eta =100\) in CICMalDroid20, respectively. This conclusion provides some empirical evidence of the effectiveness of the approach used in the training stage of PANACEA for the automatic selection of \(\epsilon\).

On the other hand, this sensitivity study highlights, once again, that the accuracy of PANACEA varies with \(\sigma\) and \(\eta\). In fact, the results show that, although the highest accuracy is obtained with the elbow \(\epsilon\), not all configurations run with the elbow \(\epsilon\) achieve the highest accuracy values in the sensitivity study. Specifically, the best \(\sigma\) changes with the data domain (\(10\%\) with the network traffic data recorded in CICIDS17 and \(5\%\) with the Android apps recorded in CICMalDroid20). We also note some variability in the results obtained by varying \(\eta\). The configurations achieving the highest accuracy use \(\eta =100\) in both datasets; however, there are configurations of \(\epsilon\) and \(\sigma\) where better accuracy can be obtained by increasing \(\eta\), and others where better results can be obtained by reducing \(\eta\). Hence, this study shows that the parameter selection may be critical, and exploring the characteristics of the data that may provide guidelines for the automatic set-up of \(\sigma\) and \(\eta\) requires further investigation in the future.

6.4.7 Analysis of related methods

Table 12 OA and training TIME (in minutes) of PANACEA vs. related methods in the multi-class classification of the testing set of NSL-KDD, UNSW-NB15, CICIDS17 and CICMalDroid20

In this Section we compare the performance of PANACEA to that of relevant methods, selected from the recent state of the art in the cybersecurity literature. In particular, we considered the following methods:

  • Andresini et al. (2022), Wang et al. (2020), Tang et al. (2020) and Zhao et al. (2022), which combined Deep Neural Networks and XAI.

  • Al-Essa et al. (2022), which combined Adversarial Learning and XAI.

  • Andresini et al. (2021b) and Tang et al. (2020), which combined Autoencoder and Deep Neural Networks.

  • Bedi et al. (2021), Bedi et al. (2020) and Guo et al. (2017), which adopted an Ensemble strategy.

  • Gao et al. (2020), Vinayakumar et al. (2019), Isra and Najwa (2021) and Kasongo and Sun (2020), which used Deep Neural Network architectures.

  • Lopez-Martin et al. (2017), which combined Autoencoder and Multi-Layer Perceptron.

  • Ma and Shi (2020), which combined Deep Reinforcement Learning and Data Augmentation.

  • Yin et al. (2020) and Caminero et al. (2019), which combined Deep Neural Networks and Adversarial Learning.

  • Gao et al. (2019), Isra and Najwa (2021) and Kasongo and Sun (2020), which combined Deep Neural Networks and Feature synthesis.

The related methods that integrate an ensemble learning strategy (i.e., Bedi et al. (2020), Lin et al. (2021) and Guo et al. (2017)), an XAI technique (i.e., Al-Essa et al. (2022), Wang et al. (2020), Andresini et al. (2022) and Tang et al. (2020)) or an adversarial learning approach (i.e., Al-Essa et al. (2022), Yin et al. (2020) and Caminero et al. (2019)) are the closest to PANACEA. The codes of Al-Essa et al. (2022), Andresini et al. (2022) and Andresini et al. (2021b) are publicly available to reproduce the results reported in this study. The code of the method described by Guo et al. (2017) is publicly available for binary classification; we adapted it to the multi-class problems considered in this study. In addition, this method requires that the input space of the ensemble include both categorical and numeric features. Hence, it was tested on NSL-KDD and UNSW-NB15, the only two datasets of this study that include both numeric and categorical features. The accuracy results of the remaining related methods were taken from the reference papers, as their code was not publicly available for repeating the experiments. However, the comparative analysis is fair, as we compared methods evaluated on the same training and testing set division of each dataset. In particular, the divisions of NSL-KDD and UNSW-NB15 were made publicly available by Tavallaee et al. (2009) and Moustafa and Slay (2015), respectively. Table 12 collects the OA and the training TIME (in minutes) of the compared methods. We considered the OA metric for the comparative study as it was provided in all the reference studies, while we collected the training TIME for the methods whose code was publicly available to run the experiments.

The results show that PANACEA outperforms most of the related methods, including the ensemble-based methods recently evaluated on NSL-KDD (Bedi et al., 2020; Guo et al., 2017) and UNSW-NB15 (Lin et al., 2021; Guo et al., 2017). PANACEA also outperforms the deep neural method described by Al-Essa et al. (2022), which integrates XAI and adversarial training. The only exception is observed in CICMalDroid20, where the method of Al-Essa et al. (2022) achieves the highest OA, with PANACEA as the runner-up (0.89 vs 0.90).

Final considerations concern the training TIME spent completing the training stage. PANACEA completed the learning stage by spending more computation time than the methods described by Andresini et al. (2022), Andresini et al. (2021b) and Guo et al. (2017), but less computation time than the method described by Al-Essa et al. (2022). The method described by Andresini et al. (2021b) is a deep metric learning method that uses neither XAI nor adversarial training nor ensemble learning. The method in Andresini et al. (2022) uses deep intrinsic explanations (i.e., attention) with neither adversarial training nor ensemble learning. In contrast, the method described in Guo et al. (2017) performs ensemble learning by fusing a factorization approach with deep learning, while the method described by Al-Essa et al. (2022) performs both adversarial learning and XAI without ensemble learning. In particular, the method of Al-Essa et al. (2022) performs adversarial training with FGSM, but it uses local XAI with SHAP (Lundberg & Lee, 2017). So, the longer computation time spent completing the learning stage in Al-Essa et al. (2022) can be mainly explained by the use of a local XAI approach.

6.4.8 Analysis of an example

We complete this analysis by illustrating an example that shows how the ensemble model of PANACEA gains accuracy in a cyber-threat detection task compared to the single model of BASELINE. For this purpose, we consider an R2L sample of the test set of NSL-KDD that was wrongly classified by BASELINE in the class Normal, while it was correctly recognised in the class R2L by PANACEA. We analyse this sample by using SHAP (Lundberg & Lee, 2017), a local XAI algorithm that measures the effect of each input feature on the assignment of a sample to a class by a neural model. Figure 8 shows the five most important input features identified by SHAP for assigning the sample to the class R2L with the models learned by BASELINE and PANACEA. Recall that only PANACEA predicted this sample in the class R2L.
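A sketch of this local analysis follows, using the model-agnostic KernelExplainer of the shap package; the prediction function, background sample, test sample and class index are hypothetical stand-ins, and with this classic API shap_values is a list with one array of feature effects per class.

```python
import numpy as np
import shap

# predict_proba, background, x_r2l, r2l_class and feature_names are hypothetical
explainer = shap.KernelExplainer(predict_proba, background)
shap_values = explainer.shap_values(x_r2l)           # one test sample, all classes

effects = np.ravel(shap_values[r2l_class])           # effects towards class R2L
top5 = np.argsort(np.abs(effects))[::-1][:5]
print([feature_names[i] for i in top5])
```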

Fig. 8 Top-5 input features considered by both BASELINE (8a) and PANACEA (8b) to recognize an R2L attack in the class R2L

Both BASELINE and PANACEA share the same top-3 features, i.e., service_http, service_ftp_data and dst_host_srv_count. Notably, these three features are also recognised as important to detect R2L attacks in Sabhnani and Serpen (2003). The input feature in fourth place of the feature ranking of PANACEA is protocol_type_tcp, which does not appear in the feature ranking of BASELINE. Wang et al. (2020) report that the simultaneous use of the TCP protocol and the FTP service is to be considered a symptom of a possible Warez Master attack in network traffic. Warez Master is a subcategory of R2L attacks, where attackers exploit a system bug associated with FTP to send packets of illegal software to a target host (Wang et al., 2020); we note that FTP is a service based on the TCP protocol. Therefore, this example shows how the ensemble model of PANACEA manages to bring out feature patterns useful for the recognition of attack classes that are often ignored by the single model of BASELINE. These conclusions are also supported by the study of Sabhnani and Serpen (2003), which identifies both the service_ftp_data and protocol_type_tcp features as the most important features to detect R2L attacks. In addition, BASELINE, unlike PANACEA, identifies serror_rate as one of the most relevant features for recognizing the sample as an R2L attack. However, neither Sabhnani and Serpen (2003) nor Wang et al. (2020) identify this feature as one of the most prominent features for this type of attack.

In short, we consider that the emergence of protocol_type_tcp as an important input feature instead of serror_rate explains the ability of PANACEA to correctly recognise the considered R2L sample and, in general, to outperform BASELINE in the recognition of R2L attacks (as shown in Table 6).

7 Conclusions

In this paper, we have presented a deep learning method for multi-class classification of cyber-data. The proposed method trains an ensemble of base neural models, whose weights are initialised with an adversarial training strategy. We use an XAI-based approach to measure the diversity of neural model candidates and increase the diversity of the neural models selected to be fused together through the ensemble system. We adopt a neural ensemble architecture that allows us to share knowledge among selected base neural models. Extensive experimentation was performed to show the effectiveness of the proposed ensemble system in three benchmark network intrusion detection problems and one malware detection problem.

One limitation of the proposed method is the absence of transparency in the decisions finally produced through the ensemble system. Despite the use of XAI for increasing diversity in the base neural models selected for the ensemble fusion, there is no specific mechanism that explains the final decisions of the ensemble system. A research direction is to continue the exploration of XAI techniques, to provide interpretable insights into the input features that mainly condition the global ensemble decisions.

Another limitation is that the performance of the final ensemble may depend on the number of adversarial samples produced for the adversarial training stage of each neural model candidate. The experiments showed that the appropriate number of adversarial samples may depend on the characteristics of the cyber-data. The sensitivity study also showed that some variability may occur in the accuracy results due to the number of neural model candidates generated for the ensemble pruning and fusion. So, how the input features may provide useful guidelines for automating the selection of the amount of adversarial samples considered for the candidate generation and, possibly, of the number of candidates to generate requires further investigation in the future.

An additional direction for future work is the extension of the proposed approach to deal with class imbalance in the collected cyber-data. For example, we intend to explore the performance of techniques that re-balance the classes during the learning stage (e.g., during the adversarial training stage).