1 Introduction

In March 2019, only two months after a similar attack on Altran Technologies, the LockerGoga ransomware was used against Norsk Hydro, the largest aluminum manufacturer in Europe, employing over 35,000 people at sites in more than 50 countries across the globe. The attack caused a serious decrease in production and problems with the execution of ongoing contracts. The losses were estimated at millions of dollars per day, with the total expected to reach hundreds of millions of dollars. The attack occurred on 18/19 March 2019, mostly impacting the infrastructure in Norway and, to a lesser extent, other countries. It resulted in the shutdown of the global Norsk Hydro network.

The attack affected work at the offices (causing, for example, problems with order documentation) as well as industrial manufacturing, where, among other issues, the manufacturing drivers had to be uploaded manually via USB drives.

The attack was a cyber-criminal act committed for financial gain. Before starting data encryption, the ransomware had turned off part of the system's security mechanisms as well as the data backup processes. All the local user passwords were changed.

The ransom was not paid, and the recovery of data from backups took months. As of March 2019, the LockerGoga ransomware was not detected by 67 state-of-the-art antivirus engines. Experts noted that better anomaly detection systems could have prevented the incident.

In June 2019, a vulnerability in the Amazon Ring Video Doorbell was discovered. The flaw in the product’s security made it possible to connect to the home WiFi and possibly exploit other connected devices [2]. A similar issue was discovered in the Amazon Blink XT2 security camera. The security flaw, which was discovered in August 2019, allowed unauthorised users to view the footage from the cameras and listen to their audio. In fact, the flaw made it extremely easy to gain root access to the device [16].

The 'Attack Landscape' report illustrates that a number of network attacks are carried out over Telnet and Secure Shell (SSH), with a high probability of targeting IoT devices [7].

Therefore, in this paper, we propose an innovative method to detect anomalies in IoT environments.

The major contribution of this work is a time window embedding solution combined with a transformer-based classification scheme.

The remainder of this paper is structured as follows: in Sect. 2, the related work is overviewed; in Sects. 3-5, the proposed method is described; the experimental setup is presented in Sect. 6, while the results are reported in Sect. 7. Conclusions are drawn thereafter.

2 Related work

In the literature, there are two approaches to intrusion detection, namely signature-based and anomaly-based ones. Typically, when the attack is deterministic, one can develop a signature that allows for its detection. However, nowadays attackers use various obfuscation techniques to evade such detection mechanisms. Therefore, the cybersecurity community is investing its efforts in anomaly detection systems. These turn out to be more effective in detecting new and unknown (so-called 0-day) cyber-attacks [21].

In [3], the authors performed a survey of current tendencies in cybersecurity and concluded that two major trends emerge: one is that old, proven methods are still in use in many applications; the other is that machine-learning-based (ML) approaches are becoming increasingly prominent. Furthermore, [17] points out that ML is now used on both the malware and the security side.

When it comes to network traffic analysis, two popular approaches are used by experts from the cybersecurity domain. One is based on deep packet inspection [4], while the other relies on network flow analysis [11]. One of the most popular protocols for network flow data collection is NetFlow [6]. That kind of data is often captured by Internet service providers for auditing and performance monitoring purposes. NetFlow samples do not contain much sensitive data and are therefore widely available. However, the disadvantage is that such samples do not contain the raw content of network packets. Such details are valuable and can improve the effectiveness of malware detection, but they are rarely available because of the encryption often utilized by the endpoint terminals.

Current research shows that network flow data can be effectively analyzed using various machine-learning techniques such as unsupervised clustering [8], RandomForests (RF) [22], or deep learning [19]. The authors of [5] present a range of deep neural network topologies and test the influence of hyperparameter setups on the accuracy of the solution. In turn, in [12], a stream processing framework capable of employing a range of ML algorithms for intrusion detection is presented.

Obviously, the different methods vary in the way they process the NetFlow data. For instance, in [10], the authors proposed a solution called CCDetector, which uses a state-based behavioral model of the Command and Control channels. The authors adapt Markov Chains to model malware behavior and to detect similar traffic in unknown real networks. The difference from BClus (and from our approach) is that, instead of analyzing the complete traffic of an infected computer as a whole, the authors separate each individual connection from each IP address and treat it as an independent connection. The results obtained with this method are very promising; however, one of the concerns is the complex and time-consuming learning phase.

In contrast, in [19], the authors have adopted recurrent neural networks (RNN) with long short-term memory (LSTM) units on top of NetFlow data. In addition, they used a flexible distributed architecture to handle the curation of large amounts of data.

An interesting approach, which maps the NetFlow data to an image representation, has been presented in [13]. In order to construct the images, the authors have used techniques such as feature correlation analysis and correlation matrices. The images have been analyzed with a convolutional neural network (CNN) in order to detect intrusions. According to the authors, this method achieves high accuracy.

A CNN for flow-based malware detection is also proposed in [20]. The authors argue that current detection systems are overreliant on certain network features, like the port number, which could introduce a blind spot in the system. Thus, they calculate 35 features with the use of Netmate to fully express the state of the network and provide those to the CNN and other ML algorithms.

The authors of [18] present a deep network model capable of automatic feature extraction, which takes time-related characteristics into consideration. To achieve that a GRU network along with a multilayer perceptron (MLP) is used. The authors also test a network with LSTM cells.

The authors of [14] evaluate autoencoders (sparse, denoising, contractive, convolutional), LSTM, and CNN for network intrusion detection. Autoencoders obtain the latent representation of the feature set. When the hidden layer has fewer neurons than the input/output layers, it is called a bottleneck, discriminative, coding, or abstraction layer. Using such a bottleneck forces the topology to acquire the most significant features.

In [9], instead of flow classification, a flow prediction approach is used. In order to achieve this, the authors combine an RNN (with gated recurrent units) with a so-called linear regression layer, which allows for producing predictions in a similar fashion as auto-regressive integrated moving average (ARIMA) models do with time series.

In [1], the authors used the auto-regressive fractionally integrated moving average (ARFIMA) model and proposed the Hyndman-Khandakar algorithm to estimate the polynomial parameters and the Haslett and Raftery algorithm to estimate the differencing parameters for network anomaly detection.

3 Proposed method

The proposed solution (see Fig. 1) captures network flows (as streams), calculates feature vectors over a predefined time window, and provides these vectors to a binary classifier, which eventually produces the detection output (benign for normal traffic or anomaly for traffic containing suspicious patterns). In the next sections, the details of each of these processing steps are provided. First, an overview of the input data is given; then, the effective methods for feature extraction are elaborated upon. Finally, a brief description of the classification methods incorporated in this work is provided.

Fig. 1 Tumbling windows - example of window statistics embedding

3.1 Flow-based data acquisition

Conceptually, in this approach, the data are collected from the network in the form of communication flows traversing devices such as switches, routers, or hosts. This kind of data captures aggregated network properties and statistics. From the architectural point of view, the network traffic going through the flow-enabled devices is collected and later sent to collectors - the network elements that store it and keep it for later analysis by the operator. In particular, network flows are often used by network administrators for auditing purposes. A single flow aggregates characteristics such as:

  • incoming and outgoing number of bytes

  • IP addresses taking part in the communication

  • utilized source and destination ports

  • utilized type of protocol (e.g., Transmission Control Protocol (TCP) or User Datagram Protocol (UDP))

These characteristics (e.g., the number of bytes sent and received) describe the packets that have been sent by a specific source address to a specific destination address. From such data, it should be possible to identify certain patterns of anomalous behavior of network nodes. Some of these patterns may be related to malware infection or may help the network administrator to identify adversaries.
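For illustration, a single flow record of this kind could be represented as in the following minimal Python sketch; the field names are illustrative assumptions rather than the actual NetFlow schema or the exact attributes used in this work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """Illustrative, simplified view of a single aggregated network flow."""
    timestamp: float   # start time of the flow (epoch seconds)
    src_ip: str        # source IP address
    dst_ip: str        # destination IP address
    src_port: int      # source port
    dst_port: int      # destination port
    protocol: str      # e.g., "TCP" or "UDP"
    bytes_in: int      # number of incoming bytes
    bytes_out: int     # number of outgoing bytes
```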

3.2 Time window embedding with probabilistic data structures

The rationale behind the proposed embedding is to encode a network flow using only its nearest neighborhood in the time domain. This approach allows us to capture some short-term malicious behavior of specific network elements and nodes.

In the proposed approach, we calculate the statistical properties of the group of flows that have been collected for a specific source IP address within short, fixed-length time spans called time windows.

As presented in the previous section, a single flow exhibits various characteristics describing the two-way communication (e.g., number of bytes, destination IP, etc.). For each of these characteristics, it is possible to calculate various statistical properties such as the mean, median, and min/max values. In order to make it clear what is being calculated for each of those characteristics, we have included Table 1. Overall, this leads to a situation where the traffic produced by a single IP address within the considered tumbling time window is described by 37 values.

Table 1 Overview of all the features extracted from network flows
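To make the tumbling-window aggregation concrete, the following sketch (building on the FlowRecord example above) groups flows by source IP within fixed-length windows and computes a few statistics; the feature set shown is a simplified illustration and does not reproduce the exact 37 features of Table 1.

```python
from collections import defaultdict
from statistics import mean, median

WINDOW_SECONDS = 180  # 3-minute tumbling windows, as used in the experiments

def window_embeddings(flows):
    """Compute per-(window, source IP) statistics over a list of FlowRecord objects."""
    buckets = defaultdict(list)
    for f in flows:
        window_id = int(f.timestamp // WINDOW_SECONDS)
        buckets[(window_id, f.src_ip)].append(f)

    embeddings = {}
    for key, group in buckets.items():
        out_bytes = [f.bytes_out for f in group]
        embeddings[key] = [
            len(group),                        # number of flows in the window
            sum(out_bytes),                    # total outgoing bytes
            mean(out_bytes),                   # mean outgoing bytes per flow
            median(out_bytes),                 # median outgoing bytes per flow
            min(out_bytes),                    # minimum outgoing bytes
            max(out_bytes),                    # maximum outgoing bytes
            len({f.dst_ip for f in group}),    # distinct destination IPs
            len({f.dst_port for f in group}),  # distinct destination ports
        ]
    return embeddings
```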

In general, counting the number of flows and/or accumulating the sum of inbound and outbound packets may be trivial, but the situation is different when it comes to distinct counting or finding the most frequent element (e.g., destination port) in a stream of data. The straightforward approach would be to maintain a dynamic list: whenever a new element is retrieved from the stream, one must scan the entire list and check whether that element is already there; if not, the list needs to be resized and the new element added. Moreover, there is another level of complexity when we want to merge the results obtained from two concurrent processes that perform distinct counting.

There is a class of data structures known as probabilistic (or sketch) data structures. These have the ability to describe remarkably large sets with sub-logarithmic or constant space complexity. This implies that there is no need to scale up the data processing system when it undergoes the transition from thousands to millions, or even billions, of records that need to be analyzed.

Probabilistic data structures rely on various mechanisms to compress data, and often, these mechanisms may cause them to contain inaccurate information.

However, this inaccuracy should not have a strong impact on the detection part. This assumption is based on the observation that the classifiers can handle such changes to some extent and still return the correct decision. It must be noted that the changes discussed here are within a range of 1-2% for one of the features building the vector.

Probabilistic data structures have several advantages. First of all, the size of such structures grows significantly slower with respect to the input data; in many cases, it is orders of magnitude smaller. Moreover, it is also possible to trade off the accuracy of the estimate against the size of the data structure. They are naturally suited for measuring network traffic, which has the form of streaming data, where each item in the stream needs to be analyzed quickly and needs to update a data structure that summarises some properties (e.g., the number of distinct IP addresses or the most frequently used service). A substantially useful property of probabilistic data structures is the ability to be merged: when the stream is split into two parts and a summary is calculated separately over each part, merging the two summaries gives the same result as if the summary had been calculated over the entire (original) stream. As a result, probabilistic structures are highly parallelizable and suitable for distributed computing platforms (e.g., Hadoop, Spark, Druid, etc.).

4 Frequent items and distinct counting - the problem overview

In order to calculate the most frequent destination port or destination host (or even a concatenation of both) originating from a specific IP address, one may use a data structure such as a hash table. In such a case, a new item is put in the hash table and its counter is initialized; whenever the entry already exists in the hash table, the counter is simply incremented. However, such an approach may quickly become impractical when the amount of input data is significant, for two reasons. Firstly, along with the growing size of the input data, the hash table will grow as well, and eventually its capacity will exceed the amount of available RAM. Secondly, collisions in the hash table are handled as a linked list: whenever a new item is hashed to a bucket that is already taken, it is appended (linked) after the existing one. Therefore, as the list becomes longer, the access time for such elements in the hash table becomes substantially longer as well. Additionally, the dynamic allocation of memory (for a new element) is time consuming. In this section, the most important sketch data structures that have been used in this research are described.

4.1 Count-min data structure

The count-min (CM) data structure allows for counting occurrences of items of different types, e.g., how many times a specific IP address has contacted port 8080. Formally, CM is an array of width w and depth d, CM[1, 1] ... CM[d, w]. It uses a set of d hash functions:

$$\begin{aligned} h_1, \dots , h_d : \{1, \dots , n\} \rightarrow \{1, \dots , w\} \end{aligned}$$
(1)

which belong to a random pairwise-independent family of functions. The width w and depth d are chosen based on the error rates allowed by the user and are calculated as:

$$\begin{aligned} w = \left\lceil \frac{e}{\epsilon } \right\rceil , \qquad d = \left\lceil \ln \frac{1}{\sigma } \right\rceil \end{aligned}$$
(2)

where \(\epsilon\) (an acceptable error in estimation) is the error factor and \(\sigma\) is the error probability. At the beginning, the CM array is initialized with 0 values. Each time the count needs to be updated for a specific value x, the hash functions are evaluated and reduced modulo the width w, which yields the column number \(col=h_i(x) \% w\). Finally, the cell at position (i, col) in the CM array is incremented by one. A similar approach is used when querying the data structure: we take the columns obtained from the hash functions and return the minimum of the corresponding cells. The visual representation of the concept behind the count-min data sketch is presented in Fig. 2.
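As an illustration, a minimal count-min sketch following the update and query rules described above could be implemented as follows; salted blake2b digests stand in for the pairwise-independent hash family, and the merge operation illustrates the mergeability property discussed in Sect. 3.2. This is a sketch under these assumptions, not the implementation used in our experiments.

```python
import math
import hashlib

class CountMin:
    """Minimal count-min sketch sized from the error bounds in Eq. (2)."""

    def __init__(self, epsilon=0.01, sigma=0.01):
        self.w = math.ceil(math.e / epsilon)       # width
        self.d = math.ceil(math.log(1.0 / sigma))  # depth (number of hash rows)
        self.table = [[0] * self.w for _ in range(self.d)]

    def _col(self, row, item):
        # One hash function per row, obtained by salting a single digest.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.w

    def update(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._col(row, item)] += count

    def query(self, item):
        # The minimum over the rows is an upper-bound estimate of the true count.
        return min(self.table[row][self._col(row, item)] for row in range(self.d))

    def merge(self, other):
        # Element-wise addition merges sketches built over two parts of a stream.
        assert self.w == other.w and self.d == other.d
        for r in range(self.d):
            for c in range(self.w):
                self.table[r][c] += other.table[r][c]
```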

Fig. 2 Count-min data structure - overview of the architecture

4.2 HyperLogLog sketch

HyperLogLog (HLL) belongs to a family of algorithms that aim at estimating the cardinality of a dataset. It relies on the probabilistic counting method. Assuming that we have a large dataset with duplicated entries, we can evenly distribute the elements of the dataset using a hashing function and estimate the cardinality from the hashed values. The common approach is to count the leading zeros in the binary representation of the hashed values. The probability of observing k leading zeros is equal to \(\frac{1}{2^k}\). In other words, if we denote by \(p(v_i)\) the number of leading zeros in the hashed value \(v_i\), we can estimate the cardinality as \(n=2^R\), where:

$$\begin{aligned} R=\max (p(v_1),p(v_2),\dots ,p(v_m)) \end{aligned}$$
(3)

The visual representation of the HyperLogLog data sketch is presented in Fig. 3.

Obviously, a single estimator of that kind is subject to high variance. Therefore, the common approach is to use several estimators and to average the results. This can be achieved using several independent hash functions.
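The following sketch illustrates this estimation scheme in its simplest form: several salted hash functions act as independent estimators, the maximum number of leading zeros is tracked per estimator, and the resulting estimates are averaged. It is a simplified probabilistic-counting illustration and deliberately omits the bucketing and bias-correction steps of a full HyperLogLog implementation.

```python
import hashlib

def _leading_zeros(value, bits=64):
    """Number of leading zeros in the bits-bit binary representation of value."""
    return bits - value.bit_length()

def estimate_cardinality(items, num_estimators=16):
    """Average several leading-zero estimators to approximate the number of distinct items."""
    max_zeros = [0] * num_estimators
    for item in items:
        for i in range(num_estimators):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            value = int.from_bytes(digest, "big")
            max_zeros[i] = max(max_zeros[i], _leading_zeros(value))
    # Each estimator suggests 2**R distinct items; averaging reduces the variance.
    return sum(2 ** r for r in max_zeros) / num_estimators
```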

Table 2 IoT23 dataset - scenarios setup (x and – indicate training and validation sets, respectively)
Fig. 3 Architecture of HyperLogLog data structure

Fig. 4 The architecture of the proposed transformer-based classifier

5 Classification with transformer

Transformers have been proposed in the area of natural language processing (NLP). This is a relatively novel architecture that aims at solving sequence-to-sequence problems while handling long-range dependencies. The original transformer architecture consists of the so-called encoding and decoding parts. In this research, only the encoder is used (see Fig. 4), since the aim is to encode the behavior of specific network elements using the latent representation produced by the encoder part of the transformer architecture.

The data are ingested into the transformer using the time window embedding technique described in the previous section. In general, several network flows belonging to a specific time window and related to a specific IP address are encoded using the technique leveraging probabilistic data sketches. This operation results in vectors of a fixed length. The transformer works with sequences, which in our case describe the behavior of a specific network element over the defined time period. The sequence is composed by placing several embedding vectors one after another.

Next, the sequence of vectors goes through positional encoding, which allows us to capture the position of each vector in the input sequence. This is particularly useful in our setting, because an attacker usually executes a series of actions in order to carry out a successful cyberattack.

Afterward, the positionally encoded input reaches the multi-headed attention layer. From a high-level perspective, this layer allows the model to look at other positions in the input sequence for clues that can improve the final detection. For example, an infected machine is likely to try to contact the botmaster shortly after it has been infected with malware.

In the next step, the information coming from the self-attention is normalized and goes to the feed-forward layer, the output of which is normalized as well. As it is depicted in the diagram, there are two residual connections: one around the self-attention and the other around the feed-forward layer.

Because the output of the transformer layer contains one vector for each element of the input sequence, we use an average pooling layer, which takes the mean across all the elements of the sequence. Finally, on top of the entire model, we use a two-layer feed-forward network with two outputs, one indicating a benign sequence and the other an anomalous sequence.
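A minimal PyTorch sketch of this architecture is given below. The layer sizes, number of heads, and number of encoder layers are illustrative assumptions rather than the exact configuration used in the experiments; only the overall structure (input projection, sinusoidal positional encoding, transformer encoder, average pooling, and a two-layer feed-forward head with two outputs) follows the description above.

```python
import math
import torch
import torch.nn as nn

class FlowTransformerClassifier(nn.Module):
    """Transformer-encoder classifier over sequences of time window embeddings."""

    def __init__(self, n_features=37, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)

        # Fixed sinusoidal positional encoding.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Two-layer feed-forward head with two outputs (benign / anomaly).
        self.head = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):                # x: (batch, seq_len, n_features)
        h = self.input_proj(x) + self.pe[: x.size(1)]
        h = self.encoder(h)              # one output vector per sequence element
        h = h.mean(dim=1)                # average pooling across the sequence
        return self.head(h)              # logits for benign vs. anomaly
```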

Table 3 Effectiveness comparison for different classifiers and scenarios
Fig. 5 Comparison of the f1-score achieved for various algorithms and scenarios

6 Experiments

6.1 The goal of the experiments

The goal of the experiments is to compare the proposed approach with state-of-the-art methods. In this paper, we have considered two different evaluation scenarios. Firstly, we investigate how the proposed time window embedding technique operates along with the transformer-based anomaly detection architecture. In that regard, we have compared our approach with RandomForest (with 500, 100, and 10 trees, respectively), vanilla REPTree (a decision/regression tree), and an AdaBoosted version of REPTree. Secondly, we investigate to what extent the proposed model is able to generalize to unknown malware infection scenarios. In that regard, we have considered various test-case scenarios where the models are trained and evaluated on malicious samples that have been recorded for different attacks (or infections).

6.2 Aposemat IoT-23 dataset

IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It contains 20 malware captures executed on IoT devices and three captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labelled IoT malware infections and benign IoT traffic for researchers to develop machine-learning algorithms. The dataset and its research are funded by Avast Software, Prague. In addition to easier reproducibility of the study, using a benchmark dataset allows various privacy issues to be handled, as outlined in [15].

6.3 Experimental protocol

The dataset was split into training and testing parts, as explained in Table 2. Different scenarios are used, in which different parts of the original dataset have been employed. We followed such an approach in order to prove that the proposed method can generalize well to unknown malware families. In that regard, we have used different scenarios (malware families) for training and testing the models. Nonetheless, some malware names appear both in training and testing; however, these samples were recorded in different network captures, which concern different contexts (different network elements, different IoT devices, etc.).

Additionally, for the training part of the dataset, fivefold cross-validation was adopted. This allowed us to calculate the standard deviation of the measured performance characteristics. Moreover, it also enables a discussion of the significance of the differences obtained for the various approaches. For that purpose, the t-test statistical hypothesis test was used.
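As an illustration of this protocol, the sketch below collects per-fold F1 scores with stratified fivefold cross-validation and compares two models with a t-test. It assumes numeric feature matrices X and binary labels y as NumPy arrays; since the exact t-test variant is not fixed here, an independent two-sample test is shown as one possible choice.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cv_f1_scores(model, X, y, n_splits=5, seed=42):
    """Return per-fold F1 scores from stratified cross-validation (binary labels assumed)."""
    scores = []
    for train_idx, val_idx in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return np.array(scores)

# Hypothetical usage with two baselines:
# scores_a = cv_f1_scores(RandomForestClassifier(n_estimators=100), X, y)
# scores_b = cv_f1_scores(RandomForestClassifier(n_estimators=10), X, y)
# t_stat, p_value = ttest_ind(scores_a, scores_b)
```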

6.4 Evaluation metrics

Before applying the various machine-learning algorithms, the raw network flows are processed in order to produce the time window embedding vectors, as detailed in the previous sections. The procedure for calculating the metrics is as follows:

1. communication flows are aggregated into time windows (here, we have used 3-minute time windows);

2. for each time window, the embedding vectors are calculated;

3. the ground-truth labels of the communication flows are compared against the predicted ones, and the TP, TN, FP, and FN counts (true and false positives and negatives) are measured;

4. finally, Recall, Precision, and the F-measure are estimated and reported (a minimal sketch of this step is given below).
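The last step can be sketched as follows; this is a straightforward computation from the error counts and is not tied to any particular library.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F-measure from the measured error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```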

7 Results

The results for the classification part are presented in Table 3. We have compared the proposed transformer-based approach against various popular machine-learning techniques, including a decision tree and classifier ensembles (AdaBoost and the well-known RandomForest).

The t-test statistical hypothesis test was used to validate that the results obtained by the proposed approach are significantly different from those of the other compared methods. In that regard, the values following the \(\pm\) symbol (Table 3) indicate the standard deviation.

It must be noted that we have used different scenarios for training and testing the models. This proves the approach can generalize well to unknown malware families.

Table 3 shows that these baseline methods perform quite well. However, the transformer-based approach outperforms the other methods in most of the considered scenarios. This is particularly visible for the f1-score metric, presented in Fig. 5. It must be noted that the transformer-based approach allowed us to achieve remarkably good results for the fourth scenario, where less than 40% of the original dataset is used.

The second-best results are reported for RandomForest. The experiments show that increasing the number of trees above 100 does not improve the effectiveness much. The Adaptive Boosting (AdaBoost) ensemble of REPTrees performs well for the first and second scenarios; for scenarios 3 and 4, this method stays a bit behind RandomForest.

An interesting observation is that the f1-score does not change much between scenarios 1, 2, 3, and 4 for the proposed method. On the other hand, one can observe quite significant fluctuations for the other approaches. In particular, the change is much more pronounced for the Reduced Error Pruning Decision Tree (REPTree) and Adaptive Reduced Error Pruning Decision Tree (AdaREPT) classifiers.

8 Conclusions

In this paper, we propose an innovative anomaly detection method that utilizes a time window embedding solution capable of efficiently processing massive amounts of data while maintaining a low memory footprint. The core anomaly detection is based on the transformer's encoder unit followed by a two-layer feed-forward neural network. In the paper, we have formally evaluated various machine-learning schemes in order to compare them with the proposed approach and to discuss their effectiveness in the IoT-related context. The proposal is supported by detailed experiments conducted on the recently published Aposemat IoT-23 dataset. Our experiments show that the proposed approach leveraging transformer-based classification performs best in most of the considered scenarios.