1 Introduction

COVID-19 (2019) is a highly infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Compared to previous epidemic diseases like Ebola (2014) and SARS (2002), COVID-19 has a much lower fatality rate but spreads much faster. As of now, approximately 225 countries and territories have reported more than 591 million cases of COVID-19. Due to this, the World Health Organization (WHO) has declared COVID-19 a global pandemic. Further, the initial symptoms of COVID-19 are quite similar to those of the common flu, as both are respiratory diseases and can cause fever, cough, and fatigue. Thus, it is very important to quickly identify people with such symptoms and send them to quarantine, where formal tests like Reverse Transcription Polymerase Chain Reaction (RT-PCR) can be conducted to verify the presence of the coronavirus.

Fig. 1 Architecture of Flu-Net based on the ResNeSt50 and I3D models. Our proposed framework has three steps. First, a frame difference operation is performed to remove background details and retain only foreground motion information. Second, the frame differences are fed directly into the I3D stream, while an RGBMI is created for the ResNeSt50 stream. Finally, an optimal set of features is selected using the GWO technique to train an MLP classifier

As per the WHO, the coronavirus can spread through contact, droplets, and fomites, and may survive on surfaces for up to 72 h. Further, since no vaccine has 100% efficacy against COVID-19, the most effective method to prevent the spread of the virus is to implement a total lockdown, where the general public is advised to stay at home. However, a lockdown is only a short-term solution to this health crisis. Once the lockdown is lifted, people will come out into the open environment and again become susceptible to infection. The riskiest places are airports, transport stations, shopping malls, conference venues, public administrations, etc., where the chance of people coming in contact with each other is very high. This could result in a new wave of coronavirus, which would put our healthcare systems under unimaginable strain, especially in densely populated countries like India and China. However, if we could easily detect people in public places who show potential symptoms of the infection, we could quickly isolate them to prevent the infection from spreading further. To this end, we propose a two-stream heterogeneous network, called Flu-Net, to recognize flu-like symptoms in people appearing in video streams. As shown in Fig. 1, our framework is built upon the ResNeSt (Zhang et al. 2020) and I3D (Carreira and Zisserman 2017) models to learn complementary information from input RGB frames. Specifically, we first perform a frame difference operation to suppress the influence of background information and focus on foreground motion areas. These RGB frame differences are fed directly into the I3D model to learn spatio-temporal features from the input video sequences, while for the ResNeSt50 model, we first stack the frame differences together to generate an RGB motion image (RGBMI) (Imran and Raman 2019), and then use these images to train the network. Finally, the features extracted from both streams are carefully fused using a Grey Wolf Optimizer (GWO)-based feature selection technique and classified using a Multi-layer Perceptron (MLP). To validate the efficacy of our technique, we conduct experiments on a Sneeze-Cough video dataset (Thi et al. 2014) containing eight common activities like waving, sneezing, and coughing. The results show that such a system could be installed at surveillance stations to quickly identify human subjects with potential flu-like symptoms.

To summarize, our contributions are two-fold:

  1. We are the first to propose a heterogeneous network that combines the ResNeSt and I3D models for the video classification task.

  2. We also optimize the feature selection process using the GWO algorithm and achieve superior results compared to other state-of-the-art methods.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents our two-stream heterogeneous network. Section 4 discusses our experimental results. Finally, Sect. 5 concludes the paper.

2 Related works

Deep neural networks like ConvNets and Long Short-Term Memory (LSTM) have shown tremendous success in a variety of computer vision tasks like image classification (Krizhevsky et al. 2012; Li et al. 2022), video classification (Wang et al. 2016; Kujani and Kumar 2021), object detection (Wang et al. 2022a), gesture recognition (Zhang et al. 2018; Mohammed et al. 2022), video captioning (Natarajan et al. 2022; Wang et al. 2022b), medical image analysis (Deepak and Ameer 2021; Zhou et al. 2022), etc. ConvNets are loosely based on how humans perceive the world around them. This involves a hierarchical series of feature recognition, starting with simple features like edges and blobs, and progressing towards complex or abstract recognition, like combinations of shapes and, finally, the classification of entire objects. A ConvNet architecture is built by stacking together a series of convolutional (Conv) layers, pooling (Pool) layers, batch normalization (BN) layers, fully connected (FC) layers, and softmax layers. A Conv layer is made up of a number of learnable filters, which convolve with the given input to extract features and pass them to the next layer. A Pool layer downsamples the output of the previous layer by combining each local neighborhood of neurons into a single value, either by taking their average (AvgPool) or by selecting the maximum value (MaxPool). BN helps to speed up the training process by standardizing and normalizing the output of the previous layer. An FC layer connects every neuron in the previous layer to all the neurons in the next layer. The softmax layer is typically placed at the end and performs classification by generating a probability distribution over all the classes.
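To make these building blocks concrete, the following is a minimal Keras sketch that stacks Conv, BN, Pool, FC, and softmax layers; the layer sizes and the eight-class output are illustrative placeholders, not the architecture used in this paper.

```python
# Minimal illustrative ConvNet: Conv -> BN -> Pool blocks followed by FC and softmax.
# Layer sizes and the 8-class output are placeholders, not the architecture used in this paper.
from tensorflow.keras import layers, models

def build_toy_convnet(input_shape=(224, 224, 3), num_classes=8):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding='same', activation='relu'),  # learnable filters extract local features
        layers.BatchNormalization(),                               # BN standardizes activations to speed up training
        layers.MaxPooling2D(2),                                    # MaxPool keeps the strongest response per region
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.GlobalAveragePooling2D(),                           # AvgPool over the whole feature map
        layers.Dense(128, activation='relu'),                      # FC layer
        layers.Dense(num_classes, activation='softmax'),           # softmax produces a class distribution
    ])
    return model
```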

In this section, we briefly discuss the application of deep neural networks to video action recognition, which has applications in several domains like human-computer interaction (HCI), healthcare services, video surveillance, video summarization, and other similar tasks.

2.1 Video action recognition using 2D ConvNets

The first application of ConvNets to video classification was by Karpathy et al. (2014). They propose to input successive RGB frames into a multi-resolution AlexNet-based model for action recognition in videos. They highlight that training the network at two different resolutions helps maintain the same level of accuracy without increasing the network size. However, this method fuses only spatial features, giving no consideration to motion features, which are critical for any video classification task. Simonyan and Zisserman (2014a) propose a two-stream network consisting of a spatial stream and a temporal stream. The spatial stream is trained on RGB video frames (spatial information), while the temporal stream utilizes optical flow (motion information). The softmax scores obtained from both streams are normalized to train an SVM. Wang et al. (2015) propose the trajectory-pooled deep convolutional descriptor (TDD) to integrate ConvNet features with handcrafted features and build a more robust feature descriptor for video classification. Donahue et al. (2015) propose an encoder-decoder architecture called the Long-term Recurrent Convolutional Network (LRCN) by stacking an LSTM layer on top of ConvNets and training the entire network end-to-end. Yue-Hei Ng et al. (2015) study two methods for handling full-length videos. In the first, they examine various convolutional temporal feature pooling architectures across different ConvNet backbones. In the second, they employ LSTM units on top of ConvNet features to cast video classification as a sequence classification task. Feichtenhofer et al. (2016) explore different fusion strategies for ConvNet streams and conclude that fusion at a convolutional layer can reduce the number of parameters without any loss in accuracy. They also show that pooling abstract convolutional features, together with high-quality optical flow, further improves the results. Wang et al. (2016) propose the Temporal Segment Network (TSN) to model long-range temporal dependencies in a video sequence, along with several other best practices such as sparse sampling, network pre-training, and weighted average fusion of different modalities.

Diba et al. (2017) propose a temporal linear encoding (TLE) layer that can be trained end-to-end with any ConvNet model. TLE aggregates sparse feature maps over the entire video and then projects them into a lower-dimensional feature space, making them compact and computationally efficient to process. Girdhar et al. (2017) discuss learnable video-level feature aggregation using a vector of locally aggregated descriptors (VLAD) by splitting the descriptor space into k cells and pooling inside each cell. Wu et al. (2018) observe that videos can be compressed to remove redundant information; they feed the motion vectors obtained from video compression directly into ConvNets to extract more meaningful information for improved classification. The Temporal Relation Network (Zhou et al. 2018) focuses on modeling multi-scale temporal relations in videos under a sparse sampling strategy. Lin et al. (2019) propose a Temporal Shift Module (TSM) that can be plugged into 2D ConvNets and performs temporal modeling by shifting channels along the temporal dimension in both forward and backward directions without adding any computational cost. Feichtenhofer et al. (2019) develop the SlowFast network, which has two pathways: a Slow pathway to capture the semantics of an object and a Fast pathway to capture motion. Lateral connections from the Fast pathway to the Slow pathway allow the motion information to be combined with the semantic information, which ultimately improves the model's performance. Recently, Ryoo et al. (2019) proposed AssembleNet, a method to automatically find neural network architectures at multiple temporal resolutions using different modalities like RGB and optical flow.

2.2 Video action recognition using 3D ConvNets

In addition to the previous techniques, where spatial and temporal information is modeled separately, some recent works are based on 3D convolutions, which can directly extract spatio-temporal features over multiple frames. Ji et al. (2012) first apply 3D convolutions to human action recognition. Tran et al. (2015) propose the C3D model, capable of performing 3D convolution and 3D pooling operations. Varol et al. (2017) propose long-term temporal convolutions (LTC) to show that the action recognition accuracy of 3D ConvNets can be improved by learning long-term video representations. Carreira and Zisserman (2017) propose the Inflated 3D (I3D) model by inflating the kernels and filters of 2D ConvNets along the temporal dimension, thereby reusing established 2D ConvNet architectures pre-trained on the ImageNet dataset. Qiu et al. (2017) propose the Pseudo-3D Residual Net (P3D ResNet) by combining one 1\(\times\)3\(\times\)3 convolutional layer and one 3\(\times\)1\(\times\)1 convolutional layer so that both spatial and temporal information can be learned simultaneously using non-linear residual connections. Tran et al. (2018) present the R(2+1)D model based on a single type of spatio-temporal residual block consisting of a 2D spatial convolution followed by a 1D temporal convolution. Hara et al. (2018) demonstrate that the Kinetics dataset has sufficient data to train deep 3D ConvNets, and that such models can easily outperform complex 2D architectures on smaller datasets. Xie et al. (2018) investigate the I3D architecture to find out how to reduce its space and time complexity. They show that by using separable 3D convolutions (S3D), the network parameters can be reduced significantly while achieving higher accuracy than I3D.

Compared to all these techniques, we propose to combine 2D and 3D ConvNets to build our heterogeneous network. This allows us to extract complementary features from videos and to detect flu-like symptoms accurately and efficiently. To further boost the recognition accuracy, we apply GWO to perform feature pruning. To summarize, this paper has three main contributions. First, we propose Flu-Net, a heterogeneous network based on the ResNeSt50 and I3D architectures. Second, we investigate the application of GWO and other evolutionary algorithms to select the best features and enhance the performance of the proposed system. Finally, Flu-Net obtains state-of-the-art results on the BIISC dataset as well as on flu-specific activities extracted from the NTU-RGBD dataset (Shahroudy et al. 2016).

3 Proposed method

Although ConvNets are a very powerful tool for any AI-based task, deciding the number and types of layers in a ConvNet architecture can be quite tricky and time-consuming. To overcome this issue, the most common solution prescribed in the computer vision literature is to pick a pre-trained ConvNet (such as AlexNet (Krizhevsky et al. 2012), ResNet (He et al. 2016), MobileNet (Howard et al. 2017), or C3D (Tran et al. 2015)) and fine-tune it for the task at hand. Following this guideline, we now explain the design of our two-stream heterogeneous network.

3.1 2D ConvNet stream based on ResNeSt50 model

In 2012, Krizhevsky et al. (2012) proposed a deep ConvNet model called AlexNet and won the ImageNet challenge (Karpathy et al. 2014). Later, deeper models like VGG-16 (Simonyan and Zisserman 2014b) and Inception (Szegedy et al. 2015) were proposed to further improve recognition accuracy. However, He et al. (2016) show that with increasing network depth, accuracy first saturates and then degrades rapidly, because error gradients cannot propagate back effectively through many layers and the network becomes hard to optimize. To overcome this, the authors propose adding skip connections between layers to provide an alternate pathway for data and gradients to flow, thus simplifying the training of deeper networks. As a result, the layers in the ResNet architecture are explicitly reformulated to learn residual functions as:

$$\begin{aligned} H(x) = F(x) + x \end{aligned}$$
(1)

where F(x) denotes the output (also called feature maps) of one or more convolution and pooling layers, and x is the identity input carried by the skip connection.
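As a concrete illustration of Eq. 1, a basic residual block can be sketched in Keras as follows; the filter count and kernel size are illustrative, and the block assumes the input already has a matching number of channels, so it is not the exact ResNet-50 bottleneck design.

```python
# Sketch of a basic residual block implementing H(x) = F(x) + x (Eq. 1).
# Filter count and kernel size are illustrative; the input is assumed to already have
# `filters` channels so the addition with the shortcut is shape-compatible.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                             # identity (skip) path
    y = layers.Conv2D(filters, 3, padding='same')(x)         # F(x): two conv layers
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                          # H(x) = F(x) + x
    return layers.Activation('relu')(y)
```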

Fig. 2 A ResNeSt block has two hyperparameters: cardinality (k) and radix (r). Each input feature is divided into k groups with r splits within each group. Attention weights are first calculated across all groups and are then multiplied with each feature map to generate the output representation

However, the main drawback of the ResNet architecture is its inability to exploit cross-channel information. To overcome this limitation, Zhang et al. (2020) propose to incorporate an attention mechanism into the ResNet model, resulting in an improved architecture called the Split-Attention Network (ResNeSt). As shown in Fig. 2, a ResNeSt block combines the idea of cardinality from the ResNeXt model (Xie et al. 2017) with the attention mechanism from the Squeeze-and-Excitation Network (SENet) (Hu et al. 2018). The cardinality (k) defines the number of bottleneck groups into which the channel information is split, which increases the recognition accuracy of the model. The SENet-style attention then squeezes each channel into a single numeric value by applying global average pooling. Finally, excitation is performed by adding non-linearity using ReLU and smooth gating using sigmoid functions.
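To illustrate the squeeze-and-excitation step described above, here is a minimal Keras sketch of SENet-style channel attention; it is not a full ResNeSt split-attention block, and the reduction ratio of 16 is an assumed value.

```python
# Sketch of SENet-style channel attention (squeeze-and-excitation).
# This illustrates only the squeeze/excitation step, not a full ResNeSt split-attention block.
from tensorflow.keras import layers

def squeeze_excite(x, channels, reduction=16):                       # reduction ratio is an assumed value
    s = layers.GlobalAveragePooling2D()(x)                           # squeeze: one value per channel
    s = layers.Dense(channels // reduction, activation='relu')(s)    # excitation: non-linearity (ReLU)
    s = layers.Dense(channels, activation='sigmoid')(s)              # smooth gating in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                                 # reweight each channel of the input
```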

In our proposed framework, we use the ResNeSt50 model and add two FC layers of 256 neurons with 60% dropout before the Softmax layer. Further, since ResNeSt50 accepts an RGB image (typically of size \(224\times 224\)) as input, we first convert each video into a single motion template called the RGB motion image (RGBMI). As described in Imran and Raman (2019), an RGBMI is computed by accumulating the weighted absolute frame differences over the entire video sequence as:

$$\begin{aligned} RGBMI = \sum _{k=2}^{N}k*|frame^k-frame^{k-1}|, \end{aligned}$$
(2)

where N denotes the number of frames.

Fig. 3 Samples of video frames (top row) and corresponding RGBMI (bottom row) for six action classes in the BIISC dataset (Thi et al. 2014)

Some samples of RGBMI are shown in Fig. 3. We can observe that the frame difference operation removes most of the irrelevant background information, while accumulating all the frame differences summarizes the motion pattern into a single image. Each RGBMI is then resized to \(224\times 224\) and used to train the ResNeSt50 stream. Once the network is sufficiently fine-tuned, a \(256\)-D feature vector for each RGBMI is extracted from the last FC layer.
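A minimal NumPy/OpenCV sketch of Eq. 2 is given below; the video-reading loop and the normalization of the accumulated image to 8-bit before resizing are our own choices, while the weighted sum of absolute frame differences follows the equation.

```python
# Sketch of RGBMI computation (Eq. 2): weighted sum of absolute frame differences.
# The video path is a hypothetical placeholder; the min-max normalization to 8-bit is our own
# choice so the result can be resized and fed to ResNeSt50 like an ordinary RGB image.
import cv2
import numpy as np

def compute_rgbmi(video_path, out_size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame.astype(np.float32))
        ok, frame = cap.read()
    cap.release()

    rgbmi = np.zeros_like(frames[0])
    for k in range(1, len(frames)):                   # k = 2 .. N in the paper's 1-based indexing
        rgbmi += (k + 1) * np.abs(frames[k] - frames[k - 1])

    rgbmi = cv2.normalize(rgbmi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.resize(rgbmi, out_size)                # resized to 224x224 for the ResNeSt50 stream
```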

3.2 3D ConvNet stream based on I3D model

Tran et al. (2015) first proposed a 3D ConvNet called C3D, which can perform convolution and pooling operations in the temporal domain. However, C3D contains a huge number of parameters, making it extremely difficult to fine-tune on small datasets. More recently, Carreira and Zisserman (2017) proposed the Inflated 3D ConvNet (I3D), which uses Inception-V1 (Szegedy et al. 2015) as the backbone network. Specifically, all the \(N\times N\) 2D filters (pre-trained on the ImageNet dataset) present in Inception-V1 are repeated N times along the time dimension to create \(N\times N\times N\) 3D filters in the I3D model (Fig. 4). The resulting network is then trained on the large-scale Kinetics dataset covering 400 action classes. This provides a strong initialization point for the I3D model, making it possible to fine-tune it even on small datasets.
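The inflation step can be sketched as follows; the rescaling of the repeated kernel by the temporal size follows the bootstrapping recipe of the original I3D paper, which is not spelled out in the text above, so treat it as an assumption.

```python
# Sketch of I3D-style kernel inflation: an N x N 2D kernel pre-trained on ImageNet is
# repeated along a new temporal axis to obtain an N x N x N 3D kernel.
import numpy as np

def inflate_kernel(kernel_2d: np.ndarray, temporal_size: int) -> np.ndarray:
    # kernel_2d has shape (k, k, c_in, c_out); the result has shape (t, k, k, c_in, c_out)
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], temporal_size, axis=0)
    # Dividing by the temporal size keeps the response to a static (repeated-frame) video equal
    # to the original 2D response, as in the I3D bootstrapping recipe (an assumption here).
    return kernel_3d / temporal_size
```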

Fig. 4 The 2D Inception module from the Inception-V1 architecture (Szegedy et al. 2015) (left), and the inflated 3D Inception module from the I3D architecture (Carreira and Zisserman 2017) (right). 3D convolutions have the advantage of directly extracting features from the spatial and temporal dimensions simultaneously

In our proposed I3D stream, we add two FC layers with 512 neurons and 70% dropout before the Softmax layer, as shown in Fig. 1. For the input, we first perform the frame difference operation and resize the resulting images to 112\(\times\)112 to reduce computational complexity. Then, 64 RGB frame differences are randomly selected from each video sequence to fine-tune the I3D stream. Finally, a \(512\)-D feature vector corresponding to each sample is extracted from the last FC layer.
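A sketch of how an input clip for the I3D stream could be prepared is given below; the paper does not give the exact sampling code, so the sorted random selection of 64 frame differences, the [0, 1] scaling, and the frame-list input format are our own assumptions.

```python
# Sketch of I3D input preparation: frame differencing, resizing to 112x112, and random
# selection of 64 frame differences per video (our reading of the description above).
import numpy as np
import cv2

def prepare_i3d_clip(frames, clip_len=64, size=(112, 112), seed=None):
    # frames: list of RGB frames as uint8 arrays of shape (H, W, 3)
    diffs = [cv2.absdiff(frames[i], frames[i - 1]) for i in range(1, len(frames))]
    diffs = [cv2.resize(d, size) for d in diffs]

    rng = np.random.default_rng(seed)
    # Sample with replacement only if the video is shorter than the clip; sorting the indices
    # preserves temporal order (an assumption on our part).
    idx = np.sort(rng.choice(len(diffs), size=clip_len, replace=len(diffs) < clip_len))
    clip = np.stack([diffs[i] for i in idx]).astype(np.float32) / 255.0
    return clip                                       # shape: (64, 112, 112, 3)
```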

3.3 Fusion of the two streams

Let \(F^{2D}=\{f^{2D}_1,f^{2D}_2,...,f^{2D}_n\}\) and \(F^{3D}=\{f^{3D}_1,f^{3D}_2,...,f^{3D}_n\}\) denote the sets of feature vectors extracted from our 2D and 3D ConvNet streams for n video samples, respectively. As stated previously, \(f^{2D}_i\in \mathbb {R}^{256}\) and \(f^{3D}_i\in \mathbb {R}^{512}\). There are different possible fusion strategies (Feichtenhofer et al. 2016) to predict the actual class label:

  1. Softmax Score Fusion: The Softmax scores of both streams can be combined using the Sum rule (Eq. 3) or the Product rule (Eq. 4); a minimal sketch of both rules is given after this list.

    $$\begin{aligned} c = \mathop {\textrm{argmax}}\limits _j \sum _{i=1}^{k}P(\hat{y}_j|\textbf{f}_i) \end{aligned}$$
    (3)
    $$\begin{aligned} c = \mathop {\textrm{argmax}}\limits _j \prod _{i=1}^{k}P(\hat{y}_j|\textbf{f}_i) \end{aligned}$$
    (4)

    where k is the number of streams (in our case k = 2), c is the predicted class label, and \(P(\hat{y}_j|\textbf{f}_i)\) denotes the probability of predicting class \(\hat{y}_j\) for the feature vector \(\textbf{f}_i\) of stream i.

  2. Feature Selection using GWO: The feature vectors extracted from both streams can be stacked together and then passed to a classifier like a linear SVM or an MLP. However, simply concatenating the feature vectors not only increases the computational complexity but also results in lower accuracy due to the addition of irrelevant features. To overcome these limitations, we propose to perform feature selection using the GWO algorithm developed by Mirjalili et al. (2014). GWO is a population-based meta-heuristic algorithm that simulates the leadership hierarchy and hunting mechanism of grey wolves. The social hierarchy of grey wolves is divided into four levels: alpha, beta, delta, and omega. The alpha is the dominant wolf of the pack, and all of its orders must be followed by the pack members. The second level is the beta wolves, which act as disciplinarians for the pack, are subordinate to the alpha, and help the alpha in decision making. A beta wolf can be either male or female and is considered the best candidate to become the alpha when the alpha passes away or becomes very old. The third level is called delta; the delta wolves have to submit to the alpha and beta but dominate the omega. The lowest level is the omega wolves, which have to obey all their superiors in the hierarchy.
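As referenced in item 1, a minimal sketch of the two score-fusion rules (Eqs. 3 and 4) is given below, assuming the per-stream softmax outputs are available as NumPy arrays over the same class set; the GWO-based selection of item 2 is developed in the remainder of this subsection.

```python
# Sketch of softmax score fusion (Eqs. 3 and 4) for the two streams.
# Inputs are assumed to be per-stream softmax vectors of shape (num_classes,).
import numpy as np

def sum_fusion(scores_2d, scores_3d):
    return int(np.argmax(scores_2d + scores_3d))      # Sum rule (Eq. 3)

def product_fusion(scores_2d, scores_3d):
    return int(np.argmax(scores_2d * scores_3d))      # Product rule (Eq. 4)
```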

During hunting, all the wolves try to surround the prey as shown in Fig. 5. Mathematically, this behavior can be represented as

Fig. 5 Position updating of alpha, beta, delta, and omega wolves during the hunting operation in the GWO algorithm

$$\begin{aligned} \overrightarrow{D} = \left| \overrightarrow{C}\overrightarrow{V_p}(t) - \overrightarrow{V}(t) \right| , \end{aligned}$$
(5)
$$\begin{aligned} \overrightarrow{V}(t+1) = \overrightarrow{V_p}(t) - \overrightarrow{A} \cdot \overrightarrow{D}, \end{aligned}$$
(6)

where t is the current iteration, \(V_p\) is the position vector of the prey, V indicates the position vector of a grey wolf, and A and C are coefficient vectors which are calculated as

$$\begin{aligned} \overrightarrow{A} = 2\overrightarrow{a} \cdot \overrightarrow{r_1} - \overrightarrow{a}, \end{aligned}$$
(7)
$$\begin{aligned} \overrightarrow{C} = 2 \cdot \overrightarrow{r_2} \end{aligned}$$
(8)

where the components of \(\overrightarrow{a}\) are linearly decreased from 2 to 0 over the course of the iterations, and \(\overrightarrow{r_1}\), \(\overrightarrow{r_2}\) are random vectors in [0, 1]. During each iteration, the positions of the omega wolves are updated based on the positions of the three best wolves (alpha, beta, delta) in the current generation of the population, using the following equations:

$$\begin{aligned} \overrightarrow{D_\alpha } = \left| \overrightarrow{C_1}\overrightarrow{V_\alpha } - \overrightarrow{V} \right| , \quad \overrightarrow{D_\beta } = \left| \overrightarrow{C_2}\overrightarrow{V_\beta } - \overrightarrow{V} \right| , \nonumber \\ \overrightarrow{D_\delta } = \left| \overrightarrow{C_3}\overrightarrow{V_\delta } - \overrightarrow{V} \right| \end{aligned}$$
(9)
$$\begin{aligned} \overrightarrow{V_1} = \overrightarrow{V_\alpha } - \overrightarrow{A_1} \cdot \overrightarrow{D_\alpha }, \quad \overrightarrow{V_2} = \overrightarrow{V_\beta } - \overrightarrow{A_2} \cdot \overrightarrow{D_\beta }, \nonumber \\ \overrightarrow{V_3} = \overrightarrow{V_\delta } - \overrightarrow{A_3} \cdot \overrightarrow{D_\delta } \end{aligned}$$
(10)
$$\begin{aligned} \overrightarrow{V}(t+1) = \frac{\overrightarrow{V_1}+\overrightarrow{V_2}+\overrightarrow{V_3}}{3} \end{aligned}$$
(11)
Fig. 6 Flowchart of the GWO algorithm

The entire evolutionary process, shown in Fig. 6, is optimized using the two-stage feature selection (2SFS) approach proposed by Xue et al. (2012). In stage 1, the fitness function (Eq. 12) minimizes the classification error rate obtained by the selected feature subset during the evolutionary training process.

$$\begin{aligned} Fitness1 = Error\,Rate = \frac{FP+FN}{TP+TN+FP+FN} \end{aligned}$$
(12)

where TP, TN, FP and FN denote the true positives, true negatives, false positives, and false negatives, respectively. Stage 2 begins with the solution achieved in stage 1 and tries to minimize the number of features while maximizing the classification performance using the fitness function shown in Eq. 13.

$$\begin{aligned} Fitness2 = \alpha * \frac{\#Selected\,Features}{\#All\,Features} + (1-\alpha )*\frac{Error\,Rate}{ER} \end{aligned}$$
(13)

where \(\alpha\) is a constant and \(\alpha \in [0,1]\). \(\#Selected\,Features\) denotes the number of selected features and \(\#All\,Features\) represents the total number of available features. ErrorRate is the classification error obtained using the selected features, while ER denotes the classification error obtained on the training set using all the features. \(\alpha\) controls the tradeoff between classifier performance (classification accuracy in our case) and the number of selected features relative to the total number of features.
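A sketch of the two fitness functions (Eqs. 12 and 13) is given below; the MLP classifier settings, the train/validation split, the 0.5 threshold that turns a wolf's continuous position into a binary feature mask, and the value of \(\alpha\) are our own assumptions, as the text above does not specify them.

```python
# Sketch of the two-stage fitness functions (Eqs. 12 and 13).
# The MLP settings, the validation split, the 0.5 threshold for binarizing a wolf's position,
# and alpha = 0.5 are assumptions, not taken from the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier

def error_rate(mask, X_train, y_train, X_val, y_val):
    cols = mask > 0.5                                  # assumed binarization of the wolf position
    if not cols.any():
        return 1.0                                     # empty subset: worst possible fitness
    clf = MLPClassifier(max_iter=300).fit(X_train[:, cols], y_train)
    return 1.0 - clf.score(X_val[:, cols], y_val)      # Eq. 12: classification error of the subset

def fitness_stage2(mask, X_train, y_train, X_val, y_val, er_all, alpha=0.5):
    cols = mask > 0.5
    frac = cols.sum() / mask.size                      # #SelectedFeatures / #AllFeatures
    er = error_rate(mask, X_train, y_train, X_val, y_val)
    return alpha * frac + (1 - alpha) * er / er_all    # Eq. 13
```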

4 Experiments

In this section, we describe the dataset details, network training, and experimental results. We also analyze the results obtained to gain further insight into our proposed framework.

4.1 Dataset

We evaluate our method on the BIISC video action dataset (Thi et al. 2014). As shown in Fig. 3, this dataset consists of eight action types: ‘answer phone call’, ‘cough’, ‘drink’, ‘scratch face’, ‘sneeze’, ‘stretch arm’, ‘wave hand’ and ‘wipe glasses’. For each class, there are 120 videos, performed by 20 human subjects (12 males and 8 females) aged between 20 and 50 years. Each subject performs each action six times: under 3 viewpoints (front, left, and right) and in 2 different positions (standing and walking). All the videos are also horizontally flipped, producing a total of \(20\times 8\times 3\times 2\times 2=1920\) videos. For evaluation, we follow the protocol set by Thi et al. (2014), using subjects 2, 3, 4, 5, and 6 for testing and the remaining subjects for training.

Since BIISC is the only dataset available for flu detection, we explored other publicly available datasets and found that the NTU RGB+D dataset (Shahroudy et al. 2016) contains a few actions (out of its 60 action classes) that are useful for evaluating our proposed technique. As shown in Fig. 7, we select 10 action classes: ‘phone call’, ‘sneeze/cough’, ‘drink’, ‘wipe face’, ‘brush teeth’, ‘hand wave’, ‘rub hands’, ‘clapping’, ‘nausea/vomiting’, and ‘nod head’. We extract 204 videos per class and call the resulting subset the NTU-Flu dataset. Out of the total 204\(\times\)10 = 2040 videos, 75% (1530 videos) are randomly selected for training, and the remaining 25% (510 videos) are used for testing. Further details of both datasets are given in Table 1.

Fig. 7 Sample frames of the 10 classes in the NTU-Flu dataset

Table 1 Details of the BIISC and NTU-Flu datasets

4.2 Network training

We perform experiments on a PC with an Intel Core i5 CPU @ 3 GHz, 16 GB RAM, and an NVIDIA GTX 1060 GPU. The frame difference operation is performed using Matlab R2018a, while the Keras library is used to implement the deep neural networks. We use the Adam optimizer to fine-tune both streams, with the initial learning rate set to 0.0001 and 0.0005 for the ResNeSt50 and I3D streams, respectively. The batch size is set to 16. The learning rate is halved when the training error plateaus.
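A sketch of this training set-up is shown below; the model objects are placeholders and the plateau patience is an assumed value, since the text only states that the learning rate is halved when the training error plateaus.

```python
# Sketch of the training set-up described above: Adam optimizer, per-stream learning rates,
# batch size 16, and halving the learning rate when the loss plateaus.
# `resnest50_stream`, `i3d_stream`, and the patience value are placeholders/assumptions.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

def compile_streams(resnest50_stream, i3d_stream):
    resnest50_stream.compile(optimizer=Adam(learning_rate=1e-4),
                             loss='categorical_crossentropy', metrics=['accuracy'])
    i3d_stream.compile(optimizer=Adam(learning_rate=5e-4),
                       loss='categorical_crossentropy', metrics=['accuracy'])

# Halve the learning rate when the monitored training loss stops improving (patience assumed).
plateau_cb = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5)
# Example usage: stream.fit(x, y, batch_size=16, epochs=..., callbacks=[plateau_cb])
```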

4.3 Results

Table 2 Comparison of results on 2D ConvNet stream using different state-of-the-art 2D ConvNets as backbone architectures
Fig. 8 Attention score maps obtained on the NTU-Flu dataset using the ResNeSt50 model. Our trained model accurately focuses on the important areas while giving less attention to the surroundings. For instance, the heat-maps generated for the ‘Sneeze/Cough’ and ‘Hand wave’ actions clearly show that our model gives more emphasis (red) to the face and hand regions, respectively

Results on 2D ConvNet stream:

Table 2 presents the results of the 2D ConvNet stream. We evaluate four different backbone networks in our experiments: MobileNetV3 (Howard et al. 2019), EfficientNetV2-S (Tan and Le 2021), ResNet50 (He et al. 2016), and ResNeSt50 (Zhang et al. 2020). It is clear that ResNeSt50 achieves the best results on both datasets due to its ability to focus on relevant areas using the split-attention mechanism. This can also be seen in the attention score maps shown in Fig. 8. Therefore, we use ResNeSt50 to implement our 2D ConvNet stream.

Results on 3D ConvNet stream:

We test our 3D ConvNet stream with three different clip sizes: 32, 48, and 64 frames. In each clip, the frames are extracted randomly, and the results obtained are presented in Table 3. We can observe that the best results are obtained for the 64-frame clip size. This is because the average number of frames in both datasets is greater than 64; thus, extracting fewer frames results in a loss of accuracy. We do not increase the clip size beyond 64 due to the increase in computational complexity.

Table 3 Comparison of results on the 3D ConvNet stream using different clip sizes
Table 4 Comparison of results obtained using score fusion
Table 5 Comparison of results obtained on the BIISC dataset using the GWO algorithm
Table 6 Comparison of results obtained on the NTU-Flu dataset using the GWO algorithm

Results using Softmax score fusion:

Table 4 presents the results obtained by fusing the softmax scores of the 2D and 3D ConvNet streams. We find that product fusion achieves higher accuracy than sum fusion on both datasets. In fact, sum fusion degrades the performance on the NTU-Flu dataset. Thus, product fusion proves to be more effective for fusing Softmax scores in applications based on human activity recognition.

Results using GWO-based feature selection:

Next, we present the results of the GWO-based feature selection technique. We vary the population size from 5 to 50 (in steps of 5) while keeping the maximum number of iterations at 10; we did not find any improvement in the results when the maximum number of iterations was increased beyond 10. The features selected using the 2SFS approach defined in Eqs. 12 and 13 are finally classified using an MLP. The results obtained for the BIISC and NTU-Flu datasets are presented in Table 5 and Table 6, respectively. For the BIISC dataset, the best accuracy of 70% is obtained when 161 features are selected with a population size of 20. This result is 1.2% higher than product fusion. Similarly, an improvement of about 1.6% is observed for the NTU-Flu dataset when 157 features are selected with a population size of 25, resulting in an overall accuracy of 86.07%.

Table 7 Comparison of accuracy using different optimization algorithms (OAs) for feature selection (FS)

Comparison with other optimization algorithms:

In order to verify the efficacy of the GWO algorithm, we compare our results with five other optimization algorithms (OAs) that could be used for feature selection:

  1. Bat Optimization Algorithm (BAT) (Yang 2010)

  2. Bees Optimization Algorithm (BEE) (Pham et al. 2006)

  3. Cuckoo Search Algorithm (CSA) (Yang and Deb 2009)

  4. Moth-Flame Optimization (MFO) (Mirjalili 2015)

  5. Particle Swarm Optimization (PSO) (Kennedy and Eberhart 1995)

Again, all the experiments are performed by varying the population size from 5 to 50 while fixing the number of iterations at 10. Classification is performed using an MLP, and the best results thus obtained are reported in Table 7. We can observe that using no feature selection (FS) algorithm (row 1 of Table 7) yields the worst performance. The GWO algorithm (last row of Table 7) achieves the highest accuracies on both datasets while selecting the minimum number of features. The other OAs are not as effective in reducing the dimensionality of the features extracted from the two streams. Only MFO is able to achieve the same accuracy as GWO on the NTU-Flu dataset, but the number of features selected by MFO (314) is twice the number selected by GWO (157).

Comparison with previous results:

Table 8 compares our results against the four baseline results reported by Thi et al. (2014): Cuboid + BoW + \(\chi ^2\), HOGHOF + BoW + \(\chi ^2\), Cuboid + AMK II, and HOGHOF + AMK II. AMK I and II are two types of Action Machine Kernels proposed by Thi et al. (2014) to integrate space-time layout with Bag-of-Words-based local features; both fall under the category of handcrafted features. Compared to them, our ResNeSt50 stream gives only 57.5% accuracy, while I3D performs much better, recognizing 64.6% of the actions correctly. On combining both streams, we obtain 68.8% and 70% accuracy using Product and GWO-based fusion, respectively.

Table 8 Accuracy comparison with existing methods on BIISC dataset
Fig. 9 Class-wise comparison of accuracy between the ResNeSt50 and I3D streams on the BIISC dataset. The I3D stream performs better than the ResNeSt50 stream on most of the actions, such as ‘call’, ‘drink’, and ‘sneeze’

Fig. 10 Class-wise comparison of accuracy between the ResNeSt50 and I3D streams on the NTU-Flu dataset. The I3D stream performs better than the ResNeSt50 stream on most of the actions, such as ‘sneeze/cough’, ‘drink’, ‘wipe face’, and ‘rub hands’

Fig. 11 Class-wise comparison of accuracy between Product and GWO fusion on the BIISC dataset

Fig. 12 Class-wise comparison of accuracy between Product and GWO fusion on the NTU-Flu dataset

Fig. 13 Normalized confusion matrix of GWO fusion on the BIISC dataset. The highest confusion, between the ‘cough’ and ‘sneeze’ actions, occurs due to their similar spatial and temporal patterns

Fig. 14 Normalized confusion matrix of GWO fusion on the NTU-Flu dataset. The highest confusion, between the ‘rub hands’ and ‘clap’ actions, occurs due to their similar spatial and temporal patterns

4.4 Analysis

To gain further insight into the performance of the individual streams, we plot the comparison between the ResNeSt50 and I3D models on the BIISC and NTU-Flu datasets in Figs. 9 and 10, respectively. For the BIISC dataset, I3D performs significantly better on the ‘call’, ‘drink’, and ‘scratch’ classes, while ResNeSt50 is more suitable for the ‘cough’ class. The remaining classes are recognized almost equally well by both streams. A similar observation is made for the NTU-Flu dataset, where I3D achieves higher accuracy than ResNeSt50 on all classes except the ‘hand wave’ and ‘clapping’ actions. This is because I3D is pre-trained on both the ImageNet and Kinetics datasets, and can perform convolutions in both the spatial and temporal dimensions.

Next, we plot the class-wise accuracy obtained using the Product and GWO-based fusion techniques on the BIISC and NTU-Flu datasets in Figs. 11 and 12, respectively. Except for the ‘wave’ class, GWO fusion achieves better results on all classes of the BIISC dataset. Similarly, GWO fusion is better for all classes except ‘rub hands’ and ‘nausea/vomiting’ on the NTU-Flu dataset. These results justify the use of the GWO-based feature selection technique when combining features extracted from different streams. This method not only reduces the dimension of the input feature vector but also increases the recognition accuracy significantly.

Finally, we analyze the normalized confusion matrices obtained by GWO fusion on the two datasets in Figs. 13 and 14. We observe that the ‘call’ action has the lowest accuracy of 46.67% in the BIISC dataset. This is because, while answering a phone call, a person moves his or her hand towards the head, which closely resembles other actions such as ‘cough’ and ‘scratch’. Similarly, the ‘cough’ and ‘sneeze’ classes show high confusion between them due to their very similar motion patterns. However, most of the classes show high accuracies in the NTU-Flu dataset. This is because NTU-Flu contains high-resolution (1920\(\times\)1080) videos, resulting in much cleaner features extracted by both streams. Only the ‘rub hands’ and ‘clapping’ actions show some confusion due to the involvement of hands in both classes.

5 Conclusion

In this paper, a deep learning-based recognition method has been proposed to identify flu-like symptoms in videos. We built a two-stream heterogeneous network to extract complementary features from RGB frame differences using 2D and 3D ConvNets. We performed multiple experiments to decide the best models for our framework, i.e., ResNeSt50 as the 2D ConvNet and I3D as the 3D ConvNet. The split-attention block in ResNeSt50 helps to focus on important regions within the input RGBMIs, while a larger (64-frame) clip size helps to capture the entire spatio-temporal information in the I3D stream. To further boost the final performance, we used a GWO-based feature selection algorithm to select the most relevant feature set, thereby reducing the computational complexity and improving the classification accuracy. Finally, we performed an in-depth analysis to understand the strengths and weaknesses of our proposed framework. In the future, we will consider collecting larger datasets and adding other modalities (like depth or infrared videos) to improve the robustness of our technique. Data from inertial sensors like accelerometers and gyroscopes could also be incorporated in future experiments to overcome the limitations of occlusion and further improve the results by adding an extra modality to our existing framework.