1. Introduction
Artificial Intelligence (AI) refers to the ability of machines to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI forms the basis of any technology used for Human Activity Recognition (HAR) [1,2]. HAR systems identify human activities using data captured by wearable devices or smartphones [3,4]. The main goal of HAR is to enhance health outcomes for people suffering from chronic diseases, such as diabetes and Parkinson’s disease, as well as for older adults [1,5,6]. HAR has numerous applications in the healthcare and wellness sector for individuals who wish to monitor their health and fitness [3,4].
A variety of sensors can be utilized to implement HAR. One compelling sensor in this domain is the accelerometer, which can detect both high- and low-frequency movements [3,7]. A gyroscope and a magnetometer can be used to complement accelerometer data [3,4]. The more information passed to an AI system, the easier it is to learn a pattern. In practical applications, however, more sensors mean more computational resources, which are often unavailable on devices with limited capacity, such as smartwatches or smartphones [8]. Therefore, a solution that relies on a single sensor type is more appropriate for real-world scenarios.
Traditional Machine Learning (ML) requires an expert to extract valuable information from sensor signals, which demands a significant amount of manual effort [3,4]. Conversely, Deep Learning (DL) models can extract features as well as, if not better than, classical models [3]. A DL model has a large number of hidden layers that enable it to extract high-level features from the data, giving DL models an advantage over conventional machine learning models, which can only learn from hand-crafted features. However, it remains uncertain whether the features learned by DL models generalize as well as hand-crafted features, or whether they are dataset-specific.
Convolutional Neural Networks (CNNs) are among the most widely utilized DL models and have emerged as the state-of-the-art method for numerous tasks. In recent years, One-Dimensional Convolutional Neural Networks (1D CNNs) have been effective in classifying human activity from wearable sensors, such as accelerometers and gyroscopes [3,4,7,8,9]. One-dimensional CNNs are similar to standard convolutional neural networks, but instead of processing two-dimensional images, they process one-dimensional signals, such as time series data. They can be trained end-to-end to extract features directly from raw data.
AI systems are now making significant decisions on our behalf, including influencing criminal sentencing and curating online content [10,11]. However, without understanding why these systems make certain decisions, people are reluctant to trust and rely on them. The term “eXplainable AI” (XAI) describes an artificial intelligence system that can provide a human with an explanation of how it reached its decisions [11]. The main issue with current deep learning models is that they cannot explain their decisions, which makes them difficult to trust. Novel techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), are being developed to improve the interpretability of deep models [8,12].
T-distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) are machine learning techniques for visualizing high-dimensional data [13,14,15]. They reduce a large set of variables to a smaller set of components while striving to retain as much of the input information as possible. These methods are often applied to raw signals, such as images or time series data. However, the potential of combining these techniques with a deep network has yet to be explored.
This paper describes a method for generating visual explanations of the features learned by CNNs applied to HAR. By integrating the trained model’s learned features with the t-SNE technique, we offer several explanations of the model’s decision-making process. Moreover, we demonstrate the transferability of the learned features from one dataset to another through the proposed 2D visualization. To the best of our knowledge, this is the first study to present an approach for accurately visualizing learned features in 1D CNNs applied to HAR tasks.
The remainder of this paper is organized as follows. Section 2 introduces the standard protocol used in developing activity recognition applications. Section 3 discusses related works that have utilized the SHO or HAPT databases, the most commonly used XAI methods in HAR, and how t-SNE can be applied. Section 4 describes the methodology, including the 1D CNN architectures, the proposed framework, and the databases employed. Section 5 presents the metric results achieved for each experiment performed with the SHO and HAPT datasets. Section 6 covers the explainability results. Finally, Section 7 concludes our work and presents some limitations of this study.
4. Methodology
This section explains the architectures implemented, the public databases used, and the proposed framework.
4.1. Architectures
The simulations were conducted using the CNN1 and CNN2 convolutional network architectures, initially introduced by Aquino et al. [8]. The models were trained and implemented using the TensorFlow 2 framework [20]. First, we highlight the commonalities between the two designs. In both cases, the Adam optimizer was employed with its default hyperparameters, and the models were trained for 300 epochs. Each convolutional block comprised two layers, each containing 100 feature maps, with the ReLU activation function. In every max-pooling layer, the pool size and stride were set to 2. For dropout, a rate of 0.5 was applied [47]. The Softmax layer contained n neurons, where n represents the number of classes for the problem. A custom callback created a checkpoint after each training epoch, selecting the optimal model based on the macro F1-score on the validation set. The batch size was set to 256. Both architectures received identical signals at their inputs.
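A minimal sketch of such a checkpoint callback is shown below, assuming the macro F1-score is computed with scikit-learn on the validation split; the variable names and saving path are illustrative and not taken from the original implementation.

```python
# Sketch of the checkpoint callback assumed here: after each epoch, keep
# the weights with the best macro F1-score on the validation set.
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score

class MacroF1Checkpoint(tf.keras.callbacks.Callback):
    def __init__(self, x_val, y_val, path="best_model.weights.h5"):
        super().__init__()
        self.x_val, self.y_val, self.path = x_val, y_val, path
        self.best = -np.inf

    def on_epoch_end(self, epoch, logs=None):
        y_pred = np.argmax(self.model.predict(self.x_val, verbose=0), axis=1)
        score = f1_score(self.y_val, y_pred, average="macro")
        if score > self.best:
            self.best = score
            self.model.save_weights(self.path)
```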
The differences between the two architectures are as follows. The first architecture comprised four convolutional blocks, each with two convolutional layers, while the second incorporated only three convolutional blocks. Furthermore, CNN1’s kernel size was eight, whereas CNN2’s was four. Figure 2 illustrates both architectures.
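Based on the description above, a minimal Keras sketch of the two architectures follows. The exact layer ordering, padding mode, and loss function are assumptions, since they are not specified here; see Aquino et al. [8] for the reference implementation. The input shape of (150, 3) follows from the 3-s windows sampled at 50 Hz over three accelerometer axes described in Section 4.2.

```python
# Hypothetical sketch of CNN1/CNN2 (TensorFlow 2 / Keras) under the
# assumptions stated above. CNN1: n_blocks=4, kernel_size=8;
# CNN2: n_blocks=3, kernel_size=4.
import tensorflow as tf

def build_cnn(n_classes: int, n_blocks: int = 4, kernel_size: int = 8,
              window_len: int = 150, n_channels: int = 3) -> tf.keras.Model:
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(window_len, n_channels)))
    for _ in range(n_blocks):
        # Each block: two Conv1D layers with 100 feature maps, then max pooling.
        model.add(tf.keras.layers.Conv1D(100, kernel_size, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.Conv1D(100, kernel_size, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.MaxPooling1D(pool_size=2, strides=2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dropout(0.5))  # dropout rate of 0.5 [47]
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",  # Adam with default hyperparameters
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```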
4.2. Databases
This section provides specific information about the datasets used in this study. HAR can employ various sensors, such as accelerometers, gyroscopes, and magnetometers. Accelerometers are highly effective in detecting movements, and many related studies have used this type of sensor alone, due to its relevance and reliability in HAR [3,4]. Although combining sensors can improve accuracy, this may not be feasible for devices with limited processing capacity, such as smartphones or smartwatches [3,8]. Therefore, using a single sensor, such as an accelerometer, is a more practical approach for real-world HAR applications [8,9,48].
4.2.1. SHO
The authors collected data for seven physical activities: walking, sitting, standing, jogging, biking, walking upstairs, and walking downstairs [28]. Ten volunteers performed each task for 3–4 min as part of the data capture project. All ten subjects were men between the ages of 25 and 30. Except for biking, the experiments were conducted inside a university building. Each participant was equipped with five smartphones placed at five different body positions: right pocket, left pocket, belt position toward the right leg using a belt clip, right upper arm, and right wrist.
For the trials, Samsung Galaxy SII (i9100) smartphones were used. For each activity, data was collected at a rate of 50 samples per second at all five positions simultaneously. Accelerometer, gyroscope, magnetometer, and linear acceleration sensor data were collected [28].
In this work, a 3-s window with 50% overlap was used for dataset generation. The data was taken only from the waist position (belt), and only the accelerometer sensor was considered. For the subject-dependent (SD) method, we implemented a randomized partitioning strategy with shuffling, adhering to a conventional 70/30 split for training and validation. For the subject-independent (SI) approach, we endeavored to maintain the same proportion in the partitioning, yielding 7 subjects for training and 3 subjects for validation. Validation was performed using subjects 1, 2, and 3.
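As a minimal sketch, the windowing step described above can be implemented as follows, assuming a (T, 3) accelerometer signal per subject; 150 samples correspond to the 3-s window at 50 Hz, and the 50% overlap gives a stride of 75 samples.

```python
# Sketch of the segmentation step assumed here: 3-s windows (150 samples
# at 50 Hz) with 50% overlap (stride of 75 samples).
import numpy as np

def sliding_windows(signal: np.ndarray, win_len: int = 150,
                    overlap: float = 0.5) -> np.ndarray:
    """Split a (T, 3) accelerometer signal into (N, win_len, 3) windows."""
    stride = int(win_len * (1.0 - overlap))
    starts = range(0, len(signal) - win_len + 1, stride)
    return np.stack([signal[s:s + win_len] for s in starts])
```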
SHO is a highly balanced dataset. For our setup, a total of 4130 samples was obtained, with exactly 590 samples per class, and all ten subjects provided the same number of samples.
4.2.2. HAPT
This dataset expands the UCI Human Activity Recognition Using Smartphones Dataset [18]. Instead of the pre-processed signals from the smartphone sensors supplied in version 1, this version offers the original unprocessed raw inertial signals from the sensors. Additionally, the activity labels were revised to incorporate postural transitions that were absent from the earlier version of the dataset.
The public HAPT dataset was collected from thirty participants ranging in age from 19 to 48 years [18]. The dataset contains raw inertial signals obtained from the 3-axial linear acceleration and 3-axial angular velocity sensors of a smartphone worn at the waist. It includes 6 basic activities (BAs): standing, sitting, laying, walking, walking upstairs, and walking downstairs, and 6 postural transitions (PTs) between the three static postures: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand.
The data was collected at a constant sampling frequency of 50 Hz. The signals were then synchronized with the recordings of the experiments so that they could be labeled by hand for the ground truth [18].
For dataset generation, the same setup used for the SHO dataset was applied, namely a 3-s time window with 50% overlap and only the accelerometer sensor. The data was taken only from the waist position, where the smartphone was attached using a belt.
Table 3 displays more details about the data distribution of the dataset. As shown in the N° Samples column, HAPT is an unbalanced dataset. Moreover, as shown in the N° Subjects column, not all participants performed all activities, especially the PTs.
For the subject-dependent (SD) method, we implemented a randomized partitioning strategy with shuffling, adhering to a conventional 70/30 split for training and validation. In the SI approach, we took into account that not all individuals performed all physical activities. Therefore, the first nine individuals who performed all activities were chosen for the validation subset: subjects 1, 2, 3, 4, 5, 6, 13, 17, and 18. The remaining subjects were used for training.
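A minimal sketch of the SI split follows, assuming each window carries the identifier of the subject who produced it; the array names are illustrative.

```python
# Sketch of the subject-independent (SI) split for HAPT assumed here:
# windows from the listed subjects form the validation set, the rest train.
import numpy as np

VALIDATION_SUBJECTS = [1, 2, 3, 4, 5, 6, 13, 17, 18]

def subject_independent_split(windows, labels, subject_ids):
    """windows: (N, 150, 3); labels and subject_ids: length-N arrays."""
    val_mask = np.isin(subject_ids, VALIDATION_SUBJECTS)
    return ((windows[~val_mask], labels[~val_mask]),   # training subset
            (windows[val_mask], labels[val_mask]))     # validation subset
```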
4.3. Proposed Framework
To summarize, we propose the framework illustrated in Figure 3. In the dataset-building step, we loaded the SHO and HAPT databases and performed an exploratory analysis of the data. This exploratory analysis allowed us, for example, to identify that not all individuals performed all activities in the HAPT database. We also defined a window size of 3 s with 50% overlap. After this, the data was subdivided in the splitting step, according to the SD or SI validation strategy. Then, in the training and evaluation steps, the models were implemented and evaluated, and their results were contrasted using performance metrics. Again, we used the CNN1 and CNN2 architectures. Finally, in the viewing learned features step, we obtained the embeddings from the architectures and used them as input for t-SNE. This yielded a visualization that clearly explained how the model distributed the data based on the learned features.
In this work, we proposed a framework to visualize the power of the features learned from a dataset, as well as how features learned from one dataset behave on another, unseen dataset. For evaluation on the dataset used in training, we first trained the two CNN architectures, CNN1 and CNN2, and chose the best architecture based on the achieved numerical results. We then obtained XAI visualizations and provided explanations of the decision-making process. For cross-dataset evaluation, the samples of the unseen dataset were passed through the trained network, and a vector of activations from one layer before the classification layer was obtained for each sample. New visualizations were then produced from these vectors, which allowed us to identify the limitations of the learned features.
The proposed framework therefore allows for visualizing the effectiveness of features learned within a single dataset and examining their performance on an unseen dataset. This is achieved by implementing the proposed activity recognition protocol and including a visual explanation step to gain insight into the ability of the learned features to differentiate activities on another dataset.
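The core of the viewing learned features step can be sketched as follows, assuming a trained Keras model such as the one outlined in Section 4.1; truncating at `model.layers[-2]` to obtain the activations one layer before the Softmax is an assumption about the layer indexing.

```python
# Sketch of the visual explanation step assumed here: take the activations
# one layer before the Softmax and embed them in 2D with t-SNE.
import tensorflow as tf
from sklearn.manifold import TSNE

def embed_learned_features(model: tf.keras.Model, windows, random_state=42):
    # Truncate the network at the penultimate layer (before the Softmax).
    extractor = tf.keras.Model(inputs=model.inputs,
                               outputs=model.layers[-2].output)
    features = extractor.predict(windows)  # (N, D) learned features
    # Project the learned features to two dimensions for visualization.
    return TSNE(n_components=2,
                random_state=random_state).fit_transform(features)
```

The same function also covers the cross-dataset experiments in Sections 6.3 and 6.4: a model trained on one dataset is simply applied to the windows of the other.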
6. XAI Results
In the XAI results section, we explore the outcomes obtained with the deep network CNN1 on the SHO and HAPT datasets, employing the proposed t-SNE visualization through the learned features. The results are scrutinized and interpreted to emphasize the efficacy of t-SNE in detecting bias issues and identifying mislabeled samples. The visualization further illustrates the deep network’s generalization capability in distinguishing activities and its constraints when applied to a distinct dataset. Ultimately, the discussion offers insights into how t-SNE visualization can augment the comprehension of deep networks in HAR tasks.
6.1. SHO
Following the training of the models and the evaluation of the numerical results, we employed XAI to generate visual representations that elucidate the model’s learning process. We extracted the model’s embeddings one layer before the Softmax and subsequently fitted a t-SNE on this information. The outcome was a two-dimensional embedding that endeavored to preserve the information present in the model’s output. This approach enabled clear visualization of the model’s acquired knowledge regarding the data characteristics and of how the data was partitioned based on the learned features. Figure 6 presents the resulting graph.
The visual results corresponded with the numerical results presented in Table 5, wherein the model achieved exceptional performance with an accuracy of 0.98. In the plot, the data from different classes were well separated. Confusion occurred between activities such as downstairs and standing. This outcome was also evident in the confusion matrix shown in Figure 4, where the downstairs class attained lower precision than the other activities.
As depicted in Figure 6, black x marks were drawn on samples with incorrect predictions to identify the model’s errors. The resulting representation made the incorrect predictions easy to observe and explain. For instance, we examined the misclassified sample within the jogging cluster, indicated by a black x. The ground truth for this sample was the downstairs class. This confusion arose from the features learned by the model, which positioned this sample near the jogging cluster. This may have occurred as a result of a limitation of the trained model, which might have learned characteristics that did not optimally separate the data. Alternatively, this sample could have been mislabeled. Nonetheless, the obtained visualization allowed for an understanding of why the model made an error.
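A minimal plotting sketch for this kind of figure follows, assuming the 2D coordinates come from the t-SNE step sketched in Section 4.3; the marker styling mirrors the black x convention used in Figure 6.

```python
# Sketch of the error-marking plot assumed here: scatter the t-SNE
# coordinates colored by ground truth, with black 'x' marks on errors.
import matplotlib.pyplot as plt
import numpy as np

def plot_with_errors(coords, y_true, y_pred, class_names):
    """coords: (N, 2) t-SNE output; y_true, y_pred: length-N label arrays."""
    for c, name in enumerate(class_names):
        mask = y_true == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=name)
    wrong = y_true != y_pred
    plt.scatter(coords[wrong, 0], coords[wrong, 1], marker="x", c="black",
                s=30, label="misclassified")
    plt.legend()
    plt.show()
```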
In the visualization, the samples from downstairs were more dispersed, indicating that the downstairs class was the most challenging in the dataset; the model may be more prone to errors for this class in real-world scenarios. Moreover, analyzing the obtained graph made it evident that there were subgroups within the same class, such as sitting, which featured three distinct subgroups scattered throughout the view. Similarly, even walking had two subgroups. These analyses were not possible through numerical results alone, and the visualization presented various opportunities to explain the model’s predictions.
An intriguing observation was made by analyzing the sitting activity. There were three clusters, which corresponded to the three different subjects present in the validation subset. This finding suggested that the learned features may have the potential to not only distinguish the sitting activity, but also differentiate the subjects.
To further highlight the capabilities of this visualization, Figure 7 displays a visualization using PCA. The procedure was similar to that used for t-SNE: we used the model’s embeddings, taken one layer before the classification, as input for PCA. We then derived only three components, which were employed to generate the plot. We calculated merely three components to demonstrate that there was no combination of components for which the resulting visualization was as informative as the one obtained with t-SNE.
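For reference, a minimal sketch of this PCA baseline, assuming the same (N, D) penultimate-layer features used for t-SNE:

```python
# Sketch of the PCA baseline assumed here: project the learned features
# onto three principal components; pairs of components are then plotted.
from sklearn.decomposition import PCA

def pca_projection(features, n_components: int = 3):
    """features: (N, D) activations one layer before the classification."""
    return PCA(n_components=n_components).fit_transform(features)
```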
It was anticipated that the data would exhibit clear class separation; however, this visualization did not align with the numerical results. Providing explanations was challenging, since, in this visualization, the majority of classes were situated close to each other. This outcome might have arisen because PCA lacks the capability to extract relevant information from non-linear data, which could have been the case with the features learned by the model.
The representation in Figure 6 demonstrates how the model learned to separate the data based on the features it acquired. To analyze the distribution of the raw accelerometer data, we plotted the graph displayed in Figure 8. To generate this plot, we concatenated the data from the X, Y, and Z axes of the accelerometer and used it as input for the t-SNE algorithm. Our work introduces the innovation of applying t-SNE one layer before the model’s classification, setting it apart from the standard approach of applying t-SNE to raw data, as seen in previous works [13,14,45,46].
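A minimal sketch of this raw-data baseline follows, assuming the (N, 150, 3) windows produced in Section 4.2; each window is flattened by concatenating its X, Y, and Z axes before t-SNE.

```python
# Sketch of the raw-data baseline assumed here: concatenate the X, Y, and
# Z axes of each window into one 450-dimensional vector, then run t-SNE.
import numpy as np
from sklearn.manifold import TSNE

def raw_tsne(windows: np.ndarray, random_state=42) -> np.ndarray:
    """windows: (N, 150, 3) accelerometer windows -> (N, 2) coordinates."""
    flat = windows.transpose(0, 2, 1).reshape(len(windows), -1)  # [x | y | z]
    return TSNE(n_components=2,
                random_state=random_state).fit_transform(flat)
```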
Visualizing the features learned during the model’s training alongside the raw data made it evident that the model transformed the input data. In its original form, the raw data made differentiating between human activities challenging. However, after the model learned and distilled the relevant characteristics, the separation between classes became far more distinguishable. This visualization offered valuable insights into the model’s inner workings and assisted in enhancing its performance.
6.2. HAPT
For the HAPT dataset, based on the results presented in Table 7, more significant confusion between samples was expected compared to the representation obtained with the SHO dataset. Examining Figure 9, a lesser separation between samples of different classes can indeed be observed. The main issue lay in the Postural Transition (PT) classes. This result was also anticipated from the confusion matrix in Figure 5.
To demonstrate that the model learned relevant features for the other activities, Figure 10 presents the visualization with the Postural Transitions (PTs) excluded. Other works have treated these PTs as one or two subgroups, and some studies only use PTs to enhance the performance in recognizing the other activities.
The model encountered a familiar challenge, similar to the SHO dataset, with the standing class proving difficult to classify. However, in contrast to the SHO dataset, confusion arose between standing and sitting classes. Further analysis revealed the presence of subgroups within the laying, standing, and sitting classes.
To better illustrate the original data distribution, Figure 11 displays the raw accelerometer data using t-SNE. The same process used to generate Figure 8 was applied to the HAPT dataset.
As occurred with SHO, utilizing raw data as t-SNE input was challenging because no highly relevant features were available, in contrast to Figure 9, where the features were those learned by the neural network and carried mathematical relevance.
Overall, the models demonstrated their ability to learn meaningful features, as evidenced by the t-SNE visualizations and performance metrics obtained on both the SHO and HAPT test sets. Next, we assess whether the features learned in one dataset could accurately classify human activities in another dataset.
6.3. SHO Features into HAPT
To analyze the relevance of the features, we used a network trained on the SHO dataset to extract features from the HAPT dataset. To do this, we performed a prediction with the SHO model, propagating the HAPT samples and taking the output one layer before the classification layer. This output was used as input to the t-SNE algorithm. Figure 12 shows the resulting representation. However, the SHO dataset’s learned features would need to be universal to be useful for other datasets. In this case, the SHO features made the pattern of the HAPT dataset more confusing than the raw data. Although Figure 6 shows the activity types successfully distinguished when the SHO dataset was used as the reference, the same approach produced unsatisfactory results when applied to the HAPT dataset.
6.4. HAPT Features into SHO
To analyze the relevance of the features in the opposite direction, we used a network trained on the HAPT dataset to extract features from the SHO dataset, as displayed in Figure 13. To do this, we propagated the SHO samples through the HAPT model and considered the output before the classification layer. We then used this output as input to the t-SNE algorithm.
When examining the HAPT features applied to the SHO dataset, it became evident that utilizing the learned features resulted in a more precise differentiation of the data than using the raw accelerometer data. Comparing Figure 13 with Figure 8 makes the contrast between the two approaches clear. The Sitting, Walking, and Biking classes were well separated, unlike what was observed in the raw data dispersion shown in Figure 8, where the Biking class was grouped with other classes. The Upstairs, Downstairs, Standing, and Jogging classes were grouped in close regions, but there was still a noticeable separation between the samples of these classes.
7. Conclusions
T-SNE is frequently utilized to explore the separation of raw data in a comprehensible manner. By applying this method to the output of a DL model, we introduce a novel post-hoc and model-specific approach to the XAI field. Implementing t-SNE on the output of a DL model generates a visualization that can convey a general understanding of the model’s learned features during the training process. This is distinct from the explanations produced by other established XAI methods, such as decision tree induction, rule extraction, or gradient-based techniques. In contrast to these alternatives, the proposed methodology is adept at presenting a holistic overview of a DL model across a data subset. By enhancing our understanding of the model’s behavior, this technique can function as a valuable instrument for debugging and optimizing the model’s performance.
The t-SNE embedding visualization demonstrated its potential in offering valuable insights. Nonetheless, certain limitations must be acknowledged when interpreting the results of this study. For instance, the performance of this visualization technique with alternative data types, such as electrocardiogram signals or other sensor data, remains to be explored, potentially impacting the generalizability of the approach across different domains.
Moreover, although the visualization proved effective in analyzing the decision-making processes of a CNN-based model on identical or diverse datasets, additional research is warranted to assess its applicability to various DL algorithms. Investigating alternative network architectures, such as recurrent neural networks, ConvLSTM, Bidirectional networks, and hybrid nets, may furnish a more comprehensive understanding of the t-SNE embedding visualization’s versatility with respect to different deep learning models.
Hence, while the present study contributes valuable insights into the prospective utilization of t-SNE embedding visualization for human activity recognition, based on accelerometer data, it is imperative to recognize the limitations of the proposed approach and continue investigating its efficacy across other domains and with different deep learning algorithms.
Upon applying t-SNE to a network trained on a distinct dataset, the resulting data separation can appear less distinct. Instead, the classes display a degree of mixing, as evidenced in Figure 12, where the distribution was inferior to that observed when utilizing the raw data as input, as depicted in Figure 11. This confusion may arise from the network’s inability to learn pertinent features beyond its original database. Such limitations could stem from discrepancies in data collection, including differences in the individuals involved, the sensors employed, or the collection software implementations. For instance, although both databases position the sensor at the waist using a belt, variations in orientation and characteristics may still arise. If the belt is fastened near the navel, the features obtained may differ from those acquired when the phone is attached closer to the side. The closer the sensor is to the center of mass, the smoother the signal, and this property may influence the neural network’s capacity to extract relevant features.
The features extracted by the model trained on the HAPT dataset proved to be more informative, enabling more accurate discrimination of the data than the raw data itself when applied to the SHO dataset, as can be observed when comparing Figure 8 and Figure 13.
The proposed visualization offers a lucid representation of the data distribution based on the network’s learned features. This aids in accomplishing several objectives, such as detecting and rectifying critical confusion, pinpointing biases, identifying mislabeled samples, revealing potential subgroups within a group, and more.
This approach is not confined to HAR alone. It can be employed in any problem involving time series data and deep learning algorithms, particularly convolutional neural networks.
This study paves the way for various research avenues, including the following:
Which validation strategy results in better separation of classes in a HAR dataset, SI or SD?
If additional sensor position data were incorporated into the SHO database, would it improve the separation of HAPT classes?
Is it possible to reverse the process? Starting from t-SNE embedding coordinates, can raw data be retrieved? If so, could this be utilized for data augmentation in the dataset?
How do the embeddings derived from other deep network models, such as LSTM, BiLSTM, GRU, MLP, and hybrid models, perform?
How can the proposed visualization technique be employed to avoid class confusion, for example by modifying the architecture to improve the differentiation of challenging classes?