1. Introduction
Falls are a significant public health issue. The World Health Organisation (WHO) estimates that each year 37.3 million falls require medical attention, while 684,000 falls are fatal [1], making falls the second leading cause of unintentional injury death worldwide. Among people who fall, certain groups are at higher risk due to cognitive or physical impairments, which can be attributed to factors including age [1,2], recent surgery [3], or conditions such as Parkinson’s disease [4], dementia [5], stroke [6], multiple sclerosis [7], and amputation [8].
Many technological developments in recent years have increased the capability to monitor gait in people at high risk of falling, such as the widespread adoption of smartphones and smartwatches containing sensors, the Internet of Things (IoT) and body sensor networks, and improvements in wearable sensors. With these advances, many studies aim to automate gait analysis by collecting real-time data from wearable sensors during tasks such as level-ground walking, navigating ramps, or ascending and descending stairs [9]. The data from these sensors can be analysed to aid healthcare professionals in diagnosing conditions affecting gait [10], performing gait analysis [11], or detecting fall events so that the severity of future falls can be reduced [9,12,13].
However, to enable remote, real-time gait analysis, the context from which the data are extracted must be provided to the specialist reviewing the data. Typically, this context is obtained through Human Activity Recognition (HAR), where classification methods are used to determine walking activity in real time from the collected data [9,14]. As many of these classification methods are supervised [9,14,15,16,17], a training dataset is required to build models capable of identifying activities with high accuracy. Past studies have created such datasets with a wide array of sensors, pre-processing techniques, classification methods, and validation methods, making it difficult to determine which factors contribute most towards obtaining high accuracy when designing novel sensor systems [9,14,18].
In the literature, HAR studies can be separated into two categories: those that prioritise convenience, typically making use of a smartphone or smartwatch [9,19], and those that prioritise accuracy by implementing a multimodal sensor system, which can be cumbersome to wear [9,20,21]. In addition to their potential for accuracy, multimodal systems typically collect more appropriate quantities of data for remote gait analysis by allowing the system to collect data from multiple areas of interest through a body sensor network [22].
Existing studies on finding the optimal sliding window parameters for HAR have demonstrated a range of results in different contexts. Banos et al. [23] studied the effect of window size on classification performance for a single dataset featuring accelerometers placed on each thigh, shank, upper arm, and forearm, as well as the back [24]. This work highlights the need for a balance between high accuracy and rapid decision times and finds that larger window sizes do not correlate with increased classification performance, with the optimal window sizes occurring below 2 s using Decision Trees (DTs), K-Nearest Neighbors (KNN), naïve Bayes, and a nearest-centroid classifier. Similarly, Niazi et al. [25] analysed the co-dependency of window size and sample rate to determine which parameters enable the highest classification accuracy using Random Forests (RFs) and a single hip-worn accelerometer. This study found that window sizes of 2–10 s were optimal, contrasting with the results of Banos et al. [23]. Both of these studies highlight that future work is needed to consider additional technologies and sensor types. Li et al. [26] discuss the difficulty of determining an optimal window size for a given application, instead choosing to use a different window size for each activity based on its temporal properties, which increases classification performance. Finally, Dehghani et al. [27] considered the effects of overlapping versus non-overlapping sliding windows, under both subject-dependent and subject-independent cross-validation, on HAR performance using data collected with inertial sensors and classified with DTs, KNN, naïve Bayes, and a nearest-centroid classifier. This study found that performance across all classifiers was reduced when using subject-independent cross-validation and that, under this condition, overlapping sliding windows did not improve model performance compared to non-overlapping windows [27].
Regarding sensor placement, Duan et al. [28] placed seven accelerometers on the upper arms, wrists, thighs, and chest to determine how sensor location affects classification accuracy. This study found that sensors placed on the subjects’ dominant side (the right side in all cases for this study) exhibited increased performance, with the right wrist being the highest-performing sensor location when used alone. Furthermore, this study evaluated RF models alongside deep learning techniques such as convolutional neural networks, transformers, and long short-term memory models. Kulchyk et al. [29] analysed the performance of sensors positioned on the sternum, left thigh, right ankle, and right shoulder using a convolutional neural network under both subject-dependent and subject-independent cross-validation. This study found the right ankle to be the optimal sensor location, with multiple sensor pairs including the ankle sensor resulting in 100% classification accuracy [29]. Finally, Khan et al. [30] placed five sensor nodes consisting of accelerometers and gyroscopes on each forearm, the waist, and each ankle and performed HAR using simple logistic regression, naïve Bayes, and sequential minimal optimisation classifiers. The study found that individual sensor performance depended on activity type, with sensors on the chest and thigh being optimal for stationary tasks, whilst sensors on the thigh, lower back, and ankle performed better for movement tasks [30]. Many studies that consider sensor placement for HAR consider only accelerometers or Inertial Measurement Units (IMUs) [28,29,30,31,32], leaving much room for sensor position analysis using additional technologies that can capture motion data.
Overall, these studies highlight a gap in the literature for multi-dataset studies that aim to identify trends in both optimal window size and optimal sensor placement across multiple datasets and with additional motion-related technologies and sensors. As stated by Banos et al. [23], such studies form a guideline for future researchers faced with determining sensor locations and sliding window parameters, and they contribute towards a knowledge base of the interactions between analytical parameters and sensors in HAR using different classifiers, so that researchers and system designers can avoid performing lengthy brute-force searches across high-dimensional search spaces for individual HAR applications.
The contributions of this study, therefore, are to identify these optimal analytical methods, sensor placements, and sensor types, adding to existing knowledge of HAR classification co-dependencies such as window size, sensor type, and sensor location. The novel approach of a normalised cross-comparison of different datasets, controlling variables such as the number of participants, the activity types, the sample rate, and the window size for the sliding window technique, creates a robust analysis that can identify trends with greater generalisability than the current state-of-the-art. The results of this study therefore offer reliable insights into the performance capabilities of individual sensor types and how these differ based on their locations on the body. This analysis will help future researchers design more lightweight sensor systems that decrease the computational burden of HAR while maintaining high levels of accuracy, comfort, and convenience.
2. Materials and Methods
Four datasets were selected for this study, featuring a wide variety of sensor systems, an appropriate number of participants for sufficient model generalisation, and walking activities that are comparable between datasets. A description of each dataset, along with the reasons it was chosen for this analysis, follows.
2.1. Dataset 1: USC-HAD
The USC-HAD dataset [33] was published in 2012 and features 14 participants with a mean (standard deviation; std) age, height, and weight of 30.1 (std: 7.2) years, 170 (std: 6.8) cm, and 64.6 (std: 12.1) kg, respectively. Each subject was equipped with a single ‘MotionNode’ IMU containing a 3-axis accelerometer, gyroscope, and magnetometer, totalling nine data channels. The IMU was mounted on the participant’s anterior right hip in a pouch designed for mobile phones. Data were recorded using a laptop, held under the arm and pressed to the waist by the subject, which was connected to the IMU via a cable.
The USC-HAD dataset features 12 activities which were performed at the participants’ own pace [33]. These activities were walking forwards, left, and right; walking upstairs and downstairs; running; jumping; sitting; standing; sleeping; and going up and down in a lift.
USC-HAD was chosen because it has been widely explored in the literature since its publication [15,16,34]. This dataset therefore acts as a control for the newer datasets, validating the chosen methods and models.
2.2. Dataset 2: HuGaDB
The HuGaDB dataset [35] was published in 2017 and features 18 participants with a mean age, height, and weight of 23.67 (std: 3.69) years, 179.06 (std: 9.85) cm, and 73.44 (std: 16.67) kg, respectively. The sensor system worn by each participant consisted of IMUs placed on the thigh, shank, and foot and an Electromyography (EMG) sensor placed on the vastus lateralis, each sampled at approximately 60 Hz. This setup was mirrored on each leg, for a total of six IMUs and two EMG sensors.
Participants were asked to perform the following 12 activities at their usual pace: walking, running, navigating stairs, sitting (stationary), sitting down, standing up, standing (stationary), cycling, going up and down in a lift, and sitting in a car [35].
2.3. Dataset 3: Camargo et al.
Camargo et al. [36] created an open-source dataset for the study of lower-limb biomechanics in 2021, featuring 22 healthy participants with a mean age, height, and weight of 21 (std: 3.4) years, 170 (std: 7.0) cm, and 68.3 (std: 10.83) kg, respectively. Subjects were equipped with 11 EMG sensors, 3 goniometers, and 4 six-axis IMUs on their right side only. Sensor locations and sample rates can be found in Table 1.
Whilst participants performed only a small set of basic activities, the transition states were also labelled, raising the activity count to 19 [36]. With the ‘idle’ class removed, as no activities were performed in it, 18 walking activities remained, consisting of the core activities and the transitions between them. These core activities were ramp ascent, ramp descent, stair ascent, stair descent, standing, turning, and walking.
2.4. Dataset 4: CSL-SHARE
CSL-SHARE is a dataset published in 2021 for the purpose of exploring activity recognition for common sport-related movements [37]. The sensor system is a multimodal, knee-mounted system featuring two six-axis IMUs placed on the thigh and shank, four EMG sensors placed on the vastus medialis, tibialis anterior, biceps femoris, and gastrocnemius, a goniometer placed on the lateral knee, and an airborne microphone. Like the Camargo et al. dataset, these sensors were placed on the right leg only. The CSL-SHARE dataset features 22 activities, and its data were upsampled to 1000 Hz due to the differing sample rates of the various sensors [37].
2.5. Summary of Datasets
The datasets chosen for this study cover a variety of environments, activities, and sensor configurations. Analysis of the datasets with the same Machine Learning (ML) models and pre-processing methods will provide insight into how sensor configuration and type affect classification accuracy in HAR. A comparison of these datasets can be found in Table 2.
2.6. Dataset Preprocessing
2.6.1. Normalisation Between Datasets
As this study focuses on the sensor types in the HAR datasets, steps were taken to remove variations between datasets. Of the variables in Table 2, the participant numbers, activity types, and sample rates were normalised. To achieve this, the number of participants in each dataset was limited to the minimum available across all datasets, which was 14, with additional participants excluded where appropriate to maintain a fair comparison. For example, in CSL-SHARE, participants 2, 11, and 16 contained differing data due to varying protocol versions, device communication issues, and a participant stopping early due to knee pain; these participants were therefore removed before cropping the number of participants down to 14. Of the activities included in the chosen datasets, only walking, standing, stair ascent, and stair descent were common to all datasets and are activities of interest with respect to fall-related research [38,39]. The remaining activities were therefore removed from each dataset. Finally, 100 Hz was chosen as the common sample rate: the Camargo et al. and CSL-SHARE datasets were subsampled to 100 Hz, whilst HuGaDB was first interpolated up to 300 Hz with 5th-order polynomial interpolation before being subsampled to 100 Hz.
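As an illustration of this rate normalisation, the following Python sketch upsamples a 60 Hz HuGaDB channel to 300 Hz with 5th-order (quintic) interpolation and then subsamples to 100 Hz. The exact interpolation routine and subsampling method used in the study are not reported, so the choices here (SciPy's interp1d and plain decimation by slicing) are assumptions, and the variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import interp1d

def resample_hugadb(signal: np.ndarray, fs_in: float = 60.0) -> np.ndarray:
    """Interpolate a 60 Hz channel up to 300 Hz, then subsample to 100 Hz."""
    t_in = np.arange(len(signal)) / fs_in
    t_mid = np.arange(len(signal) * 5) / 300.0  # 60 Hz -> 300 Hz (5x)
    # kind=5 requests a 5th-order (quintic) spline interpolator
    upsampled = interp1d(t_in, signal, kind=5, fill_value="extrapolate")(t_mid)
    return upsampled[::3]  # 300 Hz -> 100 Hz by keeping every third sample
```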
2.6.2. Filtering
Before being presented to the Machine Learning models, the data underwent a series of pre-processing steps. This process began with a 4th-order low-pass Butterworth filter with a cut-off frequency of 7 Hz, applied before windowing and feature extraction. This cut-off frequency was chosen through testing and lies near the 10 Hz mark that is typical for analyses using inertial sensors [19].
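A minimal sketch of this filtering step using SciPy follows. Whether the study applied the filter forwards only or as a zero-phase forward-backward pass is not reported, so the use of filtfilt here is an assumption.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_7hz(signal: np.ndarray, fs: float = 100.0) -> np.ndarray:
    """4th-order low-pass Butterworth filter with a 7 Hz cut-off."""
    b, a = butter(N=4, Wn=7.0, btype="low", fs=fs)
    # filtfilt runs the filter forwards and backwards for zero phase shift
    return filtfilt(b, a, signal, axis=0)
```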
2.7. Feature Extraction
As is typical when performing classification with time-series data, semi-overlapping sliding windows are used to extract statistical features, such that a single sample represents a larger time window of raw data. The size of these windows and the amount of overlap vary between studies: smaller window sizes are preferable for real-time classification, whilst larger window sizes consider more of the gait cycle per sample, which may result in higher classification accuracies. For this study, a search was performed to identify trends in accuracy from a 1 s to 10 s window size, with a 75% window overlap for each window size. This fixed overlap was chosen to combine co-dependent sliding window parameters and reduce computation times.
For each window of the time-series data, a wide array of statistical features were extracted to enable the ML models to make accurate predictions. There is little consensus on which features are necessary for accurate HAR, with many studies considering a mean of 15 features [15,40,41,42,43,44,45,46]. This analysis included 22 features from each sensor, including commonly chosen features from existing research [15,42,43,44,45,47]. Most of these features were extracted from the raw data in the time domain, with Fourier transforms being used to obtain additional features from the frequency domain. Feature selection methods were then used to eliminate noisy features before classification. This combination of an increased number of features with appropriate feature selection techniques ensured that relevant data from each sensor were present, allowing a sensor-focussed analysis. The list of included features is as follows (a sketch of the windowing and per-window feature computation is given after the list):
Maximum value.
Minimum value.
Mean.
Median.
Standard deviation.
Mean absolute deviation.
Median absolute deviation.
Number of zero crossings.
Root mean square.
Maximum gradient.
Kurtosis.
Skewness.
Variance.
Interquartile range.
Entropy.
Energy.
Maximum frequency amplitude.
Mean frequency amplitude.
Maximum power spectral density.
Mean power spectral density.
Frequency kurtosis.
Frequency skewness.
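To make the windowing and feature computation concrete, the sketch below generates semi-overlapping windows for a single channel and computes one plausible realisation of the 22 features. The exact entropy and power-spectral-density estimators used in the study are not reported, so those definitions are assumptions.

```python
import numpy as np
from scipy.stats import entropy, iqr, kurtosis, skew

def sliding_windows(x: np.ndarray, fs: float, win_s: float, overlap: float = 0.75):
    """Yield semi-overlapping windows from a single-channel signal."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))  # 75% overlap -> step is 25% of window
    for start in range(0, len(x) - win + 1, step):
        yield x[start:start + win]

def window_features(w: np.ndarray) -> np.ndarray:
    """One plausible realisation of the 22 per-channel features."""
    spectrum = np.abs(np.fft.rfft(w))
    psd = spectrum ** 2 / len(w)                  # simple periodogram estimate
    hist, _ = np.histogram(w, bins=16, density=True)
    return np.array([
        w.max(), w.min(), w.mean(), np.median(w), w.std(),
        np.mean(np.abs(w - w.mean())),            # mean absolute deviation
        np.median(np.abs(w - np.median(w))),      # median absolute deviation
        ((w[:-1] * w[1:]) < 0).sum(),             # number of zero crossings
        np.sqrt(np.mean(w ** 2)),                 # root mean square
        np.max(np.gradient(w)),                   # maximum gradient
        kurtosis(w), skew(w), w.var(), iqr(w),
        entropy(hist + 1e-12),                    # distribution entropy
        np.sum(w ** 2),                           # energy
        spectrum.max(), spectrum.mean(),          # frequency amplitudes
        psd.max(), psd.mean(),                    # power spectral density
        kurtosis(spectrum), skew(spectrum),       # frequency kurtosis/skewness
    ])
```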
After feature extraction, the data were split into train and test sets by leaving out the data from a single subject. Scikit-Learn’s ‘MinMaxScaler’ was then fit to the train set and applied separately to the train and test sets to scale each feature between 0 and 1. Principal Component Analysis (PCA) was performed to reduce the number of features; as with the scaler, the PCA was fit to the train set and applied separately to the train and test sets. The number of selected principal components varied for each dataset, owing to the sensor-dependent feature sets, and was chosen as the minimum number required to retain 95% of the variance of the full feature set. Finally, another round of scaling was performed to prepare the data for the Machine Learning algorithms.
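These scaling and PCA steps can be expressed as a Scikit-Learn pipeline that is fit only on the training subjects, so no information from the held-out data leaks into the transform. Here X_train and X_test are assumed feature matrices from the split described above.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

preprocess = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.95)),  # minimum components for 95% variance
    ("rescale", MinMaxScaler()),      # second scaling round for the classifiers
])
X_train_t = preprocess.fit_transform(X_train)  # fit on the train set only
X_test_t = preprocess.transform(X_test)        # apply the same transform to test
```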
2.8. Cross-Validation and Test Data
Two methods of cross-validation and testing are prevalent in the literature for gait- and fall-related studies: subject-dependent analysis using Train-Test Split (TTS) cross-validation and subject-independent analysis using Leave-One-Subject-Out (LOSO) cross-validation [27,48]. TTS cross-validation uses a set percentage of the total data from all subjects as test and validation data, whilst LOSO leaves out the data from a specific subject. Each method offers differing advantages and disadvantages: TTS creates models with higher accuracies at the cost of poor generalisation, whilst LOSO typically creates models with lower accuracies that perform better on data from new subjects. For this study, both TTS and LOSO cross-validation were used to make the results applicable to both types of devices and more comparable with existing and future studies.
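Both splitting schemes are available in Scikit-Learn. In this sketch, X, y, and subject_ids are assumed arrays of window features, labels, and per-window subject identifiers, and the 20% TTS test fraction is an assumption rather than the study's reported value.

```python
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

# Subject-independent (LOSO): each fold holds out all windows from one subject.
loso = LeaveOneGroupOut()
for train_idx, test_idx in loso.split(X, y, groups=subject_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate a model on this fold here

# Subject-dependent (TTS): windows from all subjects are pooled and split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```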
2.9. Models
For classification, the KNN, Support Vector Machine (SVM), DT, RF, and Artificial Neural Network (ANN) models, an ensemble voting classifier, and an ensemble stacking classifier were chosen due to their prevalence in the literature. The ensemble models were constructed from the five individual models (KNN, SVM, DT, RF, and ANN), with either a majority vote or a logistic regression classifier fusing the decisions. Including a variety of ML models reduced the variation in classifier performance that could be introduced by the properties of each model, such as how prone it is to overfitting and how dataset size affects its classification performance.
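A sketch of this model set and the two ensembles using Scikit-Learn follows. The hyperparameters shown are library defaults rather than the tuned values, and whether the voting classifier used hard or soft voting is not stated, so soft voting is an assumption here.

```python
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The five base models; defaults only, not the tuned hyperparameters.
base = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),  # probabilities needed for soft voting
    ("dt", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier()),
    ("ann", MLPClassifier(max_iter=1000)),
]
voting = VotingClassifier(estimators=base, voting="soft")
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000))
```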
Hyperparameter tuning was performed using 25 iterations of Scikit-Optimize’s Bayesian hyperparameter search. All models were trained on a computer with 32 GB of RAM, a 12th-generation Intel i9-12900K processor, and a 12 GB Nvidia RTX 3060 GPU, using the Scikit-Learn library for Python version 3.9.18.
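A hedged sketch of the tuning step with Scikit-Optimize follows; the search space shown for the SVM is illustrative, as the per-model spaces are not reported in the text.

```python
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.svm import SVC

search = BayesSearchCV(
    estimator=SVC(),
    search_spaces={
        "C": Real(1e-3, 1e3, prior="log-uniform"),
        "gamma": Real(1e-4, 1e1, prior="log-uniform"),
    },
    n_iter=25,  # 25 iterations of Bayesian optimisation, as in the study
    cv=3,       # inner cross-validation folds (assumed)
    n_jobs=-1,
)
search.fit(X_train_t, y_train)
print(search.best_params_)
```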
2.10. Performance Metrics and Evaluation
To assess the performance of each model, this study considered both macro-average accuracy and the macro-average F1-score. While macro-average accuracy provides a straightforward overview of a model by reporting the mean classification accuracy across all classes, it can be misleading in the presence of large class imbalances, as it does not account for differences in class distribution. To address this, the macro-average F1-score was also reported, which provides a more balanced measure of performance across classes. For each dataset, walking was the dominant class, with around 10× more walking data than stair ascent or stair descent data. The amount of standing data varied between datasets but was typically around 2–3× that of the stair ascent and stair descent classes.
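Both metrics can be computed with Scikit-Learn as follows; macro-average accuracy is interpreted here as the unweighted mean of per-class recall, which is one common reading of the term, so this definition is an assumption.

```python
from sklearn.metrics import f1_score, recall_score

# Unweighted mean of per-class recall (balanced accuracy across classes)
macro_accuracy = recall_score(y_test, y_pred, average="macro")
# Unweighted mean of per-class F1-scores
macro_f1 = f1_score(y_test, y_pred, average="macro")
```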
4. Discussion
The results of the window size analysis did not exhibit a consistent peak or plateau, with accuracies appearing volatile across the four datasets for each window size and trend lines displaying misaligned peaks. Furthermore, the averaging of accuracies across all models at each window size showed no clear single optimal window size across the four datasets and methods of cross-validation.
It must be noted that the performance metrics of the Camargo et al. dataset did not align with those of the other multimodal datasets in terms of overall classification accuracy. These systems all made use of the same six-axis IMU positioned on the thigh, yet the Camargo et al. dataset achieved significantly reduced accuracies when trained on only this sensor compared to HuGaDB and CSL-SHARE. Given the large number of controlled variables in this study, this indicates a difference in experimental procedure or activity data distribution that negatively affected the results of the Camargo et al. dataset.
Figure 9a shows the confusion matrix for an SVM trained on the Camargo et al. dataset, in which the misclassifications occur between the stair ascent and stair descent classes. This is shown not to be caused by sample weighting, as Figure 9b,c show the confusion matrices for the HuGaDB and CSL-SHARE datasets, respectively, which feature more extreme sample weightings than the Camargo et al. dataset whilst achieving 100% accuracy.
Figure 9 highlights SVMs as the most effective individual models for HAR under subject-dependent cross-validation, with ANNs proving more effective under subject-independent cross-validation. This is likely due to the tendency of ANNs to overfit, which was further pronounced by the use of a TTS to create the test data for subject-dependent cross-validation, whereas SVMs typically perform well in these scenarios owing to the maximisation of the margin when creating a decision boundary.
For subject-dependent cross-validation, peak accuracies occurred at smaller window sizes, ranging from 2–5 s. The trend lines in Figure 1 and Figure 5 also exhibit rises in accuracy for some models as they approach a 10 s window size, indicating that, if a dataset contains enough samples in each class for this to be viable, larger window sizes offer richer features which lead to higher classification accuracies. For subject-independent cross-validation, the highest-performing model accuracies occurred at 2, 3, 5, and 10 s for the HuGaDB, CSL-SHARE, Camargo et al., and USC-HAD datasets, respectively. Apart from USC-HAD, this further highlights 2–5 s as an effective range of window sizes for achieving high classification accuracy on the core activities of HAR.
Aside from the Camargo et al. dataset, the multimodal datasets achieved much higher classification accuracies than the single-sensor USC-HAD dataset when using the same models and window sizes, allowing high accuracies to be obtained with much smaller windows. This has significant implications for the delay time, portability, and convenience of systems, as increasing the number of sensors can enable high-accuracy HAR using very computationally inexpensive methods such as DTs. These low-cost methods also allow designers of real-time HAR systems to incorporate low-power computational devices with reduced size profiles and battery consumption, increasing the comfort and convenience of the devices. Additionally, because high accuracies can be obtained in multimodal systems with small window sizes, much faster response times can be achieved for real-time HAR systems; some models trained on the CSL-SHARE dataset achieved 100% accuracy using just 1 s windows with a 0.25 s fixed delay caused by the step size. Whilst it was shown that accuracy at each window size was dependent on the sensor types used in each dataset, further work is needed to identify how model performance varies with window size for each individual sensor type. This will enable the building of a knowledge database to help future researchers choose a window size given a sensor system, without the need for lengthy, brute-force approaches to finding the most appropriate window size, combination of sensors, and choice of model for each novel dataset produced in this field.
Regarding individual sensor types, the IMUs and three-axis goniometers generally exhibited the highest accuracies, followed by the two-axis goniometers and, finally, the EMG sensors. Among IMU locations, accuracy varied, with no clear ranking across all datasets. Only the Camargo et al. and CSL-SHARE datasets featured goniometers, with the three-axis goniometers at the thigh and ankle in the Camargo et al. dataset showing large performance improvements over the two-axis goniometers located on the knee in both datasets. Goniometers are low-power devices with fewer data dimensions than IMUs and can be incorporated into smart clothing to improve comfort and convenience. Given their competitive performance in this study, three-axis goniometers should be considered in future datasets and HAR systems. On the other hand, EMG sensor performance was volatile between locations and datasets, which may be due to differences in filtering methods, varying placements on muscles, or changes in experimental procedure. As such, it is not currently possible to compare the locations of these sensors, particularly with so few datasets for reference. More datasets are required to rank EMG sensor locations accurately so that the impact of differences in experimental setup can be minimised.
Regarding the sample rates of each dataset, no correlation was present between the native sample rates of each dataset and the final classification accuracy, with the HuGaDB dataset exhibiting far higher accuracies than USC-HAD and the Camargo et al. dataset, despite having the lowest native sample rate of 60 Hz. As such, whilst sample rate is expected to have an effect at even lower values, 60 Hz can be considered a sufficient sample rate for high-accuracy HAR.
These results align with the findings of Banos et al. [23], who found that increased window size does not necessarily increase activity classification performance. However, our study also offers insight into the reason for this, with subject-dependent cross-validation demonstrating this pattern until accuracy and F1-score began to fall at larger window sizes due to insufficient sample sizes. Crucially, this work considers both subject-dependent and subject-independent methods of cross-validation, highlighting how the choice of cross-validation method impacts the selection of an optimal window size, which was not considered in [23]. Niazi et al. [25] considered the effect of window size and sample rate on classification accuracy using an RF classifier, reporting that window sizes between 2–10 s could appear optimal under subject-dependent cross-validation. Our results support these findings and demonstrate that they also apply to additional classical Machine Learning models such as the ANN, SVM, KNN, and DT. Duan et al. [28] considered the optimal placement of sensors using deep learning techniques on a single dataset, finding that sensors placed on the right leg exhibited increased performance. Our results align with this finding, with the HuGaDB dataset demonstrating that, under subject-independent cross-validation, the performance metrics of the right leg were higher than those of the left. Finally, Khan et al. [30] report that sensor performance depends on the activities being performed in the dataset. By removing the variation between datasets, our study controlled for this factor, resulting in a reliable ranking of sensor locations that achieved high performance and offering future researchers the information necessary to build effective HAR systems.
Finally, this study featured several limitations due to the computational cost of performing this analysis. The first was the lack of investigation into the effects of window step size, which was fixed at 25% of the total window size; this could instead have been set to a fixed time value for all window sizes or analysed individually to explore the co-dependent effects of step size and window size. Furthermore, the availability of datasets featuring a sufficiently large number of participants and sensors, along with the core activities included in this study, was limited, resulting in the inclusion of just four datasets.
5. Conclusions and Future Work
This study is the first of its kind to provide a bias-reduced, normalised, cross-dataset analysis to determine and rank the highest-performing sensor types for Human Activity Recognition. First, ANNs were found to be the highest-performing models across multiple multimodal HAR datasets, closely followed by SVMs, with the optimal window size lying in the range of 2–5 s when using the semi-overlapping sliding window approach to feature engineering with a 75% overlap. Where datasets were large enough to reduce the impact of class imbalance, or models were sufficiently powerful to generalise from smaller sample numbers, accuracies also trended upwards at larger window sizes of 9–10 s. Regarding the contributions of individual sensor types to classification accuracy, IMUs placed on the thigh and three-axis goniometers on the thigh and ankle were the largest overall contributors to high-accuracy HAR, whilst EMG sensors exhibited volatile accuracies, likely due to the difficulty of ensuring that the sensors were placed identically and calibrated equally across subjects. It remains appropriate for researchers to collect large HAR datasets and to investigate alternative methods of HAR using multimodal sensor systems and smart clothing, in order to investigate how the size and inconvenience of these systems can be minimised whilst maintaining high accuracy using low-computational-complexity classification methods.
This study was limited by the scarcity of open multimodal gait datasets with large numbers of sensors and common activities. As a result, future work in this area should consider more datasets, activities (including fall-related activities), and sensor types to investigate how classifier performance in HAR is affected by these properties. Additionally, elements such as step size, the proportion of data for each activity, and time-series features should be investigated for their contribution towards achieving efficient and convenient high-accuracy HAR. Finally, the time and space complexity of these algorithms should be considered under the various window sizes to evaluate the feasibility of deploying these optimised models in real-world HAR applications.