Abstract
Exoskeletons have decreased physical effort and increased comfort in activities of daily living (ADL) such as walking, squatting, and running. However, this assistance is often activity-specific and does not accommodate a wide variety of activities. To overcome this limitation and broaden the scope of exoskeleton applications, an automatic human activity recognition (HAR) system is necessary. We developed two deep-learning models for HAR: a one-dimensional convolutional neural network (CNN) and a hybrid model combining CNNs with long short-term memory (LSTM). We trained both models on data collected from a single three-axis accelerometer placed on the chest of ten subjects. The models classified five activities, standing, walking on level ground, walking on an incline, running, and squatting, with accuracies of 98.1% (CNN) and 97.8% (hybrid). A two-subject real-time validation trial was also conducted to validate the real-time applicability of the system. The real-time accuracy was 96.6% for the CNN model and 97.2% for the hybrid model. The high classification accuracy in the test and real-time evaluations suggests that a single sensor can distinguish human activities using machine-learning-based models.
1 Introduction
Recent developments in exoskeletons have enhanced locomotion and increased the quality of life in activities of daily living [1–6]. Such exoskeletons have shown a substantial reduction in physical effort in walking [7–9], running [10], squatting [5,11], and lifting applications [12]. Although tested in lab settings, this activity-specific assistance is poised to improve the daily lives of a large population. However, switching between activities and providing appropriate assistance for each remains a significant challenge for exoskeleton control systems.
Several studies have developed automatic assistance-switching methods using sensors embedded in the exoskeleton, identifying the activity and providing assistance accordingly [3,13]. Kim et al. [3] used hip- and thigh-mounted inertial measurement unit (IMU) sensors and developed a heuristic classification method to identify walking and running. Since the heuristic method is subject-dependent and requires substantial parameter tuning, this approach is not practical for general-purpose use. Medrano et al. [13] used the ankle angle and torque sensors of an ankle exoskeleton with an extended Kalman filter to accurately identify ground elevation. These methods, however, are exoskeleton-specific, making it challenging to transfer the same algorithm to a different exoskeleton. In addition, the algorithms must be redeveloped iteratively for each new device during the design stage. To address these issues, it would be helpful to use dedicated sensors for activity recognition instead of the sensors embedded in the exoskeleton.
Various types of sensors have been used for activity classification: vision-based devices such as cameras [14], depth-based Kinect sensors [15], and wearable devices such as inertial measurement units [16–18]. Hussain et al. [19] performed a detailed analysis of the various sensor types used for human activity classification and their applications. Vision-based and depth-based sensors require constant filtering and stabilization of the image due to the large variation introduced by walking/running movement [20], which increases the computational requirements of the classification system. This disadvantage of vision-based systems makes wearable sensors a better choice for activity classification studies.
For wearable sensors, the position of the sensor on the user has been shown to affect system performance significantly [21]. Comparison studies have been conducted to find the best position for the wearable sensor [22]. Qamar et al. [21] found that a model's accuracy is higher when the sensor is placed on the user's chest rather than on the hand and/or ankle. In addition, this position is independent of most lower-limb exoskeletons and would not interfere with their operation.
A common approach to wearable sensor-based activity recognition is feature extraction from the data collected by the sensors. Janidarmian et al. [22] compared a model's performance using different features such as mean, standard deviation, pitch, and roll angle. Though such methods have progressed well using shallow machine-learning techniques [23], they face challenges extracting distinctive features of complex activities, such as climbing stairs or walking on an inclined surface, without hand-tuning [24–26]. A deep-learning system could address these issues and support an activity recognition system with a broad range of activities.
Recently, convolutional neural networks (CNN) have shown promise in the field of human activity recognition [27,28]. CNNs are known to capture relevant features from temporal data, and pooling layers can introduce robustness to variance in the measurement [29]. In addition, compared to fully connected layers, CNNs have fewer trainable parameters and are hence much faster to train [29,30]. CNNs are also more robust to sensor position and data window size than fully connected neural networks and traditional machine-learning approaches such as support vector machines (SVM) [31]. In addition to these advantages, CNNs have been shown to have better accuracy per memory usage [32] than other deep-learning architectures.
For human activity recognition (HAR), researchers have also used recurrent neural networks (RNN) because of their efficiency in dealing with time-series data [33]. However, RNNs are plagued by the exploding/vanishing gradient problem, which hampers learning the model weights. Long short-term memory (LSTM) networks, a form of RNN, overcome this problem, making them more suitable [34]. The forget gate of an LSTM controls the amount of long- and short-term memory the cell retains, enabling unbiased continuous predictions [35]. Britz et al. [36] showed that LSTMs perform better than other RNN variants such as gated recurrent units. Though many variants of LSTMs have been introduced, Greff et al. [37] showed that these modifications do not significantly improve on the standard variant. They also found that the forget-gate and output-gate activations strongly influence the performance of an LSTM model. Although LSTMs perform well on temporal data, they fall behind CNNs in computational complexity and memory usage [38,39]. To compensate for these drawbacks, LSTMs have been used in combination with CNNs [40].
Although the current state-of-the-art methods use CNNs and LSTMs, they typically require multiple sensors [41,42,45], place sensors in less practical positions such as the ankle [41,42] or hip [43,44], or use longer window sizes (1.6 s–5.12 s) [42,46], as shown in Table 1. For an exoskeleton application, a model with few sensors and a short window size is critical: a large window could introduce considerable lag in the predictions, which could be detrimental to the user. We propose a HAR system with a single sensor and a short time window to address these issues.
In this study, we hypothesized that CNN- and LSTM-based models can recognize human activities using a single chest-mounted accelerometer. To test this hypothesis, we developed two models using state-of-the-art deep-learning architectures. The first model consists of only CNN layers, while the second is a hybrid of CNN and LSTM layers, as described in Secs. 2.1 and 2.2, respectively. Through a controlled human subject study, we collected data from ten subjects performing running, walking on level ground, walking on an incline, standing, and squatting. We trained and tested both the CNN and hybrid models using the collected data and then validated the real-time accuracy in a two-subject trial. The data-collection process for training and testing is described in Sec. 2.3. Section 3 describes the performance of the two models during offline and real-time validation. The implications of these results are discussed in Sec. 4. Finally, Sec. 5 provides concluding remarks and the future scope of this project.
2 Methods
2.1 Model Architecture: Convolutional Neural Network.
To develop the CNN model, the architecture of Yang et al. [56] was used as a starting point. After preliminary testing, to improve performance, we increased the number of convolutional layers from four to six and increased their kernel size from three to five. Along with these architectural changes, the number of filters in each convolutional layer was doubled. Padding was used to keep the input and output tensor lengths the same, and all convolutional layers use the rectified linear unit (ReLU) activation. Each convolutional layer was followed by a max-pooling layer with kernel size and stride of 2 to reduce the size of the tensor, and by a dropout layer with a dropout rate of 0.3 to reduce the model's tendency to over-fit the training dataset. Finally, the output of the last convolutional layer is flattened and passed through a fully connected dense layer whose output dimension equals the number of classes. A softmax activation on this output layer returns the probability of each label for the given input. This model contains 638,278 trainable parameters. Figure 1(a) illustrates the architecture of the CNN model.
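For concreteness, the following is a minimal PyTorch sketch of this architecture. The per-layer filter counts are assumptions (the paper reports the total parameter count but not the individual widths), so the sketch illustrates the structure rather than reproducing the exact model.

```python
import torch
import torch.nn as nn

class ConvHAR(nn.Module):
    """Six conv blocks (conv -> ReLU -> maxpool -> dropout) + dense output.
    Filter widths below are illustrative assumptions."""

    def __init__(self, n_channels=3, n_classes=5, win_len=200):
        super().__init__()
        blocks, in_ch = [], n_channels
        for out_ch in [32, 32, 64, 64, 128, 128]:  # assumed widths
            blocks += [
                nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),  # same length
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2, stride=2),
                nn.Dropout(0.3),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        # six pooling layers shrink the 200-sample window to 200 // 2**6 = 3
        self.classifier = nn.Linear(in_ch * (win_len // 2**6), n_classes)

    def forward(self, x):  # x: (batch, 3, 200)
        logits = self.classifier(self.features(x).flatten(1))
        return logits      # apply softmax at inference time
```

Returning raw logits keeps the sketch compatible with PyTorch's categorical cross-entropy loss (Sec. 2.5), which applies log-softmax internally; softmax probabilities are then computed only at prediction time.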
2.2 Model Architecture: Hybrid Model (CNN + LSTM).
For this model, we use CNN layers as a feature extractor and LSTM layers for time-series classification. Figure 1(b) illustrates the architecture of the hybrid model (CNN + LSTM), in which two 1D-convolutional layers precede four LSTM layers. The two convolutional layers, which act as feature extractors, have 64 and 128 filters, respectively. Each convolutional layer was followed by a 1D max-pooling layer with kernel size and stride of 3, and a batch-normalization layer was added after every max-pooling layer to decrease overfitting to the training dataset. Each LSTM layer has 256 cells, and the output of the last LSTM layer is fed into a third one-dimensional convolutional layer with 128 filters. The resulting tensor is flattened and passed through a dense layer whose output dimension equals the number of classes, in our case 5. A softmax layer is then applied to obtain the probability of each class. All convolutional layers use ReLU activation, while the LSTM layers use the tanh activation function. The convolutional layers have a kernel size of 9 with stride 1, and no padding was added in the hybrid model.
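A minimal PyTorch sketch of the hybrid model, under the layer sizes stated above, could look as follows; the exact wiring of the classification head is an assumption based on Fig. 1(b).

```python
import torch
import torch.nn as nn

class HybridHAR(nn.Module):
    """Two conv blocks -> 4-layer LSTM -> conv head -> dense output."""

    def __init__(self, n_channels=3, n_classes=5):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=9),        # no padding
            nn.ReLU(), nn.MaxPool1d(3, 3), nn.BatchNorm1d(64))
        self.conv2 = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=9),
            nn.ReLU(), nn.MaxPool1d(3, 3), nn.BatchNorm1d(128))
        self.lstm = nn.LSTM(input_size=128, hidden_size=256,
                            num_layers=4, batch_first=True)  # tanh by default
        self.conv3 = nn.Sequential(
            nn.Conv1d(256, 128, kernel_size=9), nn.ReLU())
        # for a 200-sample window, 10 time steps remain after the conv head
        self.fc = nn.Linear(128 * 10, n_classes)

    def forward(self, x):                    # x: (batch, 3, 200)
        z = self.conv2(self.conv1(x))        # (batch, 128, 18)
        z, _ = self.lstm(z.transpose(1, 2))  # LSTM over time: (batch, 18, 256)
        z = self.conv3(z.transpose(1, 2))    # (batch, 128, 10)
        return self.fc(z.flatten(1))         # logits; softmax at inference
```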
2.3 Experimental Protocol.
We performed a one-day experiment and collected ten subjects' data to train and test the activity recognition system (10 male, height: 176.6 ± 8.2 cm, mass: 79.2 ± 10.4 kg, age: 24.6 ± 4.0 years). The data covered five activities following the data-collection protocol described in Table 2. During this protocol, each subject walked on level ground, walked on an incline (+4 deg), and stood on a treadmill for 3 min each, and ran on a treadmill for 4 min. Three minutes of sitting rest was provided between activities. The walking speed was set to 1.25 m/s on the treadmill for both the level-ground and incline conditions, and the running speed was set to 2.5 m/s. The detailed protocol for the squatting activity can be found in Ref. [5]. We used the day-1 data of that study, which included 180 squats per subject under the powered, unpowered, and no-exoskeleton conditions. Each squat lasted approximately 2 s and was followed by 6 s of standing rest. All squats were performed on the ground.
During all trials, the subject wore a Polar H10 (Kempele, Finland) chest strap and sensor to measure acceleration. The three-axis accelerometer data were collected at 200 Hz using a custom Python-based Bluetooth and Lab Streaming Layer (LSL) application. The range of the three-axis accelerometer was −2000 to +2000 cm/s2. The data from all ten subjects were combined and randomized for training and testing of the algorithms.
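As an illustration, acquiring such a stream on the receiving side could look like the sketch below, using the pylsl bindings; the stream type 'ACC' and other metadata are assumptions, since the paper does not describe the custom application's interface.

```python
from pylsl import StreamInlet, resolve_byprop

# Find the accelerometer stream published by the acquisition application
# (assumed stream type 'ACC').
streams = resolve_byprop('type', 'ACC', timeout=10.0)
inlet = StreamInlet(streams[0])

buffer = []                                  # rolling store of (x, y, z) samples
while True:
    sample, timestamp = inlet.pull_sample()  # one 3-axis reading at ~200 Hz
    buffer.append(sample)
    buffer = buffer[-200:]                   # keep the most recent 1 s of data
```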
To validate the tested model accuracy and generalizability, we conducted a two-subject validation trial following the validation protocol shown in Table 2. Of the two subjects (Sub1 and Sub2), Sub1 was an experienced participant who had taken part in the earlier data collection, while Sub2 was a novice participant. The experimental pipeline remained the same as the data-collection pipeline, but alongside the accelerometer data, each model's prediction was also recorded to analyze its accuracy. The models generated a prediction every 0.5 s. The study protocol was approved by the University of Illinois at Chicago Institutional Review Board, and informed consent was obtained from each participant.
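The real-time prediction step can be sketched as below; `model`, `buffer`, and `standing_mean` refer to the trained network (Sec. 2.1 or 2.2), the rolling sample buffer from the acquisition sketch above, and the normalization offset described in Sec. 2.4, and the glue code is an assumption.

```python
import time
import numpy as np
import torch

LABELS = ["standing", "level walking", "incline walking", "running", "squatting"]

model.eval()
while True:
    time.sleep(0.5)                                        # predict every 0.5 s
    window = np.asarray(buffer[-200:]) - standing_mean     # last 1 s, normalized
    x = torch.tensor(window.T[None], dtype=torch.float32)  # (1, 3, 200)
    with torch.no_grad():
        activity = LABELS[model(x).argmax(dim=1).item()]
    print(activity)                                        # logged alongside raw data
```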
2.4 Data Preparation.
Preprocessing of the raw data included converting the data into small fixed-length windows, each of which acts as a separate input to the model. For both models, we selected a window size of 1 s (i.e., 200 data points) and a window step size of 0.5 s based on preliminary testing. As transitioning between different activities takes about 0.3–0.7 s [57,58], a 0.5 s step size should be large enough to detect the transition while keeping the real-time detection responsive. Since the window step size is half the window size, each input sample has a 50% overlap with its predecessor. For the squatting activity, each squat was individually identified and segmented from the standing data.
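A minimal sketch of this windowing, assuming the recording is stored as an (n_samples, 3) array:

```python
import numpy as np

def make_windows(signal, fs=200, win_s=1.0, step_s=0.5):
    """Slice a (n_samples, 3) accelerometer stream into overlapping windows.

    With win_s=1.0 and step_s=0.5, consecutive windows overlap by 50%.
    Returns an array of shape (n_windows, 200, 3).
    """
    win, step = int(win_s * fs), int(step_s * fs)
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])
```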
To remove any influence of subject-specific body posture, the data were normalized by subtracting each subject's mean of the first minute of standing data. The data were then augmented to add more variance to the dataset, avoid overfitting, and improve the prediction accuracy of all classes. For the augmentation, we used three different techniques: noise injection, scaling, and magnitude warping [59–61] (a sketch of these transforms follows the list).
Noise injection: Gaussian white noise with a mean of 0 and a standard deviation within the range of 0–10 cm/s2 was added to the collected data.
Scaling: The absolute value of each sample was multiplied by a factor drawn from a Gaussian distribution with mean 1 and standard deviation in the range of 0–0.3.
Magnitude warping: Each sample was multiplied element-wise by a smooth curve generated using a cubic spline through n knots of random magnitude, where n lies between 5 and 12.
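The following sketch shows one plausible NumPy/SciPy implementation of the three transforms; drawing the per-call standard deviations uniformly from the stated ranges, and the spread of the warping knots, are assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng()

def noise_injection(x):
    """Add zero-mean Gaussian noise; std drawn from 0-10 cm/s^2."""
    return x + rng.normal(0.0, rng.uniform(0.0, 10.0), x.shape)

def scaling(x):
    """Multiply by a per-channel factor ~ N(1, sigma), sigma drawn from 0-0.3."""
    return np.abs(x) * rng.normal(1.0, rng.uniform(0.0, 0.3), (1, x.shape[1]))

def magnitude_warp(x):
    """Element-wise product with a cubic-spline curve through n random knots."""
    n = rng.integers(5, 13)                           # 5 <= n <= 12 knots
    knot_pos = np.linspace(0, len(x) - 1, n)
    knot_mag = rng.normal(1.0, 0.2, (n, x.shape[1]))  # knot spread is assumed
    curve = CubicSpline(knot_pos, knot_mag)(np.arange(len(x)))
    return x * curve
```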
During data collection, the squat period was fixed (1 s ascending, 1 s descending) to maintain consistency between subjects and conditions [5]. However, training on such a dataset could bias the model toward this fixed timing. To address this potential bias, during the augmentation procedure we resampled the squatting data to varying periods in the range of 1–2 s, making the model more robust and applicable to everyday scenarios.
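A simple way to realize this, under the assumption that uniform resampling of the whole squat adequately approximates faster or slower movements, is shown below.

```python
import numpy as np
from scipy.signal import resample

rng = np.random.default_rng()

def vary_squat_period(squat, fs=200):
    """Resample a segmented squat (n_samples, 3) to a random 1-2 s period."""
    new_len = int(rng.uniform(1.0, 2.0) * fs)
    return resample(squat, new_len, axis=0)
```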
The final dataset contains 27,895 samples, of which squatting contributes the most at 8266 and walking on level ground the fewest at 3650; running, standing, and walking on an incline contribute 6193, 4583, and 5203 samples, respectively. A 70–30 random split was used to create the training and testing datasets. The model parameters were trained only on the training dataset, while the testing dataset was used to validate performance.
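With the windows stacked into an array X and their labels in y, such a split could be produced as below; the fixed random seed is an illustrative assumption.

```python
from sklearn.model_selection import train_test_split

# 70-30 random split over the combined, shuffled windows of all ten subjects.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42)
```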
2.5 Training.
Categorical cross-entropy was used as the loss function and the Adam optimizer was used to update the weights of both the CNN and hybrid models. The two models were trained for 100 epochs, with each epoch taking 2 s for the CNN model and 7 s for the hybrid model. Both models were implemented in PyTorch and accelerated with CUDA on an Intel i5 CPU and an Nvidia GeForce GTX 1650 GPU.
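A minimal training loop under this setup is sketched below; the batch size, learning rate, and data-loader construction are assumptions, as the paper does not report them, and `ConvHAR` refers to the sketch in Sec. 2.1.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvHAR().to(device)                               # or HybridHAR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed lr
criterion = nn.CrossEntropyLoss()                          # expects raw logits

train_loader = DataLoader(
    TensorDataset(torch.tensor(X_train, dtype=torch.float32).transpose(1, 2),
                  torch.tensor(y_train, dtype=torch.long)),
    batch_size=64, shuffle=True)                           # assumed batch size

for epoch in range(100):
    model.train()
    for windows, labels in train_loader:                   # windows: (batch, 3, 200)
        optimizer.zero_grad()
        loss = criterion(model(windows.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```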
To further validate their performance, the models were trained and tested on two public datasets, UCI and PAMAP2. A window size of 5.12 s was chosen for the PAMAP2 dataset and 2.56 s for the UCI dataset, consistent with the papers shown in Table 1. Both models were trained on the pre-partitioned UCI training dataset for 100 epochs and tested on the corresponding test dataset. The PAMAP2 dataset was split into training and testing datasets with a 70–30 ratio, and the models were trained for 100 epochs.
3 Results
3.1 Offline Validation.
The CNN model achieved a maximum training accuracy of 99.9% after 100 epochs and a maximum test accuracy of 98.1% after 58 epochs. The hybrid model achieved a maximum training accuracy of 99.9% after 92 epochs and a maximum test accuracy of 97.8% after 89 epochs. The test accuracy of both models is higher than that of most state-of-the-art models (Table 1).
Figure 2(b) shows the training and test loss, calculated with the categorical cross-entropy function after each epoch, for the two models. The CNN model had a minimum training loss of 0.0029 after 100 epochs and a minimum test loss of 0.067 after 66 epochs. The hybrid model had a minimum training loss of 0.0042 after 95 epochs and a minimum test loss of 0.093 after 29 epochs.
Table 3 shows the accuracy and F1-score of each activity for both models; the checkpoints with the highest test accuracy during training were used to calculate these values. For both models, the squatting class returns the highest F1-score, at 0.99 and 1, respectively, and both running and squatting have the highest accuracy at 0.99 for both models. Walking on an incline has the lowest F1-score for both models, at 0.95 and 0.94, respectively, and also returns the lowest accuracy for the hybrid model, at 0.92.
The two models were also trained and tested on the public datasets UCI and PAMAP2 to validate their performance while keeping the parameters of our models the same (e.g., architecture, number of hidden layers, learning rate, and optimizer). The CNN model achieved accuracies of 0.91 and 0.90 on the two datasets, respectively, and the hybrid model achieved accuracies of 0.92 and 0.90.
When tested with a larger window size of 4 s, the performance of both models improved, with the CNN and hybrid models each achieving a test accuracy of 0.997. Due to the larger window, the training time per epoch increased from 2 s to 7 s for the CNN model and from 7 s to 18 s for the hybrid model, and the inference time increased from 0.003 s to 0.004 s.
3.2 Real-Time Validation.
After offline testing, a two-subject validation trial was conducted in which the predictions of both models were collected in real time.
The CNN model gave an overall accuracy of 0.989 and 0.943 for the two subjects. Running, standing, and squatting returned an accuracy of 1 for both subjects. For Sub1, the lowest accuracy, 0.96, was obtained for walking on level ground, while for Sub2 the lowest accuracy, 0.72, was for walking on an incline. The hybrid model gave an overall accuracy of 0.996 for Sub1 and 0.948 for Sub2. As with the CNN model, running, standing, and squatting had an accuracy of 1 for both subjects, and the lowest accuracy, 0.75, was obtained for Sub2 walking on an incline. The detailed confusion matrices for the real-time validation of both subjects are shown in Fig. 3.
4 Discussion
In this study, we developed and compared two neural network architectures using CNN and LSTM layers for human activity recognition. Both models were trained on ten subjects' data, with test and training accuracies of 98.1% and 99.9% for the CNN model and 97.8% and 99.9% for the hybrid model. Since real-time accuracy is crucial for exoskeleton applications, we validated real-time performance in a two-subject test, where the accuracy remained consistent at 96.6% and 97.2% for the CNN and hybrid models, respectively, indicating that the models can be deployed in real-world applications.
A HAR system based on a single chest-mounted accelerometer can be versatile for most exoskeletons. Both the CNN and hybrid models predicted the activity with high accuracy from the accelerometer data alone, even compared to multisensor setups [53]. With this high accuracy and a sensor position independent of most exoskeleton sensors, the same algorithm and sensor setup could be generalized across different exoskeletons without retraining the model. Future work could expand the study to different sensor positions for more practical applications.
The models were trained and tested on ten subjects' data. Both models achieved a high test accuracy, which could be partly due to the similarity between the test and training datasets, as they included the same participants; the models' performance could suffer on a completely new subject. To test this aspect, we evaluated both models in real time on a new subject (Sub2) whose data were not included in the training dataset. Both models returned reasonable accuracies of 0.94 and 0.95 for Sub2. Since subject 1 (Sub1), whose data were included in the training set, showed higher performance, future work includes improving the model's robustness to intersubject variability.
The CNN model was more memory- and computation-efficient than the hybrid model. With 638,021 trainable parameters, the CNN model performed better in terms of accuracy per memory usage than the hybrid model, which had 2,347,525 parameters. The CNN model also trained faster, taking 2 s per epoch compared to 7 s for the hybrid model. This could be because the computations in LSTMs are not parallelized [34], unlike those in CNNs.
The performance of the hybrid model was similar to that of the pure CNN model: the CNN model achieved 98.1% and 96.6% accuracy during offline and real-time testing, while the hybrid model achieved 97.8% and 97.2%, respectively. LSTMs are generally known to encode time-series information [38], while CNNs generally encode spatial information [62]. However, in this study, adding LSTM layers to a CNN model for time-series classification did not significantly improve performance. This result differs from the finding of Li et al. [63], where adding LSTM layers to a CNN model significantly improved performance. The difference could be due to our limited sample size or uneven data distribution, and further testing on a larger, more balanced dataset is needed for a definitive comparison.
The classification time window influences both the accuracy and the computational cost. During model development, we also tested different classification windows based on other papers (Table 1). With a 4 s window, the accuracy improved by 2%, but the training time more than doubled. In addition, a large window size might not be practical for short-burst activities such as squatting, which typically last 1–2 s. Weighing the computational cost against the small performance gain, we selected the 1 s window across all activities; future applications could consider activity-specific time windows with faster training systems.
One of the main limitations of this study is that transition accuracy was not measured. With a window step size of 0.5 s and a window size of 1 s, we achieved a high real-time accuracy of 0.96 for both subjects, which suggests fast transition detection between activities. Rigorous testing, however, is needed to confirm the transition detection speed and accuracy in future work. In particular, for exoskeleton applications, improper timing or an inaccurate transition could lead to discomfort in the assistance. Future work also includes improving the algorithm with additional sensory modalities, such as the gyroscope and signal derivatives, to improve speed and accuracy.
Additional sensor modalities could also help accurately distinguish walking on level ground from walking on an incline. Incline walking is routine in most real-world settings, making this classification important. In our study, both the CNN and hybrid models showed lower classification accuracy between level-ground and incline walking for Sub2, as shown in Fig. 3. These misclassifications could be due to intersubject variability in walking posture and trunk angle [64]. Another possible reason is the slope at which the incline data were collected (+4 deg), which may be too small to produce a noticeable difference in the accelerometer data. Adding a three-axis gyroscope to the chest sensor, while keeping the number of sensors constant, could significantly improve the models' performance and could also help distinguish additional activities such as running on level ground versus on an incline. When we tested our models on public datasets, we observed reasonable accuracies above 0.90, similar to Refs. [65] and [66], suggesting that comparable accuracy is attainable with a single sensor. Future work includes using additional sensor modalities to improve accuracy and speed.
An additional transition state might be required for the practical implementation of this method in an exoskeleton. Since the activity recognition system observes only the current state to identify the activity, using it in an exoskeleton might delay the change in control action after an activity transition. Adding a transition system [67] between conditions, using additional sensor modalities, could help smooth the hand-off between the exoskeleton's controllers.
Future studies could add data from activities performed in outdoor environments. In this study, the running, level-ground walking, and incline walking data were collected on a treadmill. Although most exoskeletons are currently tested on treadmills, the final application of these systems will be outdoors. Hence, future studies should include both outdoor and indoor experiments to collect data and validate the applicability of the proposed HAR models.
5 Conclusion
A generalizable human activity recognition system is critical to making exoskeletons usable for day-to-day outdoor activities. This paper presents two model architectures: a CNN model and a hybrid model with CNN and LSTM layers. Both models use the data from a single three-axis accelerometer strapped to the chest to classify five activities. The two models achieved high accuracies of 98.1% and 97.8% during offline testing and 96.6% and 97.2% during real-time validation. The high accuracy suggests that the data from a single accelerometer can be used to classify activities and that the model's predictions can be passed on to an exoskeleton. Future work could include adding a three-axis gyroscope to the input data, increasing the number of activities, and integrating the system into exoskeleton assistance. Increasing the range of the accelerometer from the current −2000 to +2000 cm/s2 could also help distinguish more activities.
Acknowledgment
We are grateful for the assistance of Michael Jacobson, Inigo Sanz Pena, Sabrina Sullivan, Hyeongkeun Jeong, Gabe Griffin, Atharva Deshpande, and other UIC Rehabilitation Robotics lab members for their support. We also thank Courtney Haynes and Cortney Bradford for their guidance during the project and for their feedback on the paper.
Funding Data
US Army (Award No. W911NF2120230; Funder ID: 10.13039/100006754).
Data Availability Statement
The datasets generated and supporting the findings of this article are available from the corresponding author upon reasonable request.