Background
Parkinson’s disease (PD) is a neurodegenerative disorder that affects over six million people worldwide [1]. One of the most debilitating symptoms associated with PD is freezing of gait (FOG), which develops in approximately 70% of PD patients over the course of their disease [2, 3]. Clinically, FOG is defined as a “brief, episodic absence or marked reduction of forward progression of the feet despite the intention to walk” and is often divided into three manifestations based on leg movement: (1) trembling: tremulous oscillations in the legs at 8–13 Hz; (2) shuffling: very short steps with poor clearance of the feet; and (3) complete akinesia: no visible movement in the lower limbs [1, 4]. While one patient can experience different FOG manifestations, the distribution of manifestations varies widely among individuals, with trembling and shuffling being more common than akinetic freezing [5]. The unpredictable nature of FOG poses a significant risk of falls and injuries for PD patients [6, 7, 8], and it can also affect their mental health and self-esteem, leading to a lower quality of life [9]. To relieve the symptoms, dopaminergic medication such as levodopa is mainly used [10]. FOG occurs more commonly during Off-medication states [11], whereas FOG episodes in On-medication states are milder but may manifest differently, with more trembling [12].
To qualitatively assess FOG severity in PD patients and guide appropriate treatment, subjective questionnaires such as the Freezing of Gait Questionnaire (FOGQ) and the New Freezing of Gait Questionnaire (NFOGQ) are commonly used [13, 14]. Although these questionnaires may be sufficient to identify the presence of FOG, they are insufficient to objectively describe patients’ FOG severity and capture treatment effects, as they suffer from recall bias [15]: patients may not be fully aware of their freezing severity, frequency, or impact on daily life. These questionnaires are also poorly suited for intervention studies because their large test–retest variability results in large minimal detectable change values [15]. To objectively assess FOG severity, PD patients are asked to perform brief, standardized FOG-provoking tasks in clinical centers. Common tasks include the timed-up-and-go (TUG) [16], 180- or 360-degree turning while walking [17], and 360-degree turning-in-place (360Turn) [18]. The TUG is commonly used in clinical practice since it includes typical everyday motor tasks such as standing, walking, turning, and sitting; in combination with a cognitive dual-task, it has proven to provoke FOG reliably [19]. Recently, the 360Turn with a cognitive dual-task was also shown to be a practical and reliable way to provoke FOG for investigating therapeutic effects on FOG [20]. Adding a cognitive dual-task to both the TUG and 360Turn tests increases the cognitive load on individuals, which can result in more FOG events, making these tests more sensitive and perhaps more relevant measures of FOG severity in real-life situations [17, 19, 20].
The current gold standard to assess FOG severity during the standardized FOG-provoking tasks is a post-hoc visual analysis of video footage [17, 21, 22]. This protocol requires experts to label FOG episodes and the corresponding FOG manifestations frame by frame [22]. Based on the frame-by-frame annotations, semi-objective FOG severity outcomes can be computed, such as the number of FOG episodes (#FOG) and the percentage of time spent frozen (%TF), defined as the cumulative duration of all FOG episodes divided by the total duration of the walking task [23]. However, this procedure relies on time-consuming and labor-intensive manual annotation by trained clinical experts. Moreover, the inter-rater agreement between experts is not always strong [23], and the annotated #FOG can also differ substantially between raters because multiple short FOG episodes are inconsistently pooled into longer episodes [20].
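As a concrete illustration of how these outcomes follow from the frame-by-frame annotations, the short Python sketch below computes %TF and #FOG from a hypothetical frame-level binary annotation; the function name and example labels are ours, not part of the original annotation protocol.

```python
import numpy as np

def fog_severity_outcomes(frame_labels):
    """Compute %TF and #FOG from a frame-by-frame binary FOG annotation,
    where 1 marks a frame labeled as FOG and 0 a frame without FOG."""
    labels = np.asarray(frame_labels, dtype=int)
    # %TF: cumulative duration of all FOG episodes divided by trial duration.
    percent_tf = 100.0 * labels.sum() / len(labels)
    # #FOG: each 0 -> 1 transition starts a new FOG episode.
    starts = np.diff(np.concatenate(([0], labels))) == 1
    return percent_tf, int(starts.sum())

# Hypothetical 20-frame trial with two FOG episodes covering 25% of the frames.
example = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(fog_severity_outcomes(example))  # (25.0, 2)
```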
As a result, there is an interest in automated and objective approaches to assess FOG [5, 24–27]. Traditionally, automatic approaches detect FOG segments based on a predefined threshold for high-frequency spectra of the leg acceleration [28]. These techniques, however, are not specific to FOG, as they also produce positive values for PD patients without FOG and even for healthy controls [29]. Additionally, since these techniques rely on rapid leg movements, they may not detect episodes of akinetic FOG. As gait in PD is highly variable, there is increasing interest in deep learning (DL) techniques to model FOG [24, 27, 30–32]. Owing to their large parametric space, DL techniques can infer relevant features directly from the raw input data. As such, our group recently developed a new DL-based algorithm using marker-based 3D motion capture (MoCap) data [27]. However, marker-based MoCap is cumbersome to set up and is constrained to lab environments. Inertial measurement units (IMUs), owing to their better portability, have therefore often been used to capture motion signals both in the lab and at home [33, 34] and are widely used for traditional sensor-based assessment of FOG [24, 31, 35, 36]. The multi-stage temporal convolutional neural network (MS-TCN) is one of the current state-of-the-art DL models, initially designed for frame-by-frame sequence mapping in computer vision tasks [37]. The MS-TCN architecture generates an initial prediction using multiple temporal convolution layers and subsequently refines this prediction over multiple stages. In a recent study, a multi-stage graph convolutional neural network was developed specifically for 3D MoCap-based FOG detection, and this work demonstrated that the refinement stages within the model effectively mitigate over-segmentation errors encountered in FOG detection tasks [27]. These errors manifest as long FOG episodes being predicted as multiple short FOG episodes, which impacts the FOG detection performance of DL models. Acknowledging the necessity of mitigating such errors, approaches such as the post-processing step employed in [24] also smoothed and merged short FOG episodes in the predicted FOG annotations generated by DL models. Consequently, a post-processing step on the FOG annotations produced by DL models emerges as an essential component.
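To make the role of such a post-processing step concrete, the following Python sketch gives a minimal, illustrative implementation of merging and smoothing a frame-wise FOG prediction; it is not the exact procedure used in [24] or [27], and the frame thresholds are arbitrary placeholders.

```python
import numpy as np

def runs_of(labels, value):
    """Return (start, end) index pairs (end exclusive) of consecutive runs of `value`."""
    mask = (np.asarray(labels) == value).astype(int)
    edges = np.diff(np.concatenate(([0], mask, [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))

def postprocess_fog(pred, min_gap=5, min_episode=5):
    """Illustrative over-segmentation clean-up of a binary frame-wise FOG prediction."""
    pred = np.asarray(pred, dtype=int).copy()
    # 1) Merge: fill short non-FOG gaps enclosed by FOG, so that one long episode
    #    fragmented by the model is counted as a single episode again.
    for start, end in runs_of(pred, 0):
        if end - start < min_gap and start > 0 and end < len(pred):
            pred[start:end] = 1
    # 2) Smooth: discard FOG segments that remain shorter than `min_episode` frames.
    for start, end in runs_of(pred, 1):
        if end - start < min_episode:
            pred[start:end] = 0
    return pred
```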
Previous studies proposed automatic FOG detection models for FOG assessment in clinical settings by training and evaluating DL models on datasets that include multiple standardized FOG-provoking tasks measured during both On- and Off-medication states [24, 27, 31, 38]. However, given the widespread clinical use of the 360Turn task, it has not yet been investigated whether DL models can adequately detect FOG in this task, which forms the first research gap. Additionally, whether training task-specific and medication-specific models enables better FOG detection performance than a model trained on multiple tasks and both medication states has not been examined in the literature, which forms the second gap.
Moreover, gait patterns and FOG severity can vary substantially among different FOG-provoking tasks [39] and medication states [40, 41]. Prior studies have examined the impact of medication states on FOG. For instance, the researchers in [42] trained a model on a combined dataset of Off- and On-medication trials and then assessed the model’s performance on each medication state independently, aiming to understand how automatically detected FOG outcomes respond to medication conditions known to influence FOG severity. Similarly, in [43], it was investigated whether dopaminergic therapy affected the system’s ability to detect FOG. However, these studies have yet to explore how DL models perform when detecting FOG in an unseen medication state compared to a model trained specifically on data collected in that medication state, which forms the third research gap. Here, “unseen” refers to conditions not included in the model’s training, such as training a model on 360Turn and evaluating its performance on TUG, or training exclusively on On-medication data and testing on Off-medication data. This gap is critical for evaluating the generalizability of DL models, probing whether their learned features can be robustly applied to new and unseen conditions, and ultimately addressing the model’s adaptability beyond its original training context.
Additionally, although these standardized FOG-provoking tasks include walking and turning movements similar to those in real-life conditions, they do not include sudden volitional stops, which frequently occur during daily activities at home. Hence, it becomes crucial to distinguish between FOG and volitional stops when transitioning toward at-home FOG assessment. These volitional stops usually do not involve any lower-limb movement and are often considered challenging to distinguish from akinetic freezing [44]. Although a previous study proposed using physiological signals, such as electrocardiography, to derive discriminative features for classifying FOG versus voluntary stops [45], methods using motor signals to distinguish FOG from stops have seldom been investigated. To the best of our knowledge, only a few studies have proposed FOG detection or prediction on trials with stops using IMU signals [31, 46]. However, while these studies developed models to detect FOG from data that contain voluntary stopping, they did not address the effect of including or excluding stopping instances during the model training phase on FOG detection performance, forming the fourth research gap.
To address the aforementioned gaps, this paper first introduces a FOG detection model that enables automatic FOG assessment on two standardized FOG-provoking tasks (i.e., the TUG task and the 360Turn task) based on IMUs. The model comprises an initial prediction block that generates preliminary FOG annotations and a subsequent prediction refinement block designed to mitigate over-segmentation errors. Next, we evaluated whether a DL model trained for a specific task (TUG or 360Turn) or a specific medication state (Off or On) could detect FOG better than a DL model trained on all data; in essence, our aim was to ascertain whether DL models require training on task-specific or medication-state-specific data. Subsequently, we evaluated the FOG detection performance of DL models when applied to tasks or medication states that were not included during the model training phase, in order to assess the generalizability of DL models across unseen tasks or medication states. Finally, we investigated the effect of including or excluding stopping periods on detecting FOG by introducing self-generated and researcher-imposed stops during the standardized FOG-provoking tests. Both self-generated and researcher-imposed stops are hereinafter simply referred to as “stopping”. To this end, the contribution of the present manuscript is fourfold:
1. We propose a FOG detection model for fine-grained FOG detection on IMU data, demonstrating its ability to generalize effectively across two distinct tasks and accommodate both medication states.
2. We show, for the first time, that FOG can be automatically assessed during the 360Turn task.
3. We show that the DL model cannot generalize to an unseen FOG-provoking task, thereby highlighting the importance of expressive training data in the development of FOG assessment models.
4. We show that the DL model can assess FOG severity with strong agreement with experts across FOG-provoking tasks and medication states, even in the presence of stopping.
The study primarily focuses on evaluating the performance of a state-of-the-art model under different conditions, including different tasks, medication states, and stopping conditions, rather than introducing a novel FOG detection model. A comparison of various FOG detection models is provided in the Appendix.
Discussion
This is the first study to show that a DL model using only five lower-limb IMUs can automatically annotate FOG episodes frame by frame in a manner that matches how clinical experts annotate videos. Additionally, this study is the first to assess the FOG detection performance of a DL model during the dual-task 360Turn task, recently proposed as one of the most effective FOG-provoking tasks [20]. Two clinical measures, %TF and #FOG [23], were computed to evaluate the FOG severity predicted by the DL model trained for the clinical setting (Model_Clinical). Model_Clinical showed strong agreement with the experts’ observations for %TF (ICC = 0.92) and #FOG (ICC = 0.95). In previous studies, the ICC between independent raters on the TUG task was reported to be 0.87 [63] and 0.73 [23] for %TF and 0.63 [23] for #FOG, while for the 360Turn task, the ICC between raters was reported to be 0.99 for %TF and 0.86 for #FOG [20]. While the ICC values in previous studies varied depending on the specific tool and population being studied [64], our proposed model achieved similar levels of agreement. This holds significant promise for future AI-assisted FOG annotation work, whereby the DL model annotates FOG episodes first and the clinical expert verifies and adjusts only where required. Despite the high agreement with the experts, the results showed that the model statistically overestimated FOG severity, with a higher %TF and #FOG than the experts when evaluating all trials. The overestimation of %TF and #FOG was partly due to false positives (FP) in which FOG-related movements, such as shuffling and festination, were predicted as FOG segments. This systematic overestimation resulted in relatively low F1 scores while maintaining a high ICC. Given that these FOG-related movements often lie on the boundary between freezing and non-freezing [45], it can be challenging for the model to annotate and categorize them in a manner consistent with nearby FOG episodes.
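As an aside on how such agreement values can be obtained, the sketch below computes an ICC between expert and model %TF scores with the pingouin package on hypothetical per-trial values; the specific ICC formulation used in this study is not restated here, so the two-way random, single-rater form (ICC2) is shown purely as an example.

```python
import pandas as pd
import pingouin as pg

# Hypothetical per-trial %TF values scored by the clinical expert and by the model.
df = pd.DataFrame({
    "trial": list(range(8)) * 2,
    "rater": ["expert"] * 8 + ["model"] * 8,
    "tf": [0.0, 12.5, 30.1, 4.2, 55.0, 8.7, 21.3, 0.0,   # expert annotations
           1.1, 14.0, 28.5, 6.0, 57.2, 9.9, 24.8, 0.0],  # model predictions
})

icc = pg.intraclass_corr(data=df, targets="trial", raters="rater", ratings="tf")
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "CI95%"]])
```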
This study aimed to assess the generalization capabilities of DL models across various tasks and medication states by comparing a model trained on all tasks and medication states (referred to as Model_Clinical) against task-specific and medication-specific models (Model_TUG, Model_360Turn, Model_Off, and Model_On). Our results showed that task- and medication-specific models performed better than the general model, though these effects were not statistically significant. Moreover, when comparing the performance of the general model across tasks and medication states, our results showed that it was more difficult for the model to detect FOG in 360Turn tests than in TUG tests in terms of the average Segment-F1@50 and Sample-F1, and that it was more difficult for the model to detect FOG in Off-medication tests than in On-medication tests. Despite evaluating Model_Clinical on both tasks and medication states, our model exhibited relatively lower F1 scores compared to those reported in the FOG detection literature [32, 51, 65]. This discrepancy can be attributed to the challenging nature of our dataset, which contains a higher proportion of short FOG episodes, with 41.84% lasting less than 1 s; in comparison, the CuPiD dataset [66] has a proportion of 5.06%, while the dataset from [24] reports no such short episodes. When comparing our FOG detection models with those proposed in the literature, detailed in the Appendix, we observed that these models struggled to detect FOG properly in our dataset, exhibiting lower Sample-F1 scores than our model. This disparity suggests that our dataset poses greater difficulty for automatic annotation, possibly due to the prevalence of numerous short episodes.
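For readers unfamiliar with the segment-level metric, the following sketch outlines the segmental F1@k commonly used in the action-segmentation literature, here specialized to a binary FOG/non-FOG sequence with a 50% IoU threshold; the matching rule and helper names are our own simplification rather than the exact evaluation code of this study.

```python
import numpy as np

def fog_segments(labels):
    """(start, end) pairs (end exclusive) of the FOG segments in a binary sequence."""
    edges = np.diff(np.concatenate(([0], np.asarray(labels, dtype=int), [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))

def segment_f1_at_k(pred, gt, iou_threshold=0.5):
    """Segmental F1@k: a predicted FOG segment is a true positive if it overlaps an
    unmatched ground-truth FOG segment with IoU at or above the threshold."""
    pred_segs, gt_segs = fog_segments(pred), fog_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for ps, pe in pred_segs:
        ious = [
            0.0 if matched[j] else
            max(0, min(pe, ge) - max(ps, gs)) / (max(pe, ge) - min(ps, gs))
            for j, (gs, ge) in enumerate(gt_segs)
        ]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_threshold:
            tp, matched[best] = tp + 1, True
        else:
            fp += 1
    fn = len(gt_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```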
Our next evaluation focused on determining the extent to which a DL model trained exclusively on a single FOG-provoking task or medication state could generalize to unseen FOG-provoking tasks or medication states. Results showed that a model trained on one FOG-provoking task (i.e., TUG or 360Turn) detected FOG in that task better than a model not trained on it. Additionally, although previous studies have shown that gait patterns are altered after anti-Parkinsonian medication [40, 41], our results showed that a model trained on one medication state could still detect FOG in the other medication state. As a result, we recommend caution when applying DL-based FOG assessment models to FOG-provoking tasks they were not explicitly trained on, whereas applying models trained on a different medication state did not show such discrepancies. This also has implications for future work toward daily-life FOG detection: training data needs to be diversified across the activities encountered during daily life. On the other hand, diversifying training data across medication states appears unnecessary, which makes data collection more feasible, as future data can be collected in the On-medication state.
While existing approaches utilized DL models to detect FOG in standardized FOG-provoking tasks with IMUs [24, 31], the models’ ability to distinguish FOG from stopping remained undetermined, which is critical for free-living assessment [45]. Therefore, voluntary and instructed stops were introduced into the standardized FOG-provoking tasks. When evaluating trials without stops, results showed no difference between the model trained without stops and the model trained with stops, indicating that adding stopping periods to the training data does not impair the DL model’s ability to detect FOG. Additionally, when evaluating trials with stops, results showed that the model trained with stops produced fewer false positives during stopping than the model trained without stops (16 versus 74). While stops were considered difficult to distinguish from FOG using movement-related signals, especially for akinetic FOG [67], our model could still detect FOG in the presence of stops. Moreover, our results highlight the importance of including stopping in the training data.
Although this study has provided valuable insights, some limitations must be acknowledged. The first limitation is that the videos in our dataset were annotated sequentially by two clinical experts: the first annotator’s work was verified and, if needed, corrected by the second annotator. As a result, we could not compute an inter-rater agreement against which to compare our models’ annotations. However, the literature shows that inter-rater agreement ranges from 0.39 to 0.99 [20, 23, 24, 27, 35, 63] and that the differences between experts were sometimes due to minor differences between FOG and FOG-related movements. Our DL model’s agreement with the clinical experts exceeded these previously published inter-rater agreements, and, just as between experts, most of our model’s mispredicted FOG segments were marked as FOG-related segments by the experts. Future work could investigate the development of DL models that can better differentiate between FOG and FOG-related events; on the other hand, whether such differentiation is truly needed depends on the research or clinical question. The second limitation is that this study simulated free-living situations by asking patients to stop while performing standardized FOG-provoking tasks. Yet, free-living movement will contain substantially more variance (e.g., daily activities) than captured during our standardized tasks. Moreover, FOG severity during our tasks does not necessarily represent FOG severity in daily life [44, 68]. Therefore, future work should establish the reliability of our approach on data measured in free-living situations. The third limitation is that this study showed that training DL models with trials that include stopping resulted in better performance in detecting FOG in trials that include stopping; however, whether DL models can distinguish between FOG and stopping for all manifestations of FOG (e.g., akinetic FOG) remains to be investigated. The fourth limitation is our choice of the complete sensor configuration, which includes all five IMUs. Previous research has compared various IMU positions and recommended an optimal technical setup comprising only three IMUs (specifically, lumbar and both ankles) [24]. We included the performance of models trained with the 3-IMU configuration in the Appendix; the results demonstrate no significant difference between the performance of models trained with five IMUs and with three IMUs. However, additional research is required to definitively establish the ideal sensor configuration for effective FOG detection in home environments. The fifth limitation is the small number of participants compared to other use cases in the DL literature. As this study evaluated the model with the LOSO cross-validation approach, the results nevertheless showed that the model could generalize learned features to unseen subjects. Moreover, despite the small number of subjects, the number of samples and FOG events in the dataset used in this study is comparable with the literature [27, 31]. Future studies could evaluate automatic FOG assessment on larger datasets or across datasets. The sixth limitation is that the recruited PD patients subjectively reported having at least one FOG episode per day with a minimum duration of 5 s. While the proposed model works for these severe freezers, it remains to be verified whether the model also generalizes to mild freezers.
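As a sketch of the evaluation protocol referred to here, the snippet below shows how a leave-one-subject-out split can be generated with scikit-learn; the array shapes and subject IDs are placeholders, not the actual dataset of this study.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: one IMU window per row, with the subject each window belongs to.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 64))            # windows x features
y = rng.integers(0, 2, size=120)              # window-level FOG labels
subjects = np.repeat(np.arange(12), 10)       # 12 subjects, 10 windows each

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups=subjects)):
    # Train on all subjects except one and evaluate on the held-out subject.
    held_out = np.unique(subjects[test_idx])[0]
    print(f"fold {fold}: held-out subject {held_out}, "
          f"{len(train_idx)} train / {len(test_idx)} test windows")
```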