1 Introduction

Recent reports from the World Health Organization (WHO) point to the global epidemic status that obesity has reached, with the affected population having doubled worldwide since 1980. In particular, overweight and obesity are two of the most prevalent preventable causes of death, alongside smoking tobacco and sexually transmitted diseases, and have been responsible for over 2.5 million deaths per annum since 2001 [11]. Thus, the ability to unobtrusively monitor eating behavior plays a key role in the study and treatment of obesity.

Several devices have been introduced specifically for measuring meal eating behavior, e.g. by weight scale [8] or based on sound [9]. In this paper, we are interested in detecting eating moments during the course of a meal using general purpose IMU sensors. This enables us to automatically measure in-meal eating behavior in terms of number of bites, bite frequency and bite frequency acceleration or deceleration, thus approximating the food intake curve of [8].

Several approaches use multiple sensors to achieve high detection accuracy. In particular, the work of [1] involves the use of multiple body-mounted accelerometers with the goal of detecting eating-related gestures, whereas the authors of [6] combine a number of audio and motion sensors in order to detect bites and estimate intake weight. The main drawback of these methods, however, is their low usability compared to using a single, commercially available device.

Less obtrusive approaches that employ the IMU sensors of a single smartwatch also exist. Specifically, the authors of [12] propose the dissection of a feeding gesture into two sub-feeding movements, namely food-to-mouth and back-to-rest. Following the authors’ proposed gesture recognition scheme, a clustering approach is used to detect the final eating moments, resulting in an F1 score of 0.757 on a laboratory-controlled dataset. The work of [10] makes use of the sequential dependency between a small number of gestures leading to a bite of food. Moreover, the authors propose the use of Hidden Markov Models (HMM) to capture the temporal evolution of eating. The results show the high performance of the proposed approach on manually segmented sequences in a large dataset. However, no results on non-segmented sequences are presented. A gyroscope-based approach is introduced in [2]. The authors make use of a characteristic wrist roll pattern that is exhibited during a meal to detect biting moments.

In our previous work [4], we showed how classification of hand movements into five meal-related gestures, followed by two discrete HMMs, can be used to characterize a food intake cycle. In this paper, we improve on this approach by modeling hand micro-movements as an SVM score vector and by subsequently using an LSTM network to classify each sequence as an intake or non-intake cycle. Experimental results on our publicly available Food Intake Cycle (FIC) dataset show the effectiveness of this method.

Following the introduction, Sect. 2 introduces the terminology and presents the steps of the method towards the detection of food intake cycles. Information about the dataset is presented in Sect. 3, whereas Sect. 4 presents the conducted experiments and their results. Finally, Sect. 5 concludes the paper.

2 Proposed Approach

The work presented in this paper aims at identifying food intake cycles during a meal session. Each food intake cycle consists of a series of hand micro-movements. The relation between meal session, food intake cycle and micro-movement is depicted in Fig. 1.

In its ideal form, a food intake cycle starts by manipulating a utensil to pick up food from a plate, continues with an upwards movement of the hand operating the utensil towards the mouth, followed by inserting the food in the mouth, and concludes with a downwards motion of the hand away from the mouth. However, in real meals we observe repetitions of certain hand movements, unrecognized hand movements, or no hand movement at all. In the same context, the term micro-movement is used to describe a hand movement of limited duration that is related to the food intake cycle. A typical micro-movement example is the upwards movement of the hand operating the utensil from the plate towards the mouth. The micro-movements that we used in this study originate from the FIC dataset and are presented in Table 1.

Table 1. Table listing the selected micro-movements

The proposed method uses the acceleration and gyroscope signals of a smartwatch with the purpose of detecting food intake moments within a meal session. An array of binary (one-versus-one) SVMs is used to represent the initial signals as micro-movement score vectors, whereas an LSTM network is used to classify sequences of micro-movement score vectors as intake or non-intake cycles. An overview of the proposed system architecture is presented in Fig. 2.

2.1 Data Pre-processing

Initially, the synchronized 3D accelerometer (\(a_{x}[n],a_{y}[n],a_{z}[n]\)) and gyroscope (\(g_{x}[n],g_{y}[n],g_{z}[n]\)) sensor streams of a meal session are individually smoothed by a \(5^{th}\) order median filter. Furthermore, since the accelerometer sensor captures both the acceleration caused by the hand’s movement as well as the acceleration due to the earth’s gravitational field, the next step is to remove the gravity from the acceleration signal. To this end, we use the method proposed by [5]. More specifically, the gyroscope samples are used to estimate the rotation of the smartwatch with respect to a reference frame. We use the first sample as the reference frame (i.e. the position of the smartwatch when recording starts). Then, by assuming that the smartwatch is initially still, gravity can be removed by subtracting the first acceleration sample from the rotated sequence.
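The two pre-processing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [5]: the "\(5^{th}\) order" filter is taken to mean a 5-sample window, the gyroscope is assumed to report angular velocity in rad/s in the sensor frame, the sample rate is assumed to be a fixed 62 Hz, and all function names are hypothetical.

```python
import numpy as np

def median_filter(x, kernel=5):
    """Sliding median along the time axis of an (N, 3) stream.
    Edges are handled by repeating the border samples (an assumption)."""
    pad = kernel // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel, axis=0)
    return np.median(windows, axis=-1)

def rotation_from_gyro(omega, dt):
    """Rodrigues' formula: rotation produced by angular velocity omega over dt."""
    rate = np.linalg.norm(omega)
    theta = rate * dt
    if theta < 1e-12:
        return np.eye(3)
    k = omega / rate
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def remove_gravity(acc, gyro, fs=62.0):
    """Rotate every acceleration sample into the frame of the first sample,
    then subtract the first sample (assumed to contain gravity only)."""
    dt = 1.0 / fs
    R = np.eye(3)                      # current sensor frame -> reference frame
    rotated = np.empty_like(acc, dtype=float)
    rotated[0] = acc[0]
    for n in range(1, len(acc)):
        R = R @ rotation_from_gyro(gyro[n - 1], dt)
        rotated[n] = R @ acc[n]
    return rotated - acc[0]
```

Simple integration of the gyroscope accumulates drift over long recordings, so this sketch is only meant to make the order of operations concrete.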

Fig. 1. Segmentation of a meal session (solid line) into intake cycles (shaded area) and micro-movements (dotted line).

2.2 Feature Extraction

Given the pre-processed accelerometer and gyroscope streams, feature extraction is performed by extracting frames of length \(w_{l}\) and step \(w_{s}\), corresponding to 0.2 and 0.1 s respectively. Let \(\varvec{w}^{i}_{a_{x}}\) be the i-th extracted frame from the \(a_{x}[n]\) channel of the accelerometer signal. For each \(\varvec{w}^{i}_{a_{x}}\) a number of both time and frequency domain features are calculated, including (i) the number of zero crossings, (ii) the mean, (iii) the standard deviation, (iv) the variance, (v) the maximum and minimum values, (vi) the range of values, (vii) the normalized energy and (viii) the first \(\frac{w_{l}}{2} + 1\) Discrete Fourier Transform coefficients. These features are also extracted for the rest of the accelerometer and gyroscope channels. Furthermore, the simple moving average is also calculated by \(SMA_{a}^{i}=\frac{1}{w_{l}}\sum _{j=k}^{w_{l} + k} \left( |w^{i}_{a_{x}}[j]| + |w^{i}_{a_{y}}[j]| + |w^{i}_{a_{z}}[j]| \right)\) for the acceleration stream and in a similar manner for the gyroscope. The result of feature extraction is the representation of the \(a_{x}[n]\), \(a_{y}[n]\), \(a_{z}[n]\), \(g_{x}[n]\), \(g_{y}[n]\) and \(g_{z}[n]\) time series as a series of L-dimensional feature vectors \(\varvec{f}_{i}\).
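A possible NumPy sketch of this framing and feature computation is given below. Several details are assumptions made for illustration only: a 62 Hz sample rate (so that 0.2 s and 0.1 s round to 12 and 6 samples), energy normalised by the frame length, DFT coefficients represented by their magnitudes, and the SMA summed over all samples of the frame. The function names are hypothetical.

```python
import numpy as np

def channel_features(w):
    """Features (i)-(viii) for one 1-D frame w of a single sensor channel."""
    signs = np.signbit(w)
    zero_crossings = np.count_nonzero(signs[1:] != signs[:-1])
    energy = np.sum(w ** 2) / len(w)       # assumed normalisation
    dft = np.abs(np.fft.rfft(w))           # len(w)//2 + 1 magnitudes (assumed)
    return np.concatenate((
        [zero_crossings, w.mean(), w.std(), w.var(),
         w.max(), w.min(), w.max() - w.min(), energy],
        dft,
    ))

def sma(frame3):
    """Simple moving average of an (n, 3) frame: summed |x|+|y|+|z| over n."""
    return np.sum(np.abs(frame3)) / frame3.shape[0]

def extract_features(acc, gyro, fs=62.0, w_l=0.2, w_s=0.1):
    """Return one feature vector per frame for (N, 3) acc and gyro streams."""
    length = int(round(w_l * fs))
    step = int(round(w_s * fs))
    feats = []
    for start in range(0, len(acc) - length + 1, step):
        a = acc[start:start + length]
        g = gyro[start:start + length]
        per_channel = [channel_features(s[:, c]) for s in (a, g) for c in range(3)]
        feats.append(np.concatenate(per_channel + [[sma(a), sma(g)]]))
    return np.array(feats)
```

Under these assumptions each frame yields 15 values per channel plus two SMA values, i.e. a 92-dimensional vector; the dimension L used in the paper is not stated here and may differ.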

2.3 Modeling the Micro-movements

From the list of micro-movements of Table 1, we observed that class O exhibits high intra-class variance, since it is used to represent every hand movement other than P, U, D, M and N. As a result, all extracted features belonging to the O class are excluded from the learning procedure. The micro-movement learning process is achieved by employing an array of one-versus-one SVM classifiers with the Radial Basis Function (RBF) kernel. Given the features belonging to the five classes of interest, a total of ten one-versus-one classifiers are trained. In addition, since some micro-movements are inherently longer in duration than others (e.g. P and N), all classes are weighted according to their prior probabilities. Finally, prior to training, all features are linearly scaled to [0, 1]. Given the trained SVM models, each feature vector \(\varvec{f}_{i}\) extracted as in Sect. 2.2 is converted into a 10-dimensional vector \(\varvec{s}_{i}\) composed of the pair-wise prediction scores of the 10 one-versus-one SVM classifiers.
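One way to obtain such 10-dimensional score vectors is sketched below with scikit-learn; the paper does not state which SVM library was used, so this choice is an assumption. `class_weight="balanced"` (weights inversely proportional to class frequency) is used as a stand-in for the prior-based weighting, the values of C and \(\gamma\) are those reported in Sect. 4, and the function names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def train_micro_movement_svm(features, labels, C=100.0, gamma=0.1):
    """Fit a [0, 1] scaler and an RBF one-versus-one SVM on all classes but O."""
    labels = np.asarray(labels)
    keep = labels != "O"                       # class O is excluded from training
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(features[keep])
    svm = SVC(kernel="rbf", C=C, gamma=gamma,
              class_weight="balanced",         # assumed form of class weighting
              decision_function_shape="ovo")   # expose the pair-wise scores
    svm.fit(scaler.transform(features[keep]), labels[keep])
    return scaler, svm

def score_vectors(scaler, svm, features):
    """Map feature vectors to pair-wise SVM scores: 5 classes give 10 columns."""
    return svm.decision_function(scaler.transform(features))
```

With five classes, `decision_function` returns one column per class pair, i.e. \(5 \cdot 4 / 2 = 10\) scores per frame.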

Fig. 2. Overview of the proposed system.

2.4 Learning the Food Intake Sequences

We designed an LSTM network with the purpose of classifying sequences of \(\varvec{s}_{i}\) as intake or non-intake cycles. The LSTM network is an extension of the Recurrent Neural Network (RNN) specifically designed to solve the long-term dependency and vanishing gradient problems, thus giving it the ability to effectively model long intra-dependent sequences such as micro-movement sequences. In contrast with Markov models, where the current state depends solely on the previous state in time, LSTM networks use a combination of input, output and forget gates to retain information over a long period. They can therefore model more effectively intake sequences that differ greatly from the ideal intake sequence due to the insertion of non-intake-related micro-movements between intake-related micro-movements.

The proposed network’s architecture consists of two consecutive LSTM layers with 128 hidden cells each, followed by a fully connected output layer with a single neuron. For the activation function of the recurrent steps we used the hard sigmoid defined as \(\sigma (x) = \max (0,\min (1,0.2\,x + 0.5))\), while for the output layer we used the sigmoid function. In a compact notation, the network can be written as \(L(128)-L(128)-D(1)\), where \(L(k_{1})\) represents an LSTM layer with \(k_{1}\) hidden cells and \(D(k_{2})\) a fully connected layer with \(k_{2}\) neurons. The reason for using two LSTM layers stems from the work of Karpathy et al. [3], where the authors have shown that using a depth of at least two recurrent layers is beneficial when learning sequences.
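For reference, the hard sigmoid is a piecewise-linear approximation of the logistic sigmoid that saturates at 0 and 1 outside \([-2.5, 2.5]\). A small NumPy sketch (function names are illustrative) makes the comparison concrete:

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear gate activation: max(0, min(1, 0.2 * x + 0.5))."""
    return np.clip(0.2 * np.asarray(x, dtype=float) + 0.5, 0.0, 1.0)

def sigmoid(x):
    """Logistic sigmoid, as used for the single output neuron."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

x = np.array([-3.0, -2.5, 0.0, 1.0, 2.5, 3.0])
gate_values = hard_sigmoid(x)      # clipped to exactly 0 or 1 beyond |x| = 2.5
smooth_values = sigmoid(x)         # approaches 0 and 1 only asymptotically
```

Both functions agree at \(x = 0\), where they equal 0.5; the hard variant is cheaper to evaluate because it avoids the exponential.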

Both intake and non-intake sequences are introduced to the network during training. Given the true label corresponding to each \(\varvec{s}_{i}\), a sequence of \(\varvec{s}_{i}, i=1,\,2,\,\ldots ,\, n_{j}\) is considered an intake cycle if it starts with P (the first P in a sequence of P labels), ends with D (the last D in a sequence of D labels) and contains at least one M micro-movement. On the other hand, the remaining sequences that appear between consecutive intake cycles are considered non-intake cycles. We then represent each intake and non-intake sequence by its corresponding \(n_{j} \times 10\) SVM score matrix. Since the input sequence of each LSTM layer is required to have a constant length, each sequence was pre-padded with zeros to a size \(n' \times 10\), where \(n'=\max \{n_{j}:j=1,2,3\ldots \}\). Thus, the input is long enough to contain every intake or non-intake sequence in the corpus. We used binary cross-entropy loss with the RMSprop optimizer (with a learning rate of \(10^{-3}\)), which has demonstrated high effectiveness in a recurrent network topology [7]. Finally, the network is trained using a batch size M equal to 32 for 5 epochs.
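The pre-padding step can be illustrated with a short NumPy sketch. The function name and the choice to keep only the most recent rows of a sequence longer than the target length are assumptions made for this example.

```python
import numpy as np

def pre_pad(sequences, target_len=None):
    """Stack variable-length (n_j, 10) score matrices into one
    (num_sequences, n', 10) array, with zeros placed before the data."""
    n_max = target_len or max(len(s) for s in sequences)
    dim = sequences[0].shape[1]
    batch = np.zeros((len(sequences), n_max, dim))
    for j, s in enumerate(sequences):
        s = s[-n_max:]                      # truncate from the front if too long
        batch[j, n_max - len(s):] = s       # data occupies the last len(s) rows
    return batch

# Two toy sequences of 2 and 4 score vectors are padded to n' = 4.
toy = [np.ones((2, 10)), np.ones((4, 10))]
padded = pre_pad(toy)
```

Placing the zeros before the data means the last time steps seen by the recurrent layers always contain real scores, regardless of the original sequence length.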

2.5 Food Intake Cycle Detection

Given the trained LSTM network and a sequence of \(\varvec{s}_{i}\) that represents a meal session, intake cycle detection is performed by extracting 3 s frames from the sequence of \(\varvec{s}_{i}\) with a step of 0.2 s. The extracted frames are then pre-padded with zeros to the target size \(n' \times 10\) and given as input to the LSTM network. The network output d[m] (i.e. the output of the sigmoid function) represents the normalized probability that an input frame is an intake cycle. Subsequently, by replacing with zeros the elements of the d[m] series that are lower than a threshold \(T_{d}\), the filtered series \(d'[m]\) is created. Finally, food intake cycles are detected by performing a local maximum search in \(d'[m]\), with the minimum distance between two successive peaks set to 3 s. In particular, the timestamp corresponding to each local maximum (i.e. intake cycle) is the timestamp of the middle of the frame that produced the local maximum.
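The thresholding and peak search can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the greedy strongest-first suppression used to enforce the minimum peak distance is one common choice, the first frame is assumed to start at \(t = 0\), and the default \(T_{d}\) is the value reported in Sect. 4.

```python
import numpy as np

def detect_intake_cycles(d, t_d=0.89, step_s=0.2, frame_s=3.0, min_dist_s=3.0):
    """Return timestamps (in seconds) of detected intake cycles from the
    per-frame network outputs d[m]."""
    d = np.asarray(d, dtype=float)
    d_f = np.where(d >= t_d, d, 0.0)            # d'[m]: zero out low outputs
    min_dist = int(round(min_dist_s / step_s))  # 3 s at 0.2 s steps = 15 frames
    # Candidate peaks: non-zero samples not smaller than the left neighbour
    # and strictly larger than the right neighbour.
    left = np.r_[-np.inf, d_f[:-1]]
    right = np.r_[d_f[1:], -np.inf]
    cand = np.flatnonzero((d_f > 0) & (d_f >= left) & (d_f > right))
    # Greedy selection: strongest peak first, drop peaks closer than min_dist.
    kept = []
    for m in cand[np.argsort(-d_f[cand], kind="stable")]:
        if all(abs(m - k) >= min_dist for k in kept):
            kept.append(m)
    kept = np.sort(np.array(kept, dtype=int))
    return kept * step_s + frame_s / 2.0        # middle of each winning frame
```

For example, with outputs of 0.95 at frame 5, 0.99 at frame 8 and 0.92 at frame 30, frame 5 is suppressed by the stronger peak at frame 8, and the detections fall at 3.1 s and 7.5 s.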

3 Dataset

In this study we used our publicly available FIC dataset. The FIC dataset consists of recordings from 10 subjects performing one meal session each, with an average duration of 13.2 min, in the restaurant of Aristotle University of Thessaloniki. The accelerometer and gyroscope streams originate from the Microsoft Band 2 smartwatch and are provided at a sample rate of approximately 62 Hz. The ground truth is provided at a micro-movement level based on analysis of video sequences captured during each subject’s meal session. No specific instructions were given to the participating subjects other than clapping their hands once in the beginning and once in the end of the session for video/smartwatch synchronization purposes. Thus, the participants were able to engage in activities such as talking to other individuals in their proximity, during the recording. Table 2 provides additional information regarding the appearances of micro-movements in the dataset. Additionally, the average food intake cycle duration (from P to D) and the average distance between two consecutive food intakes were \(5.39\,(\pm 3.86)\) and \(11.22\,(\pm 8.79)\) s, respectively.

Table 2. Details of the exhibited micro-movements in the food intake cycle dataset

4 Experiments and Results

Given the true start and end moments of the i-th food intake cycle, \(t^{i}_{s}\) and \(t^{i}_{e}\) respectively, as well as \(t^{j}_{d}\), the moment of the j-th detected intake cycle in the same meal session, performance metrics were calculated by the following evaluation scheme. If, for a given true intake cycle i, \(t^{j}_{d}\) is outside \([t^{i}_{s}, t^{i}_{e}]\) for every detected intake cycle j, then it counted as a false negative. Otherwise it counted as a true positive. However, every other occurrence of a detected intake cycle in the same \([t^{i}_{s},t^{i}_{e}]\) interval counted as a false positive. Finally, if a detected intake cycle did not belong in \([t^{i}_{s},t^{i}_{e}]\) for any i, then it also counted as a false positive.
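The counting rules above can be written down compactly. The following Python sketch assumes that the true intake intervals of a session do not overlap; the function name and return format are illustrative.

```python
def evaluate(true_cycles, detections):
    """Count TP, FP and FN for one meal session.

    true_cycles: list of (t_s, t_e) intervals; detections: list of t_d."""
    hits = [0] * len(true_cycles)
    fp = 0
    for t in detections:
        idx = next((i for i, (s, e) in enumerate(true_cycles) if s <= t <= e), None)
        if idx is None:
            fp += 1                 # detection outside every true interval
        else:
            hits[idx] += 1
    tp = sum(1 for h in hits if h > 0)
    fn = sum(1 for h in hits if h == 0)
    fp += sum(h - 1 for h in hits if h > 1)   # extra detections in one interval
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return tp, fp, fn, precision, recall, f1

# Three true cycles; two detections fall in the first, one in the second,
# none in the third, and one detection lies outside all intervals.
result = evaluate([(0, 5), (10, 15), (20, 25)], [1, 2, 12, 17])
```

In this toy case the scheme yields 2 true positives, 2 false positives and 1 false negative, i.e. a precision of 0.5 and a recall of 2/3.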

We used Leave One Subject Out (LOSO) cross validation for both training steps of the pipeline. As a result, for the evaluation of a single subject in the corpus, we trained ten SVM arrays and one LSTM network. Since the LSTM is trained in a stochastic fashion, we repeated the LSTM training process ten times, resulting in a total of 100 SVM arrays and 100 LSTM networks for the entire corpus. Experimentation with a small subset of the corpus led us to set the C and \(\gamma \) parameters of the SVM to 100 and 0.1 respectively. Similarly, the threshold parameter \(T_{d}\) was set to 0.89 by picking the value that achieved the highest F1 score.

We used precision and recall for evaluation. The approaches of [2, 4] were also implemented and evaluated against the same dataset. Parameter selection for those approaches was performed according to the authors’ suggestions. Figure 3 depicts the precision-recall curves for all approaches, while Table 3 provides numerical results for the top F1 score. The decimals in the TP and FN columns arise from the averaging over the ten LSTM training repetitions.

Fig. 3. Precision-recall curves for the proposed approach (blue dash-dot line), the approach by [4] (red dash line) and by [2] (black dotted line). (Color figure online)

Table 3. Evaluation results.

5 Conclusions

We presented a method for detecting food intake cycles during a meal, using an off-the-shelf smartwatch. Results on a 10-subject publicly available corpus indicate that the combination of multiple micro-movement SVMs and an LSTM network for score sequence classification is highly effective and outperforms similar approaches found in the literature.