1 Introduction

With the growing number of available wearables and smart devices, the presence of multiple devices on a person is becoming more common. While some devices are designed to supplement each other, e.g. smartwatches and smartphones, most are intended to work standalone. When appliances combine different devices, they can leverage each device's benefits while mitigating possible negative aspects [26]. Chen et al., for example, investigate the design space of a smartwatch and smartphone combination to better distinguish gestures and differentiate feedback [4]. The drawback is an increased complexity of the overall system, especially of a UI distributed across devices. New problems may also surface, e.g. a lack of knowledge about the user's intention to interact with a specific device, or image overlap when combining see-through displays with conventional ones.

The potentials and drawbacks are illustrated by the use case of combining a smartphone and a head-mounted display (HMD), e.g. smart glasses. Smartphones are suited for more complex information and offer versatile interaction capabilities. However, common UIs require focused visual attention during interaction. Smart glasses provide visual output while allowing the user to observe the environment. The comparatively low display resolution of current smart glasses is less suited for the presentation of complex and extensive data, and input capabilities are often limited. The two device types complement each other, each offering a workaround to the other's drawbacks. But consider looking through the smart glasses at the smartphone: the displayed data would overlap, possibly leading to loss or misinterpretation of information, which can affect the user's efficiency or cause dangerous situations. This is especially true in professional use, since working environments may require most of the user's attention, leaving only short periods for interaction with a mobile appliance [13, 22]. The user may also be engaged in physical activities, e.g. walking, while interacting [2, 17]. By temporarily disabling information output on the smart glasses, the problem can be avoided.

Designing such adaptive UIs requires an understanding of the user's intention to interact in a particular situation. In particular, knowledge about the user's attention is needed to adapt to the way the system is interacted with. Human attention in general is complex and is considered a limited resource. In a typical use case, various issues may need attention, the mobile appliance being only one of them. Consequently, users are required to switch attention between appliance and environment. Although several factors influence the cognitive process of attention, human gaze provides hints regarding its actual focus [6, 31]. The user's visual focus of attention (VFOA) can be derived from the eye fixation [8]. To avoid displaying overlapping data, the device holding the user's VFOA should be given preference for presenting visual information.

Previous research has produced different methods to estimate the VFOA in stationary and mobile settings. These are discussed in the following section along with other relevant work. Based on this, an efficient yet simple approach for VFOA estimation for combined mobile devices is presented. It uses the orientation of smart devices to recognize postures associated with a specific visual focus. Head-mounted devices can provide the most valuable information regarding the visual focus of the user. Therefore, we focus on a set-up using such a head-mounted device in a first step. The required parametrization for the approach was obtained in an experiment with 12 participants and evaluated with 8 additional participants.

2 Related Work

Recognizing the intention to interact with a device can be based on several aspects. Schwarz et al., for example, consider body and head posture as well as gestures to determine the intention to interact with a wall screen [27]. Interaction with visual displays requires the display to be visible at some point, making the detection of a user looking at the device a primary concern. Information about the location and content of what a person is looking at is called the visual focus of attention [19, 30, 32]. It is indicated by the gaze direction, including the head pose. It can be independent of the overall focus of attention, e.g. when the person is in deep thought. When these cannot be measured, pattern recognition on a device's motion sensor data can suffice [1], although the lack of head pose information makes this somewhat less reliable. Similar approaches are used in recent smartwatch designs: the Galaxy Gear smartwatch and the Apple Watch, for example, determine the user's VFOA through pattern recognition on the watches' Inertial Measurement Unit (IMU) data to activate the display only when needed. This requires the user to perform a specific movement or to hold a posture for a short time.

Sometimes it is possible to measure the gaze direction, e.g. using eye-tracking glasses; eye tracking is the quantitative recording of eye movements. Most methods are based on image recognition techniques [10]. Eye tracking provides high-quality data, but its application to mobile environments is limited. Common systems require firm fixation, calibration and in most cases special training [10]. Nonetheless, gaze analysis in mobile applications can be precise if used in prepared environments. Tracking can be divided into inside-out and outside-in approaches [24]. Inside-out approaches use cameras worn by the user to detect markers in the environment. Outside-in approaches use cameras in the environment to track markers on the user. Both require markers or even cameras to be placed inside or alongside the tracking volume. Also, to compute 3D gaze points, a correct model of the environment is required [9]. Despite the development of automatically calibrating and less cumbersome eye tracking systems, these requirements alongside high costs are some of the reasons the technique is mainly used in scientific contexts [21].

Detailed visual attention is currently rarely considered in mobile applications. The simpler approach of face detection with the smartphone camera is found, e.g., in screen unlocking mechanisms [29], games, or screen dimming mechanisms, as in recent Samsung smartphones. Image-based methods suffer from some drawbacks: they can be sensitive to lighting conditions and thus perform poorly. In the field of Human Activity Recognition, visual methods are therefore mainly used in constrained environments; for dynamic use, body-worn inertial sensors are preferred [3]. An alternative is to leverage the orientation of the head, which yields a reasonable approximation [16, 30, 32, 34]. Farenza et al. [7] use this to estimate the subjective view frustum, which bounds the portion of the scene a subject is looking at. Possible poses are limited by human motion constraints. This facilitates processing sensor data, e.g. for gait analysis, and is especially helpful for determining the relations of poses of different body parts. Seel et al. use knowledge about kinematic constraints in joint movements to determine joint angles and positions [28]; to accomplish this, all involved limbs were fitted with sensors.
Using spatial relations between devices is not new (see e.g. [14]). However, the emergence of smart devices has expanded the possible inter-device interactions considerably. Chen et al., for example, study the design space of a smartphone-smartwatch set-up and use Decision Tree learning on accelerometer data [4]. They also give a more general overview of interaction techniques for hand-held/wrist-worn devices and multi-device systems. Our approach also targets multi-device set-ups, albeit focusing on at least one head-mounted device, e.g. smart glasses. It relies on the inertial data of two or more devices, specifically the orientation derived from it, to estimate the VFOA. We aim at scenarios where head-mounted devices are already in use, or where the added weight is an acceptable trade-off for higher precision than found, e.g., in the aforementioned smartwatch designs. For ease of use, we refrained from adding sensors and used only those already built in. The data is interpreted using knowledge about anatomy and device mounting.

3 Approach

The basic idea is to estimate the current VFOA by tracking posture. We demonstrate how posture is used as an indicator for the VFOA in the given context of body-worn devices. Subsequently, practical issues regarding the collection and interpretation of the required data are discussed. Finally, the algorithm used for the actual recognition is presented.

Using VFOA information, it is possible to distinguish between focusing on a mobile device and focusing on the environment. Considering a system consisting of smartphone and smart glasses, the devices and the environment provide the possible VFOA targets. As smart glasses are worn to observe the environment and digital information simultaneously, smart glasses and environment can be treated as a single VFOA target in this case. The problem of VFOA estimation is thus simplified to the distinction whether the smartphone is being looked at or not.

Visual perception is mainly driven by the location of the sensory organs as well as the size and position of the object in focus [18]. Depending on individual anatomic prerequisites and the target's spatial properties, various postural changes may be necessary to perceive a particular object visually. It may be necessary to rotate the head. If the object to focus on is worn on one's own body, one may also need to move the respective body parts into view. For a forearm-mounted smartphone, the user has to raise and rotate his forearm. Figure 1 illustrates user postures when wearing smartphone and smart glasses.

Fig. 1. Typical postures (left to right): (a) smartphone focused; (b) environment focus, arm lowered; (c) environment focus, arm raised; (d) environment focus, like (a) but arm lowered

A posture where the user's VFOA is likely to be located at the smartphone is depicted in Fig. 1a. The smartphone is positioned in front of the body, with the screen rotated towards the head. Additionally, the user has to rotate and lower his head. As anatomic characteristics restrict the range of motion, it is sufficient to consider the relative orientation between head and forearm. If the rotation of either head or forearm differs significantly (Fig. 1b–d), the smartphone is not visible. The VFOA is then likely to be located in the environment. As Fig. 1c and d show, considering only one component is not sufficient, as the remaining information allows for ambiguous interpretation. If only the forearm were considered, Fig. 1a and c would result in the same estimation, although in Fig. 1c the user is clearly not looking at the smartphone. To estimate the user's VFOA target, the positions of at least two body parts, namely head and forearm, have to be considered. A tolerance has to be taken into account due to the varying contexts of motion and constraints in motor control [33].

Motion capturing in mobile environments is usually performed using inertial measurement units (IMUs) [25]. In comparison to established videogrammetric methods, they provide less accurate tracking, as the latter require static setups to achieve their precision. IMUs suffer from accumulated errors by design. To compensate for this, rotation is typically computed from a combination of magnetometer, gyroscope and accelerometer data; one widespread method is the use of Kalman filters [12]. IMUs are incorporated in smartphones and smart glasses. As the smart glasses are worn on the head, their IMU can be assumed to be a suitable approximation of the head's orientation. Accordingly, the smartphone's IMU can be interpreted as the forearm orientation due to its mount. The quality of IMUs found in smart devices is considerably lower than that of dedicated motion capturing systems, so inaccuracies should be taken into account.

Orientation in space is typically represented as a rotation around one or multiple axes. Two common notations are Euler angles and quaternions. The former uses successive rotations about three different axes. The latter is based on Euler's rotation theorem, which states that any rotation or sequence of rotations about a fixed point can be represented by a single rotation by a given angle about a fixed axis running through that point. Apart from other advantages, quaternions avoid the gimbal lock problem inherent in the Euler angle representation, i.e. the loss of a degree of freedom when the middle rotation (the pitch angle in the common yaw-pitch-roll convention) approaches 90°. For further reading on representing orientation, see [5]. The orientation of a device is defined by its rotation quaternion \( q_{device} = \left( x, y, z, w \right)^{T} \). The rotational difference between two devices can be described as the transformation quaternion:

$$ q_{\delta_{1,2}} = q_{1}^{-1} * q_{2} $$

It contains the information about the rotation necessary to rotate from the orientation of the first device to that of the second. In other words, it describes the orientation of the devices towards each other. Let \( q_{ref} \) be the rotation between two devices while the VFOA is known to be on a specific device, and let ‘∙’ denote the inner product. For the current rotation between the two devices, the distance to the reference rotation can then be obtained by the metric:

$$ \varPhi : \mathbb{R}^{4} \times \mathbb{R}^{4} \to \left[ 0,1 \right], \quad \varPhi\left( q_{1}, q_{2} \right) = 1 - \left| q_{1} \cdot q_{2} \right| $$

It was originally developed to describe the difference between Euclidean transformations [15]. Its main advantage is a relatively low computational expense [11]. For the case of two devices, a rotation and thereby a posture \( q_{current} \) is recognized if and only if its distance to \( q_{ref} \) is below a threshold \( \varepsilon \in [0, 1] \):

$$ \varPhi\left( q_{ref}, q_{current} \right) < \varepsilon $$

Note that this only requires determining the rotational data of the desired posture; no sophisticated learning is needed. In general, this approach can be extended to more than two body parts or respective devices. However, computing the rotational differences naively results in a quadratically rising computational effort in the number of devices. For a fixed number of devices, the computational complexity is constant. A common smartphone (Galaxy S3, GT-I9300, Samsung) performs the computation in less than a millisecond; latency is therefore negligible.
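To make the procedure concrete, the following is a minimal sketch (not the authors' implementation) of the posture check built from the formulas above: the transformation quaternion between the two device orientations is compared to a reference rotation using the metric \( \varPhi \). Quaternions are assumed to be unit quaternions in the \( (x, y, z, w) \) order used in the text; the default threshold of 0.20 anticipates the generalized value determined in Sect. 4.

```python
# Minimal sketch of the posture-based VFOA check described above.
from typing import Tuple

Quaternion = Tuple[float, float, float, float]  # (x, y, z, w)

def conjugate(q: Quaternion) -> Quaternion:
    x, y, z, w = q
    return (-x, -y, -z, w)  # inverse of a unit quaternion

def multiply(q1: Quaternion, q2: Quaternion) -> Quaternion:
    """Hamilton product q1 * q2."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return (
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    )

def transformation(q1: Quaternion, q2: Quaternion) -> Quaternion:
    """q_delta = q1^-1 * q2: rotation from device 1's orientation to device 2's."""
    return multiply(conjugate(q1), q2)

def distance(q1: Quaternion, q2: Quaternion) -> float:
    """Phi(q1, q2) = 1 - |q1 . q2|, a value in [0, 1]."""
    dot = sum(a * b for a, b in zip(q1, q2))
    return 1.0 - abs(dot)

def vfoa_on_phone(q_glasses: Quaternion, q_phone: Quaternion,
                  q_ref: Quaternion, epsilon: float = 0.20) -> bool:
    """True if the current head/forearm posture matches the reference posture."""
    q_current = transformation(q_glasses, q_phone)
    return distance(q_ref, q_current) < epsilon
```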

4 Experiment 1: Parametrization User Study

A reference quaternion \( q_{ref} \), describing the posture of the user when looking at the smartphone, and the allowed deviation \( \varepsilon \) have to be determined. The data was gathered in a lab under the mobile usage conditions of walking and standing to get a more dynamic range of motions in comparison to, e.g., sitting. Controllability of the environment was of primary concern, although real-world application will likely involve more unrelated movement than a lab setting. Twelve employees (5 female) of a research institute were recruited as participants; mean age was 31.25 (SD = 3.34) years.

One of four letters was displayed using a wall projection. Each appearance of a letter was announced by an audio signal. The participants had to enter the letter using a software keyboard on the smartphone. Random reordering of the keyboard required looking at the device, so the posture captured at the time of input corresponds to the VFOA being on the smartphone. Each input was followed by a random pause of 5, 6 or 7 s, which allows for the completion of a gait cycle and the resumption of the accompanying arm movement [23]. Two movement speeds were distinguished: standing (S, 0 km/h) and walking (W, 5 km/h). The sequence of conditions was permuted to avoid learning effects, and the assignment of participants to sequences was randomized. All participants completed the task under both conditions on a treadmill, ensuring equal conditions. Both conditions required 60 letters to be processed. The participants were asked to lower the arm after each input.

All sessions were tracked by an external motion capturing system (Vicon Bonita, Vicon) to reveal whether the reduced quality of the built-in IMUs' data causes problems. The devices used were fitted with IR markers. A head-mounted display (Lite-Eye LE-750 A, Lite-Eye Systems) with an attached IMU (InertiaCube3, Intersense) was used as a substitute for smart glasses. A smartphone (Galaxy S3, GT-I9300, Samsung) was mounted on the left forearm, which allowed wearing it next to the wrist like a wristwatch. The IMU data was tracked at 40 Hz and additionally logged on every input. Each run was recorded on video. Checking the IMUs against the external motion tracking revealed no significant deviations.

For each participant and condition a reference quaternion was determined, using the mean transformation quaternion [20] between smart glasses and smartphone over the first three inputs. All following inputs were compared to this reference. One data set was excluded, as the experimental task was not followed through. A posture was recognized if, for the frame captured during input, the rotational difference was below the threshold. To account for visually verifying the input, a tolerance of two seconds after the input was granted. Detections before the notification or after this tolerance were counted as false positives (\( fp \)); this includes potentially correct recognitions outside of the task. To get a realistic impression of the impact of false positive detections, coherent sequences were counted only once. This curbs the impact of events such as users looking at the smartphone between the inputs, e.g. to adjust or check the mounting. Only thresholds providing a recognition rate \( r > 80\,\% \) and false positives \( fp \le 15 \) were considered. As the exact number of possible negatives is unknown for the experimental task, the false positives were set in relation to the recognition rate to determine a threshold providing both a high recognition rate and a low number of false positives. The resulting quotient is \( Q = fp/r \), with \( r \) being the recognition rate. The recognition rates and false positive detections for each condition are shown in Table 1.
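For illustration, a minimal sketch of how such a reference posture could be derived from the first few "looking at the phone" samples. The paper cites a dedicated quaternion-averaging method [20]; for nearly identical rotations, a sign-aligned, normalized component-wise mean is a common approximation, and this hypothetical sketch assumes that simplification.

```python
# Hedged sketch: approximate mean of a few nearly identical unit quaternions.
import math
from typing import List, Tuple

Quaternion = Tuple[float, float, float, float]  # (x, y, z, w)

def mean_quaternion(samples: List[Quaternion]) -> Quaternion:
    """Sign-aligned, normalized component-wise average of close rotations."""
    assert samples, "need at least one posture sample"
    ref = samples[0]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q in samples:
        # q and -q describe the same rotation; align signs before averaging
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0 else -1.0
        for i in range(4):
            acc[i] += sign * q[i]
    norm = math.sqrt(sum(c * c for c in acc))
    return tuple(c / norm for c in acc)  # renormalize to a unit quaternion
```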

Table 1. Recognition rates and false positives for different thresholds

For the condition "standing", \( Q \) is minimized by \( \varepsilon_{S} = 0.17 \). For "walking", the best results were found for \( \varepsilon_{W} = 0.25 \). The stricter threshold for "standing" could stem from the experimental task itself being rather monotonous, or from the difference to the motion sequence induced by walking. Video analysis shows that participants produced a greater variety of postures when standing than when walking.

Roughly three different types of data emerged; Fig. 2 gives an idea of this variety. The charts show the distances of the transformation quaternions to the reference. The data set in Fig. 2a shows a clear pattern: at the time of input (dashed line) the distance between transformation and reference quaternion drops sharply. The aforementioned effects for "standing" can be seen in Fig. 2b: the data is noisy, as the participant swung his arms between the inputs. Some participants adapted to the task, as depicted in Fig. 2c; the differences between the posture during input and the idle posture in between steadily decrease. To find a threshold covering both conditions, the mean of the two thresholds, 0.21, seems to be a suitable starting point.
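The per-condition threshold selection described above can be sketched compactly. The evaluate() function below is a hypothetical stand-in for replaying the logged IMU data against one candidate threshold; it is assumed to return the recognition rate (as a fraction) and the false positive count.

```python
# Hedged sketch of the threshold search: among candidate epsilons meeting the
# constraints from the text (r > 80 %, fp <= 15), pick the one minimizing Q = fp / r.
from typing import Callable, Iterable, Optional, Tuple

def select_threshold(
    candidates: Iterable[float],
    evaluate: Callable[[float], Tuple[float, int]],  # epsilon -> (r, fp)
    min_rate: float = 0.80,
    max_fp: int = 15,
) -> Optional[float]:
    best_eps, best_q = None, float("inf")
    for eps in candidates:
        r, fp = evaluate(eps)
        if r <= min_rate or fp > max_fp:
            continue                 # discard thresholds outside the constraints
        q = fp / r                   # quotient Q from the text
        if q < best_q:
            best_eps, best_q = eps, q
    return best_eps                  # None if no candidate qualifies
```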

Fig. 2. Exemplary data sets from the parametrization for individual reference quaternions and \( \varepsilon = 0.2 \). From top to bottom: (a) normal behavior, (b) noise-inducing arm movement and (c) progressive adaptation to the detection threshold. The graphs depict the distance to the reference quaternion over time; dashed lines mark input notifications, circles indicate VFOA detections.

However, reducing the threshold to \( \varepsilon_{G} = 0.20 \) achieves a lower error count while maintaining the recognition rate. The question arises whether the reference quaternion can be generalized as well. We computed the mean of all reference quaternions across both conditions, \( q_{refG} \). Applying the algorithm to the measured data, using the generalized reference quaternion \( q_{refG} \) and threshold \( \varepsilon_{G} \), resulted in mean recognition rates of 97.18 % (SD 6.27 %) with 4.83 false positives (SD 5.97) while standing, and 98.98 % (SD 1.82 %) with 3.80 false positives (SD 5.49) while walking. The results suggest that it is feasible to neglect individual differences to a certain degree, making individual configuration unnecessary. This was evaluated in a separate experiment.

5 Experiment 2: Evaluation

The evaluation's objective was to investigate a practical use case as well as the obtained parameters. To this end, the task of unlocking a smartphone screen was chosen. Apart from task performance, we wanted to investigate the reliability of the obtained general parameters when used to detect the VFOA of persons who were not involved in the parametrization process. Eight further employees of a research institute (4 female) were recruited as participants; mean age was 30.75 (SD = 2.99) years. The evaluation was performed under conditions similar to the parametrization; only the logged data, task, design and participant sample differed. Notably, the VFOA detection was computed in real time. A lab setting was used again for controllability.

Basically, the same task as in the parametrization was used, but the screen had to be unlocked before performing any input. Two conditions were used: Under the manual condition (M), users unlocked the screen by pressing a specific button, which resembled the device's standby button in size and location and was placed in the upper left corner of the display. Under the automatic condition (A), the algorithm was used to estimate whether the user's VFOA was on the smartphone and to unlock the screen accordingly. In case the algorithm failed, users had to unlock the screen as under the manual condition. Movement speed was considered as before: "standing" (S, 0 km/h) and "walking" (W, 5 km/h). The participants were randomly assigned to the sequences. Again, a treadmill was used to ensure equal conditions for the participants. The sequence of the conditions was permuted to avoid learning effects. Each participant completed the task under each condition; 20 letters had to be processed. The apparatus stayed the same. The time span between notification and subsequent input was logged additionally, as was the interaction with the unlock button under the automatic condition. Access times for the different conditions were compared. Data analysis remained the same for recognition rates and false positive detections.

Regarding the reliability of the approach for persons not involved in the parametrization, the recognition rates as well as the numbers of false positives were analyzed. For "standing" (S), the screen always unlocked when required, resulting in a recognition rate of 100 %. For "walking" (W), the algorithm failed once, resulting in a mean recognition rate of 99.41 % across all participants. On average, 1.00 (SD = 2.29) false positive detections occurred for "standing" and 1.63 (SD = 2.60) for "walking". We compared the general performance of both parametrization and evaluation by relating the numbers of false positives to the experiment durations (Fig. 3). As can be seen, the approach yields results comparable to experiment 1, even without individual configuration. Using the VFOA algorithm, the times required to unlock the screen were 1.62 (SD = 0.14) seconds for "standing" (S) and 1.66 (SD = 0.2) seconds for "walking" (W). Manually unlocking the screen took 2.29 (SD = 0.31) seconds for "standing" (S) and 2.38 (SD = 0.48) seconds for "walking" (W). This shows that automatic screen unlocking is significantly faster than manual unlocking.
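For illustration, a hedged sketch of how the automatic condition could be wired together: at each IMU sample (40 Hz in the experiments), the posture check from the Sect. 3 sketch decides whether to unlock the screen. The callbacks read_glasses(), read_phone() and unlock_screen() are hypothetical placeholders, not part of the evaluation software.

```python
# Hedged sketch of the automatic unlock loop (condition A).
# Reuses vfoa_on_phone() from the Sect. 3 sketch; callbacks are placeholders.
import time

def unlock_loop(read_glasses, read_phone, unlock_screen,
                q_ref, epsilon=0.20, rate_hz=40):
    """Poll both IMUs and unlock the screen once the VFOA posture is detected."""
    period = 1.0 / rate_hz
    while True:
        q_glasses = read_glasses()   # current smart-glasses orientation (x, y, z, w)
        q_phone = read_phone()       # current smartphone orientation (x, y, z, w)
        if vfoa_on_phone(q_glasses, q_phone, q_ref, epsilon):
            unlock_screen()
            return
        time.sleep(period)           # approximate the 40 Hz sampling rate
```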

Fig. 3. False positives per minute for both experiments

6 Summary and Outlook

We motivated the value of knowledge about the visual focus of attention (VFOA) for adaptive systems with multiple mobile devices available to the user. User interfaces seeking to avoid problems caused by multi-device usage should consider the user's current VFOA. Image-based approaches still have problems when used in mobile contexts. An alternative approach, using device orientation and knowledge about anatomical limitations, was proposed. The algorithm recognizes postures associated with the VFOA being on the smartphone.

Two experiments were performed to parameterize and evaluate the approach. A general parametrization covering use while standing and walking was obtained. A subsequent evaluation investigated the practicability of the proposed algorithm using the identified parameters. The results show that the VFOA can be estimated in real time, even without individually configured parameters.

The approach can be scaled up to more devices, so a future goal is to study the potential of systems comprising, e.g., smartphone, smartwatch, smart glasses and/or tablets. Additional validation is required to drive the approach towards practical use. The results have to be compared to other methods in an experiment with more participants, covering a broader population and real-world applications. We currently rely on the mobile devices being mounted to the user's body. An interesting question is whether good results can also be achieved when the devices' positions are only roughly known, as is the case, e.g., for hand-held devices.