3D Visual Tracking to Quantify Physical Contact Interactions in Human-to-Human Touch

Xu, Shan; Xu, Chang; McIntyre, Sarah; Olausson, Håkan; Gerling, Gregory J.

doi:10.3389/fphys.2022.841938

ORIGINAL RESEARCH article

Front. Physiol., 09 June 2022

Sec. Physio-logging

Volume 13 - 2022 | https://doi.org/10.3389/fphys.2022.841938

This article is part of the Research Topic Social Touch View all 11 articles

3D Visual Tracking to Quantify Physical Contact Interactions in Human-to-Human Touch

Shan Xu¹

¹School of Engineering and Applied Science, University of Virginia, Charlottesville, VA, United States
²Center for Social and Affective Neuroscience (CSAN), Linköping University, Linköping, Sweden

Across a plethora of social situations, we touch others in natural and intuitive ways to share thoughts and emotions, such as tapping to get one’s attention or caressing to soothe one’s anxiety. A deeper understanding of these human-to-human interactions will require, in part, the precise measurement of skin-to-skin physical contact. Among prior efforts, each measurement approach exhibits certain constraints, e.g., motion trackers do not capture the precise shape of skin surfaces, while pressure sensors impede skin-to-skin contact. In contrast, this work develops an interference-free 3D visual tracking system using a depth camera to measure the contact attributes between the bare hand of a toucher and the forearm of a receiver. The toucher’s hand is tracked as a posed and positioned mesh by fitting a hand model to detected 3D hand joints, whereas a receiver’s forearm is extracted as a 3D surface updated upon repeated skin contact. Based on a contact model involving point clouds, the spatiotemporal changes of hand-to-forearm contact are decomposed as six, high-resolution, time-series contact attributes, i.e., contact area, indentation depth, absolute velocity, and three orthogonal velocity components, together with contact duration. To examine the system’s capabilities and limitations, two types of experiments were performed. First, to evaluate its ability to discern human touches, one person delivered cued social messages, e.g., happiness, anger, sympathy, to another person using their preferred gestures. The results indicated that messages and gestures, as well as the identities of the touchers, were readily discerned from their contact attributes. Second, the system’s spatiotemporal accuracy was validated against measurements from independent devices, including an electromagnetic motion tracker, sensorized pressure mat, and laser displacement sensor. While validated here in the context of social communication, this system is extendable to human touch interactions such as maternal care of infants and massage therapy.

Introduction

Social and emotional communication by touch is important to human development in daily life. It contributes to brain and cognitive development in infancy and childhood (Cascio et al., 2019), and plays a role in providing emotional support (Coan et al., 2006), and forming social bonds (Vallbo et al., 2016). For example, being touched by one’s partner mitigates one’s reactivity to psychological pressure, as observed in decreased blood pressure, heart rate, and cortisol levels (Gallace and Spence, 2010). Behaviors such as compliance, volunteering, and eating habits are also positively improved (Gallace and Spence, 2010). Moreover, several works now indicate that particular social messages and emotional sentiments can be readily recognized from touch alone (Hertenstein et al., 2006; Hertenstein et al., 2009; Thompson and Hampton, 2011; Hauser et al., 2019a; McIntyre et al., 2021). Despite their importance and ubiquity, we have just begun to quantify the exact nuances in the underlying physical contact interactions used to communicate affective touch.

To decompose how physical contact interactions evoke sensory and behavioral responses, most prior studies employ highly controlled stimuli, which vary a single factor at a time. In particular, mechanical and thermal interactions are typically delivered to a person’s skin using robotically driven actuators (Löken et al., 2009; Essick et al., 2010; Ackerley et al., 2014a; Tsalamlal et al., 2014; Bucci et al., 2017; Teyssier et al., 2020; Zheng et al., 2020). For example, brush stimuli swept along an arc have been widely adopted to mimic caress-like stroking, while controlling their velocity, force, surface material, and/or temperature. Using such stimuli, C-tactile afferents are shown to be preferentially activated at stroke velocities around 1–10 cm/s, which align with ratings of pleasantness (Löken et al., 2009; Essick et al., 2010; Ackerley et al., 2014a). Beyond experiments to examine brush stroke, more complex interactions have been delivered via humanoid robots and robot hands (Teyssier et al., 2020; Zheng et al., 2020). However, device-delivered stimuli do not fully express the natural and subtle complexities inherent in human-to-human touch. This can result in disconnect with the everyday, real-world interactions for which our sensory systems are finely tuned.

Measuring and quantifying free and unconstrained human-to-human touch interactions is complex and challenging. In particular, the physical interactions are unscripted, unconstrained, and individualized with rapid and irregular transitions. Indeed, multiple contact attributes often co-vary over time, e.g., lateral velocity, contact area, indentation depth. Therefore, in moving toward quantification, the initial efforts used qualitative, manual annotation to describe touch gestures, and their contact intensity and duration (Hertenstein et al., 2006; Hertenstein et al., 2009; Yohanan and MacLean, 2012; Andreasson et al., 2018). While adaptable to a wide range of touch interactions and settings, qualitative methods are constrained by the time required to analyze the data, the potential subjectivity of human coders, and a courser set of metrics and classification levels. For instance, contact intensity is typically classified in only three levels as light, medium, strong. As a result, automated techniques have been introduced, such as electromagnetic motion trackers (Hauser et al., 2019a; Lo et al., 2021) and sensorized pressure mats (Silvera-Tawil et al., 2014; Jung et al., 2015), with each their own capabilities and limitations. For instance, electromagnetic trackers capture the movement of only a handful of points, thus unable to monitor complex surface geometry, and can emit electromagnetic noise incompatible with sensitive biopotential recording equipment. Pressure sensors and mats inhibit direct skin-to-skin contact, when even thin films are shown to attenuate touch pleasantness (Rezaei et al., 2021). Three-dimensional optical tracking methods have also been employed, such as infrared stereo techniques (Hauser et al., 2019a; Hauser et al., 2019b; McIntyre et al., 2021), motion capture systems (Suresh et al., 2020), and stereo cameras with DeepLabCut (Nath et al., 2019). While these methods are specialized in tracking joint positions of hands and limbs, they do not capture the shape and geometry of body parts, since the infrared cameras lack sufficient accuracy on depth, motion capture systems only track pre-attached markers, and stereo matching of multiple cameras often fail with texture-less surfaces. In contrast, depth cameras can provide high spatial resolution point clouds and allow shape extraction of texture-less body parts, such as a forearm. Depth cameras, as well, are more readily set up without calibration, afford minimum magnetic interference, and can be located at a larger distance from the area of interest. While depth cameras have been used in hand tracking and 3D reconstruction (Rusu and Cousins, 2011; Taylor et al., 2016), they have not been used to measure contact interactions in human-to-human touch.

While defined to a degree, we are still deciphering those physical contact attributes vital to social touch communication. In such settings, human touch interactions tend to include gesture, pressure/depth, velocity, acceleration, location, frequency, area, and duration (Hertenstein, 2002; Hertenstein et al., 2006; Hertenstein et al., 2009; Yohanan and MacLean, 2012; Silvera-Tawil et al., 2014; Jung et al., 2015; Andreasson et al., 2018; Hauser et al., 2019a; Hauser et al., 2019b; Lo et al., 2021; McIntyre et al., 2021). To understand the functional importance of specific movement patterns, certain attributes such as spatial hand velocity have been further decomposed into directions of normal and tangential (Hauser et al., 2019a) or forward-backward and left-right (Lo et al., 2021). Moreover, simultaneous tracking of multiple contact attributes is needed for understanding naturalistic, time-dependent neural output of peripheral afferents. For example, a larger contact area should recruit more afferents, larger force or indentation should generate higher firing frequencies, and optimal velocity in tangential direction should evoke firing of C-tactile afferents (Johnson, 2001; Löken et al., 2009; Hauser et al., 2019b).

Herein, we develop an interference-free 3D visual tracking system to quantify spatiotemporal changes in skin-to-skin contact during human-to-human social touch communication. Human-subjects experiments evaluate its ability to discern unique combinations of contact attributes used to convey distinct social touch messages and gestures, as well as the identities of the touchers. Moreover, the system’s spatiotemporal accuracy is validated against measurements from independent devices, including an electromagnetic motion tracker, sensorized pressure mat, and laser displacement sensor.

Human-to-Human Contact Tracking System

This work introduces a 3D visual tracking system and data processing pipeline, which used a high-resolution depth camera to quantify contact attributes between the bare hand of a toucher and the forearm of a receiver. As illustrated in Figure 1, the tracking system captured the 3D shape and movements of the toucher’s hand and the receiver’s forearm independently but simultaneously within the same camera coordinate system. Physical skin contact was detected between the hand and forearm based on interactions of their 3D point clouds. Seven contact attributes were derived over the time course of touch, which were contact area, indentation depth, contact duration, overall contact velocity, and its three orthogonal velocity components.

FIGURE 1

FIGURE 1. 3D visual tracking setup and data workflow. The toucher’s hand and receiver’s forearm are tracked using one depth camera (Microsoft Azure Kinect). Forearm shape is extracted as a point cloud while the hand mesh is animated by the gestures and movements of the toucher’s hand.

3D Shape and Motion Tracking With Depth Camera

The tracking procedure extracts the detailed 3D shape of the touch receiver’s forearm. By merging the camera’s RGB and depth information, an RGB-D image was derived and then converted into a dense point cloud per frame. The point cloud was cropped and downsampled to balance information and computation costs. To obtain a clean point cloud of the forearm without background, neighboring points around the forearm were first removed. Two removal methods were used alternatively based on the experimental setup (Figure 1). If the receiver’s forearm was placed on a flat surface, such as a table, the points within that flat surface could be removed in a shape-based manner using the plane model segmentation algorithm provided by the Point Cloud Library (PCL) (Rusu and Cousins, 2011). In the second case, if a monochromatic holder was set underneath the forearm, such as a cushion, then the points of that holder could be removed by color-based segmentation in the HSV color space. Next, the 3D region growing segmentation algorithm (Rusu and Cousins, 2011) was applied to separate the rest point cloud into multiple clusters according to the smoothness and distance between points. Since neighboring points around the forearm were removed in advance, points farther away in the background were assigned to separate clusters instead of being blended with the arm. Finally, by setting a relatively large smoothness threshold, all arm points could be grouped into one cluster despite the curvature of the forearm shape.

In human-to-human touch scenarios, the receiver’s forearm is frequently occluded by the toucher’s hand. Given that a blocked arm region is nearly impossible to capture, only the shape of the forearm prior to the contact was extracted. More specifically, the forearm point cloud was extracted before the beginning of each contact interaction to update its shape and position. During the contact, its position was refreshed in real-time according to the 3D position of the color marker on the arm, though its shape was not updated during the contact. Once the forearm was shape updated, the normal vector $n_{a r m}^{i}$ of each arm point $p_{a r m}^{i}$ was calculated and updated as well to facilitate further contact detection and measurement.

The hand tracking procedure was developed to capture the posture and position of the toucher’s hand by combining depth information with a monocular hand motion tracking algorithm (Zhou et al., 2020). The algorithm is robust to occlusions and object interactions, which is advantageous in hand-arm contact. The monocular tracking algorithm contains two neural network modules to predict the 3D location and rotation of all 21 hand joints. In the first module of the hand joint detection network, features extracted from the 2D RGB image were first fed into a 2-layer convolutional neural network (CNN) to detect the probability of the 2D position of all joints. Then, another two 2-layer CNN was used to predict the 3D position of hand joints based on 2D features and 2D joint position estimates. In the second module of the inverse kinematic network, a 7-layer fully connected neural network was designed to derive the 3D rotation of each joint. Finally, the parametric MANO hand model (Romero et al., 2017) was employed to incorporate 3D joint rotations to animate the hand mesh following the shape and pose of the toucher’s hand.

The rendered hand mesh was expressed in the local hand coordinate without the spatial information of the hand position. Therefore, depth information is incorporated here to locate the hand mesh in the camera coordinate, according to the movement of any hand joint or the color marker on the back of the hand (Figure 1). Specifically, the 2D position of the color marker was detected in the in the HSV, while the 2D position of the joint was retrieved from the detected 2D hand. The depth value of the hand joint or marker was derived by transforming the depth image to the RGB coordinate, which was then used to obtain its 3D position following the camera projection model. By identifying the corresponding point of that marker or joint in the hand mesh model, the posed hand mesh was moved in real-time following the toucher’s hand movements.

Definition of Contact Attributes

Hand-arm contact was measured in a point-based manner (Figure 2), which afforded higher resolution compared with a geometry-based method (Hauser et al., 2019a). First, a contact interaction between the hand and forearm was detected when at least one vertex point of the hand mesh was underneath the arm surface. More specifically, for each hand vertex point $p_{h a n d}^{i}$ , its nearest arm point $p_{a r m}^{i}$ was found first. Then, as detailed in Eq. 1, if the angle between the vector $p_{h a n d}^{i} - p_{a r m}^{i}$ and the normal vector $n_{a r m}^{i}$ of arm point $p_{a r m}^{i}$ is larger than or equal to 90°, this hand vertex is marked as underneath the arm surface.

F_{c o n t a c t} = {\begin{matrix} 1 & \forall (p_{h a n d}^{i} - p_{a r m}^{i}) \cdot n_{a r m}^{i} \leq 0 \\ 0 & \exists (p_{h a n d}^{i} - p_{a r m}^{i}) \cdot n_{a r m}^{i} > 0 \end{matrix} (1)

FIGURE 2

FIGURE 2. Definition of contact attributes. (A) Color image from video recorded by depth camera. Two color markers were placed on the toucher’s hand and the receiver’s forearm respectively to support motion tracking. (B) 3D forearm point cloud and hand mesh. Short black line segments represent the norm vector of arm points; red points on the forearm represent the region contacted by the hand. In the arm coordinate, the vertical axis (blue) is designated along the vertical direction pointing right upward, the longitudinal axis (green) is parallel with the arm direction from elbow to wrist, and the lateral direction is perpendicular to the two axes pointing to the internal side of the forearm. (C) Six time-series attributes include absolute velocity, which is the absolute value of spatial contact velocity; three orthogonal velocity components corresponding to the three axes of the arm coordinate; contact area, which is the overall area on the forearm being contact; and the indentation depth as the average depth applied on the forearm by the hand.

Physical contact attributes were calculated when hand-arm contact was detected. Indentation depth is measured as Eq. 2. In particular, $N_{C}$ is the number of hand vertex points contacted with the forearm. For each contacted hand point $p_{h a n d}^{i}$ , its indentation depth $d^{i}$ is approximated as half the distance between $p_{h a n d}^{i}$ and its nearest arm point $p_{a r m}^{i}$ . The half scale was used because the line between two points might not be perpendicular to the arm surface. The overall indentation $d$ deployed by the hand to the forearm is defined as the average indentation depth of all $N_{C}$ contacted hand points:

D e p t h = \frac{\sum_{i = 1}^{N_{C}} {‖ p_{h a n d}^{i} - p_{a r m}^{i} ‖}_{2}}{2 N_{C}} . (2)

Contact area is measured as the summed area of all contacted arm points. As shown in Eq. 3, the unit area $S^{i}$ for one arm point is calculated as a sphere whose radius is the average neighbor distance, and π is round to 3. Within the arm point cloud of $N_{a l l}$ points, the average neighbor distance $l_{n b r}^{i}$ is calculated as the average distance of all points to their nearest neighbor points:

A r e a = 3 N_{C} {(\frac{\sum_{i = 1}^{N_{a l l}} l_{n b r}^{i}}{N_{a l l}})}^{2} . (3)

In addition to cutaneous contact attributes, the velocity of hand movement was quantified when contact was detected. The absolute contact velocity $V_{a b s}$ is measured as the modulus of the spatial hand velocity $v_{H a n d}$ :

V_{a b s} = | \frac{p_{H a n d}^{t} - p_{H a n d}^{t - 1}}{△ t} | . (4)

In Eq. 4, hand position $p_{H a n d}$ is represented by the position of the middle metacarpophalangeal joint. By defining another coordinate on the receiver’s forearm (Figure 2C), spatial hand velocity $v_{H a n d}$ is further decomposed in the arm coordinate as three velocity components $V_{v t}$ , $V_{l g}$ , $V_{l t}$ parallel with its axis of the arm coordinate (Figure 2C). The vertical axis $i_{v t}$ of the arm coordinate is aligned with the vertical direction pointing upright. It could be obtained as the normal vector of a point on a horizontal surface, like a table, or the normal vector of a point on the top of the receiver’s forearm. Vertical velocity $V_{v t}$ is the hand velocity component in this direction:

V_{v t} = v_{H a n d} \cdot i_{v t} . (5)

The longitudinal axis $i_{l g}$ is aligned with the direction of the arm bone, pointing from elbow to wrist. To derive this axis, the camera was orientated to display the forearm vertically in the 2D image. Then, the direction of the arm bone in the 2D image was set to be parallel with the y axis of the image coordinate. By projecting the y axis $y$ of the camera coordinate onto the perpendicular plane of the vertical axis $n_{v t}$ , the longitudinal axis follows the direction of the projected vector:

i_{l g} = \frac{y - (y \cdot i_{v t}) i_{v t}}{{‖ y - (y \cdot i_{v t}) i_{v t} ‖}_{2}} . (6)

V_{l g} = v_{H a n d} \cdot i_{l g} . (7)

Lastly, the lateral axis $i_{l t}$ is perpendicular to the plane of longitudinal and vertical axis, following the right-hand rule:

i_{l t} = i_{l g} \times i_{v t} . (8)

V_{l t} = v_{H a n d} \cdot i_{l t} . (9)

Compared with the overall hand velocity, these velocity components can quantify the directional nature of the hand movements.

Moreover, contact duration is measured as a scalar value for each hand-arm touch interaction, which is the sum of time over which contact was detected. Given the recording frequency $f$ of the camera is 30 Hz and $N_{f}$ is the number of frames per interaction, the contact duration is measured as:

D u r a t i o n = \frac{\sum_{i = 1}^{N_{f}} F_{c o n t a c t}}{f} . (10)

Experiment 1: Human-to-Human Affective Touch Communication

The first experiment was designed with the task of human-to-human emotion communication. Touchers was instructed to deliver cued emotional messages, e.g., happiness, sympathy, anger, to the touch receiver at the receiver’s forearm using preferred gestures, e.g., tapping, holding, stroking. Recorded contact attributes were then used to differentiate delivered messages, utilized gestures, and individual touchers. Contact analysis was conducted on the platform with the Intel Core i9-9900 CPU, 3.1 GHz, 64 GB RAM, and a NVIDIA GeForce RTX 2080 SUPER GPU. The same platform was used for the second experiment.

Cued Emotional Messages and Gesture Stimuli

Seven emotions of anger, attention, calm, fear, gratitude, happiness, and sympathy were selected as cued messages for touchers to express (Table 1). Those messages were adopted from prior studies and have been observed to be recognizable through touch alone (Hertenstein et al., 2006; Hertenstein et al., 2009; Thompson and Hampton, 2011; Hauser et al., 2019a; McIntyre et al., 2021). Among them, gratitude and sympathy are prosocial expressions that are more effectively communicated by touch compared with those self-focused. Anger, happiness, and fear are universal expressions that are commonly communicated by facial, vocal, and touch expressions. Attention and calm are also preferred messages in touch interactions and can be correctly interpreted significantly better than chance. For each of the cued messages, three commonly used gestures were adopted from prior studies (Hertenstein et al., 2006; Thompson and Hampton, 2011; Hauser et al., 2019a; McIntyre et al., 2021) (Table 1). Holding and squeezing were combined into one since they share a similar hand gesture and hand motion. Similarly, hitting was combined with the tapping gesture, but only for the message of anger.

TABLE 1

TABLE 1. Available gestures for each cued emotional message in touch communication task.

Participants

The human-subjects experiments were approved by the Institutional Review Board at the University of Virginia. Ten participants were recruited as touchers, including five males and five females (mean age = 23.8, SD = 5.0). Another five participants were recruited as touch receivers with three males and two females (mean age = 24.0, SD = 4.4). Five experimental groups were randomly assembled, where each group consisted of one male toucher, one female toucher, and one receiver. Each group performed two experimental sessions with one session conducted by the male toucher and another one conducted by the female toucher. Written informed consent was obtained from all participants.

Experimental Setup

To avoid visual distractions during the experiment, touchers and receivers sat at opposing sides of an opaque curtain. They were instructed to not speak to each other. As shown in Figure 2A, a cushion was set on the table at the toucher’s side upon which the receiver rested her or his left forearm. Cued emotional messages and corresponding gestures were displayed to the toucher on the computer screen. The toucher could select the gesture and proceed to the next message using the computer’s mouse. Cued messages and the toucher’s selection of gestures were also recorded. As illustrated by a snapshot of the experiment recoding by depth camera (Figure 2A), the camera was set in front of the cushion and orientated towards it.

Experimental Procedures

In each session, seven cued emotional messages were communicated with each repeated six times. The 42 message instructions were provided in random order. In each trial, one message was displayed on the screen with three gestures listed below. Touchers had 5 s to choose a gesture and report it on the computer display. For each cued message, the three provided gestures were identical but their order was randomized trial by trial. After that, the toucher delivered the message, by touching the receiver’s forearm from elbow to wrist, using the right hand. Within each trial, only the chosen gesture was used. The use of other gestures or a combination of gestures was not allowed. For the same cued message across trials, touchers were free to use the same gesture or change to another gesture. A gesture could be deployed in any pattern of contact deemed appropriate by the toucher. No constraints or instructions were given for delivering the gesture, such as its duration, hand region employed, intensity, or repetition. At the end of a trial, by clicking the “Next” button on the bottom of the computer display, the toucher initiated the next trial with a new message word and corresponding three gestures.

Data Analysis

Overall, 420 trials were performed in ten experimental sessions. Twelve trials were excluded from analysis as contact interactions were not properly recorded. Statistical and machine learning analyses were performed to examine the measured contact attributes.

To identify the contact pattern between touch gestures, paired-sample Mann–Whitney U tests were applied across gestures per contact attribute. For time-series attributes, the mean value was used. Since longitudinal velocity, lateral velocity, and vertical velocity are signed variables, the mean was derived from the absolute value of those variables. Contact duration as a scalar variable was directly compared across gestures. To evaluate which of the contact attributes could best identify or describe a certain type of touch gesture, the importance of each attribute in predicting that gesture was identified using a random forest classifier. The mean values of time-series attributes together with the scalar attribute served as inputs. For example, in predicting the stroking gesture, all trials were labeled in a binary fashion as delivering or not delivering this gesture, instead of being labeled as the four gesture types. Seventy-five percent of trials were randomly assigned as the training set and those remaining were assigned as the test set. The permutation method was used to derive the importance of attributes. The value was obtained as the average of 100 repetitions of classification, with 10 permutations per classification.

Further classification analyses were performed regarding the discrimination of touch gestures, emotional messages, and individual touchers, respectively, using the random forest algorithm. Contact attributes were fed into classifiers in three different formats, including the mean value of each time-series attribute, multiple relevant features extracted from each time-series attribute, and the original time-series attributes. In particular, multiple features were extracted to quantify the amplitude, frequency, and dynamic characteristics of the time-series signal (Christ et al., 2018). For example, time-domain features included mean, maximum, quartiles, standard deviation, trend, skewness, entropy, energy, etc. Frequency domain features included autocorrelations and partial autocorrelations with different lags, coefficients of wavelet and Fourier transformations, mean, variance, skew of Fourier transform spectrum, etc. From all extracted features, relevant ones were selected for classification by significance tests in predicting the classification target and the Benjamini Hochberg multiple test (Christ et al., 2018). When time-series data were used, all attributes were concatenated into one variable as input (Löning et al., 2019). To identify attributes that could better encode social affective touch, the importance of individual attributes was ranked for each classification task. More specifically, based on the mean - value classification, the permutation method was repeated multiple times to derive the average importance values.

Results

Physical Contact Attributes in Human-to-Human Touch

Human-to-human physical contact interactions between social messages, gestures, and individual touchers were quantified by their contact attributes. As shown in Figure 3, exemplar data for the four touch gestures (shake, tap, hold, and stroke) exhibit distinct patterns across the contact attributes, consistent with expected hand movements per gesture. In particular, the stroking gesture was characterized by regular patterns in longitudinal velocity, which implies slow and repetitive movements along the direction of the forearm. For the shaking gesture, velocity attributes depicted large changes in frequency and relatively lower amplitude. Meanwhile, velocities in all three directions changed simultaneously, indicating a spatial direction in the movement of the toucher’s hand. The tapping gesture was quantified as discontinuous, large-amplitude spikes of short contact duration. Compared with other touch gestures, holding gesture exhibited relatively stable contact with minimal changes. With further inspection into each gesture, contact patterns with subtle differences could also be captured across emotional messages. Such as in the shaking gesture, happiness was delivered with higher velocities compared with the expression of fear. Within the tapping gesture, shorter but more intensive contact was recorded when expressing anger compared with attention.

FIGURE 3

FIGURE 3. Time-series recordings of each contact attribute across touch gestures and delivered messages. Distinct contact patterns were captured by the spatiotemporal changes of those attributes. The Contact variable represents the status of the being contacted or not. V_abs denotes the absolute contact velocity (cm/s), V_lg denotes the longitudinal velocity (cm/s), V_lt denotes the lateral velocity (cm/s), V_vt denotes the vertical velocity (cm/s), Area denotes the contact area (cm²), and Depth denotes the indentation depth (mm).

As shown in Figure 4A, the four touch gestures were statistically differentiable according to several of their contact attributes. For instance, absolute contact velocity can differentiate all gesture pairs except for that of stroking and shaking. With the contact attribute of longitudinal velocity, stroking was differentiable from shaking as it afforded higher longitudinal velocity. This also aligns with hand movements during stroking that are typically along the direction of the forearm. Both shaking and tapping gestures exhibited significantly higher longitudinal velocities than the holding gesture. With the lateral velocity, significant differences were derived among all four gestures, where tapping and shaking gestures afforded higher amplitudes than stroking and holding. As for the vertical velocity, the tapping gesture was associated with significantly higher velocities than others, which aligns with its up-down movements. Across all velocity attributes, the holding gesture was significantly distinct from other ones.

FIGURE 4

FIGURE 4. (A) Comparison of contact attributes across the four touch gestures. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001 were derived by paired-sample Mann–Whitney U tests. (B) Importance of certain contact attributes in identifying each touch gesture using random forest classification. Diamonds denote means; points denote importance values of 100 repetitions of classification.

For the contact area attribute, shaking and holding gestures exhibited significantly higher values than the stroking gesture, and then tapping. Indeed, participants generally used the whole hand to deliver holding and shaking, while only the finger digits for stroking and the fingertips for tapping. Moreover, with indentation depth and contact duration, tapping was distinct amongst the gestures with significantly lower depth and shorter duration. Note the hand motion with the tapping gesture could be faster than the recording frequency of the camera, where one trial of contact might not be entirely captured and thus lead to a lower estimation of indentation depth.

In Figure 4B, the contact attributes that were salient in identifying or describing a specific touch gesture were further analyzed according to their importance in predicting that gesture. From the importance ranking, longitudinal velocity appears to be the most useful attribute in describing the stroking gesture. The shaking gesture did not have a single salient attribute, perhaps because it was delivered from multiple directions and varied velocities. The attributes of contact area, contact duration, and longitudinal velocity were relatively more important. The holding gesture could be identified by longitudinal and absolute velocities with both lower amplitudes. For the tapping gesture, contact duration could be important in identifying it, which should be shorter than other gestures.

Classification Amidst Gestures, Messages, and Individuals

In Figure 5, the contact attributes are shown to robustly classify touch gestures, delivered messages, and individual touchers at accuracies better than chance, which is 25%, 14.3%, and 10% respectively. For gesture prediction, the accuracy was 87% when the mean values of contact attributes were used as predictors (Figure 5A). The prediction accuracy slightly increased to 92% when all relevant features were used as more information was included, and was around 86% when predicted by the time-series data. In classifying delivered emotional messages, the accuracy was 54%, 57%, and 55%, for the three respective feature classes (Figure 5C). Moreover, in classifying the individual touchers, the accuracies were 56%, 72%, and 77%, respectively. For the importance ranking of the contact attributes, those of longitudinal velocity, contact duration, and contact area were typically more important.

FIGURE 5

FIGURE 5. Classification of touch gestures, delivered messages, and toucher individuals using the mean value, all relevant features, and time-series data of contact attributes, respectively. The accuracy in prediction of (A) touch gestures, (C) delivered messages, (E) toucher individual are shown, as well as the importance of particular contact attributes in classifying (B) touch gestures, (D) delivered messages, (F) toucher individual. Numbers and colors in confusion matrices represent the prediction percentage. In the importance plots, the diamonds denote means; points denote importance values from 100 repetitions of classification.

Experiment 2: Technical Validation on the Visual Tracking Method

The second experiment was designed to validate the effectiveness of the 3D visual tracking system in measuring controlled human movements against those from independent devices, including an electromagnetic motion tracker, sensorized pressure mat, and laser displacement sensor. These techniques are used commonly in haptics studies (Silvera-Tawil et al., 2014; Jung et al., 2015; Hauser et al., 2019a; Xu et al., 2020; Lo et al., 2021; Xu et al., 2021a). In this experiment, the observed contact attributes were compared within controlled touch conditions, e.g., stroking in different directions at preset velocities, pressing with different parts of the hand varying in contact area, and tapping at different depth magnitudes.