1 Introduction

According to the World Health Organization (WHO), in 2019, over 466 million people (5% of the world’s population) had disabling hearing loss [1]. Correspondingly, the 2011 Iranian National Census reported that about 135,000 Iranians had some kind of hearing or speech difficulty [2]. This sizeable community uses sign language (SL) as one of its most comprehensive means of communication [3]. SL is a language that uses hand gestures and non-manual signs (such as facial expressions) to convey messages.

A rich and diverse research literature on automatic sign language recognition is to be expected considering the Deaf community's population, cultural diversity, and needs. One of the first works in this field was conducted by Kim et al. in 1996 [4]. They used a pair of Data-Gloves and a fuzzy min–max neural network for the online classification of Korean Sign Language. Liang and Ouhyoung (1998) [5] concentrated on making sign recognition real-time, continuous, and functional for large vocabularies. This research also utilized Data-Gloves for data acquisition and Hidden Markov Models (HMMs) as the recognition algorithm. The authors used features such as posture, orientation, and movement to recognize Taiwanese signs and reached an 80.4% average recognition rate. The main concern of Vogler and Metaxas' research in 1999 [6] and 2001 [7] was developing scalable solutions for automatic American Sign Language recognition. They referred to Liddell's research on SL phonology and used parallel HMMs to detect each sign, applying multi-camera processing systems to extract arm movements. Also, in 2001, Kim and Chien [8] used Data-Gloves and HMMs for hand gesture recognition. They decomposed gestures into several "strokes", such as right-to-left or clockwise movements, and used them as phonemes to recognize the defined gestures, reporting a 96.88% recognition rate. The major contribution of Yang et al.’s research in 2009 [9] was designing a threshold model within a Conditional Random Fields framework to increase its recognition rate; they used image processing techniques for data acquisition. In a later 2009 study, Cho et al. [10] used semi-Markov models for variable-length signs. In a 2016 study, Paudyal et al. [11] proposed a wearable platform for sign recognition called SCEPTRE. SCEPTRE used two Myo armbands [12] to collect electromyography, gyroscope, and accelerometer data and then utilized Dynamic Time Warping (DTW) algorithms to classify the performed signs. In recent years, some researchers have used deep neural networks for sign recognition, such as Cui et al. in 2019 [13]. As can be seen, there is no single widely accepted approach in the field, and existing studies have used different algorithms and data collection methods. Moreover, one of the greatest problems in this field is the extensibility of these algorithms: most of these works have been tested on only limited sets of signs.

Some researchers have focused on designing SL-based or gesture-based Human–Robot Interaction (HRI). The most significant studies addressing this issue are as follows. In 2010, Nandy et al. developed an Indian Sign Language based HRI for HOAP-2, an advanced humanoid robotic platform [14]. The system employed several image-processing techniques to extract features from videos of the user's hands. Another Indian Sign Language based HRI was developed in 2017 by Baranwal et al. [15]. This research employed NAO robots as its target platform and used multiple algorithms for sign recognition. After detecting the signed commands, the researchers used NAO's API and a MATLAB program to engage the user and the robot in a conversation with predefined sentences. In 2015, Russo et al. developed a novel telecommunication system for deaf-blind users [16]. The system consists of a robotic hand on one end and a Kinect sensor on the other. The robotic hand imitates the hand gestures perceived via the Kinect sensor, allowing the deaf-blind user to understand the sign by touching it.

Gesture-based HRIs for service robots have a richer research background. One of the first works on this subject was Waldherr et al.'s paper on developing an HRI for a service robot, AMELIA, in 2000 [17]. After implementing their system, the robot was able to detect and obey several gestures common to service robots (Stop, Follow, etc.). Xiao et al.'s notable 2014 research [18] used a combination of a CyberGlove and a Kinect sensor as the data acquisition setup and various KNN classification algorithms (including Large Margin Nearest Neighbor) to facilitate upper-body gesture-based interaction between a human user and a humanoid robot, Nadine. The research covered many kinds of interaction, including shaking hands and reacting to user actions such as drinking, reading, etc. The extensibility problem of SL recognition algorithms is also common in this field, as these proposed HRIs were all created and tested on limited sets of data.

To solve the extensibility problem in SL-based HRIs, we believe that such an HRI should have the following properties: (1) the recognition module must adopt a more extensible structure, with algorithms capable of learning new signs, and (2) a Learning from Demonstration (LfD) routine must be added so that the robot can learn new signs from demonstrations. To this end, in this paper we present a solution to improve human–robot interaction and enhance Deaf users' experience by designing and implementing an architecture through which robots learn meaningful gestures. A combination of LfD and one-shot learning techniques in the architecture's design enables an SL-based HRI to extend its SL vocabulary (in both recognition and regeneration). This technique would (hopefully) help develop an appropriate machine-learning architecture for social Human–Robot Interaction. Unlike typical deep learning settings that require large amounts of training data, the LfD-based algorithm used in this study works appropriately with a limited amount of input data.

LfD has shown promising results in aiding robotic learning in different areas (such as teleoperation [19,20,21], rehabilitation [22, 23], robotic surgery [24,25,26], industrial assembly [27], and navigation [28,29,30]), and it has had suitable effects on the quality of social robots' task performance [31, 32]. The central LfD design challenge is how to generalize and learn new policies from a small amount of new data. Good examples of LfD applications can be seen in Calinon et al.'s research [33,34,35,36], where different approaches are tried to teach various robots to recognize and regenerate different kinds of gestures. Another example can be found in [35], where Calinon and Billard use HMMs to teach a robot to recognize and reproduce the English alphabet written by hand movements. Like SL recognition, LfD has no commonly accepted algorithm or method, and completely different algorithms (from probabilistic methods to neural networks) have been used for similar problems. In recent years, however, meta-learning algorithms (such as one-shot learning) have become more popular [37,38,39]. The research by Finn and colleagues [37] is one of the first examples of using meta-learning techniques in LfD to teach a robot manipulator various tasks (including reaching, pushing, and placing) in different conditions. Ref. [39] is another, more recent example, in which a robot is taught to classify objects after seeing only a few examples.

In the following sections, we first introduce the robotic platform used in this work. Then, we design the LfD architecture based on the robot's characteristics and implement it using neural networks. Lastly, we discuss the results, point out some limitations of the study, and suggest directions for future work in line with the current study.

2 Methodology

2.1 RASA Robot

The use of social robots for children with special needs has increased over the last decade [40,41,42,43,44]. RASA is a novel social robotic platform whose purpose is to facilitate teaching Iranian Sign Language (ISL) to deaf and hard-of-hearing Iranian children (Fig. 1). The robot features a cartoon-like face and an attractive exterior, and its interaction modules were designed based on the requirements of child–robot interaction. With 32 Degrees of Freedom (29 DoF in the upper body), active fingers, and an expressive face, RASA can perform comprehensible ISL signs [44, 45]. At the beginning of the current study, RASA had an ISL sign library of about 60 signs.

Fig. 1 The RASA Robot

A customized Cognitive Architecture (CA) has also been developed for RASA [46]. The CA has a highly specialized and modular structure. It has four main modules:

  • Perception Unit: responsible for receiving and processing data from the environment.

  • Logic Unit: plans for and decides the desired outputs based on the perceived data.

  • Action Unit: executes the desired outputs planned by the Logic unit.

  • Memory Unit: simultaneously acts as a central junction for data transferring between other units and as a store for structured object-oriented learned data.

In this way, the CA can interact reciprocally with the user through the Perception Unit (as input) and the Action Unit (as output). An overview of the designed CA is shown in Fig. 2 [46]. We used this architecture as a framework for programming the HRI.

Fig. 2 The general overview of RASA's cognitive architecture [46]
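To make the data flow between these four units more concrete, the following minimal Python sketch illustrates one possible way they could be wired together. The class and method names are our own illustrative assumptions, not the actual implementation of RASA's cognitive architecture in [46].

# Hypothetical sketch of the four-unit data flow (names are ours; the real CA
# implementation in [46] is considerably more elaborate).

class MemoryUnit:
    """Central junction between units and store for structured learned data."""
    def __init__(self):
        self._store = {}                      # e.g. sign name -> learned representation
    def write(self, key, value):
        self._store[key] = value
    def read(self, key):
        return self._store.get(key)

class PerceptionUnit:
    """Receives and pre-processes data from the environment (e.g. glove frames)."""
    def perceive(self, raw_input):
        return {"features": raw_input}        # placeholder pre-processing

class LogicUnit:
    """Plans and decides the desired output from the perceived data and memory."""
    def decide(self, percept, memory):
        known = memory.read(percept["features"])
        return {"action": "perform_sign", "payload": known}

class ActionUnit:
    """Executes the planned output (e.g. plays back a joint trajectory)."""
    def execute(self, command):
        print("executing", command["action"], command["payload"])

# One reciprocal interaction cycle: perceive -> decide (via memory) -> act.
memory, perception, logic, action = MemoryUnit(), PerceptionUnit(), LogicUnit(), ActionUnit()
action.execute(logic.decide(perception.perceive("hello"), memory))

In this toy cycle, the Memory Unit mediates between what is perceived and what is acted out, mirroring its role as the central junction described above.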

2.2 Overview of the LfD Architecture

As mentioned, RASA is a teaching-assistant robot used to help teach ISL to deaf children. It can be employed in a variety of scenarios and interact in many different ways with children or teachers. In this study, we do not restrict the possible HRI experiences, but we assume that in all cases the robot needs to understand the signs performed for it and answer back with correct, distinguishable signs. Our aim is to design an LfD architecture that allows the robot to extend its limited vocabulary.

To teach RASA a new sign, the teacher wears the data glove and performs the sign (at least once) while passing the corresponding word to the robot. A pre-trained neural network converts the performed sign into an embedding vector, and new signs are then recognized by comparing them to these stored embedding vectors. This constitutes the learning process for the sign recognition part. Sign imitation is done by mapping the performed sign to the robot's kinematics in a way similar to [47] (see Fig. 3).

Fig. 3 An overview of the proposed LfD architecture. A user gives a sign's name to the robot and performs that sign using a data glove. The architecture converts the performed sign into a State Image and feeds it to its recognition and regeneration modules. The recognition module passes the incoming State Image through a pre-trained convolutional neural network and stores the resulting vector, alongside the given name, in the robot's Long Term Memory (see Fig. 2). The regeneration module maps the State Image to the robot's kinematics and stores the output trajectory in the memory. This is the learning process; afterwards, the robot can retrieve the trajectory or the embedded vector to recognize or regenerate that sign

We use one-shot learning techniques to pre-train the neural network. The network's output is an embedding vector rather than a categorical label, which enables it to handle new categories outside the training set. During pre-training, we expose the network to a diverse dataset containing many kinds of features and force it (with the aid of a loss function) to map the input data into clusters with sufficient margins between them. In this way, the network maps new signs to new vectors that, hopefully, form new clusters, and classification can then be done by another criterion such as Nearest Neighbor.
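As a concrete illustration of this recognition scheme, the following minimal Python/NumPy sketch shows how a sign learned from a single demonstration could later be recognized by nearest-neighbor comparison of embedding vectors. The function and variable names are our own assumptions; embed stands for the pre-trained network described in Sect. 2.3.3, and memory for the sign entries kept in the robot's Long Term Memory.

import numpy as np

def cosine_distance(x, y):
    """d(x, y) = 1 - x_hat . y_hat (the same measure used in Sect. 2.3.3)."""
    return 1.0 - float(np.dot(x / np.linalg.norm(x), y / np.linalg.norm(y)))

def learn_sign(name, state_image, embed, memory):
    """One-shot learning: embed a single demonstration and store it under its name."""
    memory[name] = embed(state_image)          # 512-d vector from the pre-trained CNN

def recognize_sign(state_image, embed, memory):
    """Classify a new performance by its nearest stored embedding (Nearest Neighbor)."""
    query = embed(state_image)
    return min(memory, key=lambda name: cosine_distance(query, memory[name]))

Here learn_sign is the one-shot learning step, and recognize_sign returns the stored name whose embedding is closest to the new performance.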

2.3 Implementation of the LfD Architecture

2.3.1 Data Collection

We chose a Neuron Lite glove from Noitom Ltd. for this study [48]. It is a single-arm sensor glove with six sensors, each containing a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer. The sensors are located on the middle of the arm, on the wrist, on the back of the hand, and on the tips of the thumb, index, and middle fingers of the right hand, so the glove can capture the movements of the right arm and these three fingers. The glove's outputs are joint positions and rotations. It connects to proprietary software called "Axis Neuron" [49], which can save motions (in various file formats) for offline use and can also broadcast motion data (in BVH format) in real time over various communication protocols.
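For offline processing, the exported recordings can be read with a generic BVH reader. The short Python sketch below only parses the MOTION block of a BVH file; the mapping from channel columns to specific joints depends on the exported skeleton and is omitted. This is a minimal reader written for illustration, not part of the Axis Neuron software.

import numpy as np

def load_bvh_motion(path):
    """
    Read the MOTION block of a BVH file and return (frame_time, frames),
    where frames is a (num_frames, num_channels) float array. The skeleton
    HIERARCHY block is skipped; only the raw per-frame channel values are kept.
    """
    with open(path) as f:
        lines = f.read().splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip() == "MOTION")
    num_frames = int(lines[start + 1].split(":")[1])
    frame_time = float(lines[start + 2].split(":")[1])
    frames = np.array([[float(v) for v in line.split()]
                       for line in lines[start + 3:start + 3 + num_frames]])
    return frame_time, frames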

In the next step, we chose 16 ISL signs. All of the selected signs were distinguishable based on the features that the 3-finger data glove could extract. We also aimed to cover different categories of signs: static signs, dynamic signs with simple motions, and signs with periodic motions. The selected signs comprise several color signs and common signs, including yes/no and greetings (Fig. 4).

Fig. 4 Iranian Signs selected for the HRI [3]

Due to COVID-19 limitations, we gathered a relatively small dataset. After teaching the signs to 12 hearing adults (unfamiliar with sign language), we asked them to perform the selected signs; each participant performed each sign 3 times, in varying order. We recorded these performances with Axis Neuron and saved them to a dataset as ".BVH" files. After removing bad demonstrations, each sign retained more than 30 performances, so the final dataset contained approximately 500 demonstrations across 16 categories.

2.3.2 State Images

The recorded BVH files contain data for every joint in the participant's body over the entire performance. Since the glove only gathers data from the right arm and the performers often paused between signs, a good deal of each file's content is unnecessary. Therefore, a necessary preprocessing step is trimming and segmenting the files and selecting features.

For feature selection, we refer to Stokoe's theory [50, 51], in which every (manual) sign consists of four basic elements: hand shape, palm orientation, hand location, and movement (changes in the first three elements). Therefore, unlike the phonemes of a spoken language, the phonemes of a sign occur simultaneously. Since there are not enough studies on the structure of ISL, the current study assumes that Stokoe's theory extends to ISL. Among all the possible features of these elements, we chose:

  • Hand location (a height-normalized 3d vector from the shoulder toward the hand in the Cartesian coordination).

  • Palm orientation (as a quaternion).

  • Handshape (normalized angles representing the fingers' flexion).

In the next step, we scaled and resampled all demonstrations in time so that each demonstration could be represented by a 12 × 50 matrix. These image-like matrices are called State Images [52, 53]. Figure 5 shows their structure and a sample State Image of the sign “Orange”. These images are the inputs of our models.

Fig. 5 State Image. a The structure of the State Images. b A sample State Image of the sign “Orange” [54]. In the "Orange" sign, the handshape and palm direction are fixed, so the last 9 rows of the image do not show significant changes over time. The thumb and index fingers are closed, so the 7th and 8th rows (from the top) are slightly darker than the rows beneath them. The hand moves in a circle in the x–y plane, so the sinusoidal movement in the 1st and 2nd rows is clear, while there is no change in the 3rd row (associated with the z-axis)
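A minimal sketch of how such a State Image could be assembled is given below. The exact split of the 12 rows among hand location, palm orientation, and finger flexion, as well as the use of linear resampling, are our assumptions for illustration rather than the exact pre-processing of [52, 53].

import numpy as np

N_FRAMES = 50     # resampled time steps (columns of the State Image)
N_FEATURES = 12   # rows: hand location + palm quaternion + finger flexions (assumed split)

def resample(series, n=N_FRAMES):
    """Linearly resample a 1-D feature series to a fixed number of frames."""
    old_t = np.linspace(0.0, 1.0, len(series))
    new_t = np.linspace(0.0, 1.0, n)
    return np.interp(new_t, old_t, series)

def build_state_image(location_xyz, palm_quat, finger_flexion):
    """
    location_xyz   : (T, 3) height-normalized shoulder-to-hand vectors
    palm_quat      : (T, 4) palm-orientation quaternions
    finger_flexion : (T, k) normalized finger-flexion angles
    Returns a (12, 50) State Image: one row per feature, one column per frame.
    """
    features = np.hstack([location_xyz, palm_quat, finger_flexion])   # (T, 12) expected
    assert features.shape[1] == N_FEATURES, "row layout must add up to 12 features"
    return np.vstack([resample(features[:, i]) for i in range(N_FEATURES)])

Resampling every feature channel to 50 frames means that demonstrations of different durations all yield matrices of the same size.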

2.3.3 Network Architecture

To classify the performed signs, we used a simple Convolutional Neural Network (CNN) to map each State Image to a 512-dimensional vector. The proposed structure is shown in Fig. 6.

Fig. 6 The chosen CNN architecture [55]

The chosen structure was kept as simple as possible so that it could be trained in the meta-learning phase with our limited dataset of ~500 signs. Since our purpose is an expansible vocabulary, the network maps state images to an embedding vector instead of returning a categorical (one-hot) result.
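Since the exact layer configuration of Fig. 6 is not reproduced here, the following Keras sketch only conveys the general idea of a small convolutional embedding network that maps a 12 × 50 State Image to a 512-dimensional vector; the layer types and sizes are placeholders and do not match the parameter count of the published architecture.

from tensorflow import keras
from tensorflow.keras import layers

def build_embedding_cnn(input_shape=(12, 50, 1), embedding_dim=512):
    """Small CNN mapping a State Image to a 512-dimensional embedding vector."""
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(embedding_dim)(x)   # an embedding, not a softmax over classes
    return keras.Model(inputs, outputs, name="state_image_embedder")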

The proposed architecture has 2,462,912 trainable parameters, which is very large for our limited dataset. We therefore chose a Siamese network [56] with a cosine triplet loss [57] to train the CNN with this small amount of data. In this setup, three parallel CNNs share their weights; we feed three images to them simultaneously and compute the triplet loss of the vectors they produce. The first image is called the anchor image, the second (the positive image) must have the same label as the anchor, and the third (the negative image) must belong to a different class. Using this method, we can form more than eight million training triplets.

The loss function is as follows (Eq. 1):

$$loss\left(a,p,n\right)=\mathrm{max}\left(0,d\left(a,p\right)-d\left(a,n\right)+\alpha \right)$$
(1)

where \(d(x,y)\) is the distance between x and y (the predicted vectors of images X and Y, respectively), calculated as (Eq. 2):

$$d\left(x,y\right)=1-\widehat{x}\cdot \widehat{y},\quad \widehat{x}=\frac{x}{\left|x\right|}$$
(2)

This loss function tries to maximize the distance between anchor and negative images while keeping anchors and positives close together. α is a margin parameter that prevents convergence to a trivial solution (mapping all images to zero vectors). Following the results in [58], we chose α = 0.2.
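A minimal Keras/TensorFlow sketch of Eqs. (1) and (2), wired into a three-branch Siamese model with shared weights, could look as follows. Only the two loss functions follow the equations directly; the model-building helper and all names are our own assumptions (the embedder argument stands for the embedding network sketched above).

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

ALPHA = 0.2  # margin alpha, chosen following [58]

def cosine_distance(x, y):
    """d(x, y) = 1 - x_hat . y_hat  (Eq. 2), computed per batch element."""
    x_hat = tf.math.l2_normalize(x, axis=-1)
    y_hat = tf.math.l2_normalize(y, axis=-1)
    return 1.0 - tf.reduce_sum(x_hat * y_hat, axis=-1)

def cosine_triplet_loss(anchor, positive, negative, alpha=ALPHA):
    """loss(a, p, n) = max(0, d(a, p) - d(a, n) + alpha)  (Eq. 1)."""
    return tf.maximum(0.0,
                      cosine_distance(anchor, positive)
                      - cosine_distance(anchor, negative) + alpha)

def build_siamese(embedder, input_shape=(12, 50, 1)):
    """Three inputs share one embedding CNN; the model's output is the per-triplet loss."""
    a_in = keras.Input(shape=input_shape, name="anchor")
    p_in = keras.Input(shape=input_shape, name="positive")
    n_in = keras.Input(shape=input_shape, name="negative")
    loss = layers.Lambda(lambda t: cosine_triplet_loss(t[0], t[1], t[2]),
                         name="triplet_loss")(
        [embedder(a_in), embedder(p_in), embedder(n_in)])
    return keras.Model([a_in, p_in, n_in], loss, name="siamese_triplet")

Because the model's output is the per-triplet loss itself, it can be optimized directly; a training sketch using this model follows in Sect. 2.3.4.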

2.3.4 Training

The designed architecture was implemented using the Keras library [59]. We used stochastic gradient descent (without momentum and with a learning rate of 0.001) as the optimizer, and the Google Colab platform [60] was used to train the model (batch size = 128). Moreover, we used the following methods to improve performance and prevent overfitting (a minimal sketch of this setup follows the list):

  • Reducing the learning rate by 80 percent after 4 steps of no improvement in validation loss;

  • Implementing early stopping with a patience of 10 steps on validation loss; and

  • Feeding negative images based on their current distance to the anchor image (i.e., selecting the negatives nearest to the anchor).
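A hedged sketch of this training setup is given below, assuming the Siamese model from the previous subsection (siamese) and State-Image triplet arrays (anchors, positives, negatives and the corresponding validation arrays); these names and the epoch count are our own placeholders. The hard-negative selection in the last bullet would in practice require rebuilding the triplets between epochs, a step omitted here.

import numpy as np
from tensorflow import keras

siamese.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.0),
    # The model output is already the per-triplet loss, so the compiled loss
    # just averages it and ignores the dummy targets.
    loss=lambda y_true, y_pred: keras.backend.mean(y_pred),
)

callbacks = [
    # Reduce the learning rate by 80% (factor 0.2) after 4 steps without val_loss improvement.
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=4),
    # Early stopping with a patience of 10 steps on val_loss.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
]

siamese.fit(
    [anchors, positives, negatives], np.zeros(len(anchors)),
    validation_data=([val_anchors, val_positives, val_negatives],
                     np.zeros(len(val_anchors))),
    epochs=100,          # upper bound; early stopping decides the actual length
    batch_size=128,
    callbacks=callbacks,
)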

To check the extensibility of the model, we trained it 16 times, each time excluding one class from the training process to be used as test data (~6% of all the data). We also separated ~20% of the remaining data (equally distributed over the 15 remaining classes) as validation data.

2.3.5 Evaluation

We evaluated the model's accuracy using n-way accuracy (n = 4, 8, 12, and 16). In this method, a testing procedure is repeated 1000 times. In each round, we randomly select an anchor image from the target data and then randomly draw n other images as benchmark images. These benchmark images consist of one positive image (an image with the same label as the anchor) and n−1 negative images (images with labels different from the anchor's). We compute the similarity between the anchor image and each benchmark image in terms of \(d(x,y)\) and count the round as a "True" case if the benchmark image nearest to the anchor is the positive one; otherwise, it is a "False" case. After completing the procedure, the accuracy is computed by dividing the number of "True" cases by the number of rounds (i.e., 1000). To better assess the model's extensibility, we performed this test on two groups: first, the validation data from the classes present in the training data, and second, the test data (from the excluded class). The test was performed four times (once per n) for each group in all 16 training sessions.
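The following NumPy sketch spells out this n-way testing procedure; the function and variable names are ours, and embeddings and labels stand for the network outputs and sign labels of the pool from which anchors and benchmark images are drawn.

import numpy as np

def cosine_distance(x, y):
    x_hat, y_hat = x / np.linalg.norm(x), y / np.linalg.norm(y)
    return 1.0 - float(np.dot(x_hat, y_hat))

def n_way_accuracy(embeddings, labels, n=4, rounds=1000, rng=None):
    """
    embeddings : (N, 512) array of network outputs for the evaluated pool
    labels     : (N,) array of sign labels
    Returns the fraction of rounds in which the positive benchmark image is
    the nearest one (in cosine distance) to a randomly drawn anchor.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    true_cases = 0
    for _ in range(rounds):
        a = rng.integers(len(labels))                         # anchor index
        same = np.flatnonzero(labels == labels[a])
        pos_pool = same[same != a]                            # positives exclude the anchor itself
        neg_pool = np.flatnonzero(labels != labels[a])
        bench = np.concatenate([rng.choice(pos_pool, 1),
                                rng.choice(neg_pool, n - 1, replace=False)])
        dists = [cosine_distance(embeddings[a], embeddings[b]) for b in bench]
        true_cases += int(np.argmin(dists) == 0)              # index 0 is the positive image
    return true_cases / rounds

For example, n_way_accuracy(embeddings, labels, n=4) would estimate a 4-way accuracy in the sense used in Table 1.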

3 Results and Discussions

The main result of implementing the proposed structure is an LfD plugin that enables the robot to learn a new sign just by watching it once. The performance of the one-shot sign recognition module is shown in Table 1; the full results are presented in Table 2 in the Appendix.

Table 1 Summary of the recognition module's accuracy (percentage), including the means and standard deviations of all trials (for more details, see Table 2 in the Appendix)

It can be seen that, as a general trend, the accuracy in both trial modes decreases when n in the n-way procedure increases, which is expected but may raise concerns about performance after the sign library is expanded. There was no significant difference (p > 0.05 in all cases) between the mean accuracies in the test and training trials, which is quite appropriate. In terms of the Standard Deviations (SD), however, we observed some meaningful differences. In general, the SD of the training-data trials is lower, which shows the robustness of the architecture in the absence of some data in the meta-learning phase. The test-data trials show a higher SD, which is expected because the module is tested against signs it has not seen before. Hence, some signs are less likely to be recognized by the module (such as Pink and Purple), and some (such as Black and Orange) are more recognizable. We assume this is due to unique features in the more recognizable signs (i.e., the unique hand movements in Black and the unique hand shape in Orange) and the lack of similarity of the less recognizable signs to the training signs.

While using one-shot learning strategies in LfD is not unheard of [61, 62], research on learning meaningful gestures is so uncommon that we have yet to come across any. Therefore, due to the lack of similar papers in this field, a thorough comparison between the findings of this study and other related works is not feasible. Nevertheless, the expansibility and promising results of the one-shot Learning from Demonstration technique are in line with other research [61, 62] and are the main achievements of applying such machine learning algorithms in social Human–Robot Interaction. In terms of accuracy, the implemented system is weaker than much of the state-of-the-art research [63,64,65,66]; however, given the small size of the dataset and other limitations, this is justifiable and promising. It should be noted that a similar level of accuracy is reported in recent papers in the literature (such as [13]).

4 Limitations and Future Work

This study had serious limitations (partly because of the COVID-19 pandemic) in the dataset size (fewer than 600 demonstrations), diversity (just 16 signs), and gathering process (lack of Deaf community presence and a limited data glove). Despite these limitations, the results show promise that, with a richer and more diverse base dataset, we can reach an accurate and extensible LfD architecture. We also would have liked to conduct in vivo HRI experiments to investigate users' opinions about this framework and their experience while using it, but, due to COVID-19 restrictions, such experiments were not possible.

The ultimate goal of this research line is to enable RASA to act as a teaching assistant robot in SL teaching environments. The present study followed this line; therefore, to make RASA more capable of learning and teaching new signs and of interacting with people through SL, we recommend extending the current study along the following routes:

  • Removing the limitations: building a richer, more diverse dataset with more signs, more demonstrations, more demonstrators familiar with sign language, and better equipment to record more complex signs; and conducting HRI-based trials to measure the acceptability, ease of use, and desirability of the results.

  • Enhancing the recognition module: making the module more accurate or more extensible by using better data or by trying other algorithms and techniques (such as data augmentation).

  • Enhancing the imitation module: implementing methods with more natural results than simply mimicking the movements (such as the GMM/GMR approaches in [33,34,35,36], Variational Auto-Encoders (VAE), or Generative Adversarial Networks (GAN)).

5 Conclusion

Using one-shot learning strategies and Convolutional Neural Networks, we introduced and implemented an LfD-based system for teaching new ISL signs to a teacher-assistant robot. The proposed architecture can learn to recognize and imitate a sign after just one demonstration performed with a data glove. The recognition module reached a 4-way accuracy of 70% on the test data; while this is not a very high accuracy level, it is still promising considering the small size and low diversity of the dataset used in this study. The module showed good potential for making HRIs more extensible. The main point is that the presented results were obtained from a small amount of training data, in contrast to typical deep learning algorithms, which need big datasets. The expansibility and promising results of the one-shot Learning from Demonstration technique are the main achievements of applying such machine learning algorithms in social Human–Robot Interaction.