1 Introduction

As studies in cognition shift towards embodiment, researchers are considering the role that interaction with the environment plays in models of cognitive processes (Anderson 2003). According to research in embodied cognition, many cognitive processes are tightly coupled with the way the body interacts with the environment. The development of intelligence in humans depends as much on their interaction with the world in which they are embodied as it does on their individual brains (Anderson 2003). This trend is reflected in recent research implementing cognitive models in artificially intelligent systems. Sandini et al. (2007) implemented a child-sized humanoid robot with 53 degrees of freedom, which could improve its skills through interacting with its environment.

In this paper we describe a cognitive model for a marionette interacting with humans. Our marionette interacts not with the physical world as in Sandini et al. (2007), but with humans in a social context. The marionette’s interaction with humans takes the form of gesture: it detects human gestures, reasons about them, and responds by performing a sequence from its pre-designed set of gestures. The underlying cognitive model is based on this perception-reasoning-action cycle. The design process included abstracting a participant’s behavior into meaningful gestures as well as designing and implementing marionette actions. To recognize complex human actions as discrete gestures, a gesture elicitation study was performed to classify both human and marionette gestures.

This paper first introduces the willful marionette as an art project, followed by an overview of related work on gesture design and interaction. The paper then presents the user experience and cognitive system design of the marionette. The cognitive model comprises three modules: gesture recognition, marionette gesture selection, and marionette gesture performance. The last section concludes with a summary of and reflection on the effectiveness of the cognitive model in provoking a human response through continued interaction, as well as a discussion of how participants ascribed human-like intelligence to the marionette.

2 The Willful Marionette

The synthesis of a traditional art form, the marionette, with a modality of human-machine gesture interaction leads to a new kind of creative computation. The marionette evokes a creative dialogue of gestures between humans and machines. A marionette is a string-operated puppet, a traditional art form that exists in many cultures (Chen et al. 2005). Modern marionette and puppetry artists often work with engineers and researchers to explore the possibility of integrating robotics into marionette performance (Yamane et al. 2004). In previous work on developing robotic marionettes, researchers explored the possibility of infusing robotic systems into the traditional art form of marionette theater (Chen et al. 2005) and of using such systems to evoke and stimulate public interest (Robert et al. 2011; Sidner et al. 2005; Speed et al. 2014).

The design and construction of the willful marionette was a collaboration between artists Lilla LoCurto and Bill Outcault, and the Interaction Design Lab at UNC Charlotte. The artists’ previous work focused on the human body as a three-dimensional form, re-representing it in ways that draw attention to the frailty of human physicality.

The willful marionette began as a 3D scan of Bill Outcault’s body, which was then 3D printed in segments and constructed as a marionette. Little Bill, shown in Fig. 1, stands about 3 feet tall. To create the marionette, the whole body was segmented into 17 parts. Segments are connected with hinge and socket joints based on the corresponding joints of the human body. Thirteen of the joints are connected to strings to enable movement. The strings are controlled by motors mounted on a frame above Little Bill, which extend and retract the strings and cause the joints to move up and down. Two Microsoft Kinects attached to the frame capture the movement of people in the area around Little Bill, allowing the marionette to respond to them. Inside Little Bill’s head, a fourteenth motor controls his eyelids.

Fig. 1. The Willful Marionette, aka “Little Bill”, a 3-foot tall naked, blue cognitive agent (Color figure online)

Little Bill provokes an interactive relationship between art object and audience member. Historically, the audience member is a passive viewer, watching a puppeteer perform with a marionette in a theatrical setting. With an interactive marionette, audience members become active participants in the performance, and the theatrical performance is replaced with an interactive dialogue. Without the context of a theatrical performance to capture viewers’ attention, the puppet must act as a provocateur to engage the viewer. Once engaged, participants continue to interact with the willful marionette, evoking the movements and reactions of the puppet. Interaction design in this context is about the interaction between participants and the 3D marionette. Participants differ from traditional conceptions of the “user” in HCI as they are not acting to further some goal but instead participating in a dialogue with an embodied cognitive agent. The marionette also differs from traditional conceptions of an “interface” as it is both socially and physically embodied within the same space as the humans with whom it is interacting.

Interaction with the willful marionette is entirely based on the human body: both that of the participant and the machine. The marionette creates a dialogue of gestures that provoke movement and evoke emotional response. This form of dialogue both engages the audience and, as we determined through our user studies, can provoke a strong perception that the marionette possesses a form of intelligence.

3 Background

The goal of the marionette project was to design an interactive system that engaged people in a gestural dialogue. Doing so requires an understanding of the role of gestures in cognition (Baber 2014; Maher et al. 2014; Tversky et al. 2014). Past HCI research has striven to develop common sets of gestures for interaction (Card 2014; Jetter 2014; Karam 2005), and to evaluate the ease and effectiveness of those gestures for performing tasks (Ackad et al. 2014; Vanacken et al. 2014). In order to elicit a set of popular gestures (Seyed et al. 2012), several researchers have explored the design of gestures that people can easily learn or discover (Cartmill et al. 2012; Karam 2005).

Research in computational models of emotion (Marsella and Gratch 2009) and affective computing (Picard and Picard 1997) is also relevant to the marionette project, given the primacy of emotion in the body language that makes up a great deal of gestural interaction between people.

The Viewpoints AI system (Jacob et al. 2013), while virtual rather than physical, comes closest to the marionette project among past work of which the authors are aware. Viewpoints is a Kinect-based projection of a humanoid form that communicates with human dancers through the medium of dance: it will dance a duet with them, similar to the willful marionette’s gestural dialogue. One point of contrast in the dialogues the two systems construct is that Viewpoints begins by mimicking the human dancer to establish synchronicity, while the willful marionette strictly avoids mimicry to establish its autonomy and otherness.

Research into robotic marionettes has primarily focused on their application in the performing arts. Hoffmann (1996) developed a human-scale marionette that, controlled by a human operator, could enact a dance performance using motions based on human dancers. Hemami and Dinneen (1993) proposed a strategy for stabilizing a marionette through a system of unidirectional muscle-like actuators, in which the positive force applied to the actuators is analogous to the firing rate of natural muscles. Yamane et al. (2004) controlled the upper body of a marionette to perform dances using captured human motion. These projects show how a marionette performance can be automated, but not made autonomous: they do not consider the marionette as an embodied actor interacting with and responding to human participants.

4 Gesture Design and Implementation

The first step in designing gestures for an interactive system is to understand the design space of possible gestures. Various design methods, such as bodystorming, role-playing, personas and image boards, were used in the early stages of the project to explore possible avenues for gestural interaction with the marionette. These methods gave the design team the opportunity to explore the possibilities of both the hardware and software technologies that could be used in the development phase. In bodystorming and role-playing, members of the design team played the role of either a human participant or the marionette, and acted out gestural dialogues. This enabled the design team to better understand how an embodied gestural interaction with a marionette could proceed.

Based on the initial prototypes and the results of bodystorming, a set of preliminary gestures was selected and implemented for the marionette. Each gesture was defined by specifying each motor’s movement over time. This definition allowed us to write and store gestures for selection by the cognitive system, i.e. when a marionette gesture is selected in response to a perceived human gesture.
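To make this concrete, the sketch below shows one way such a gesture definition could be stored and played back: a named sequence of timed keyframes, each giving normalized target positions for a subset of motors. The data structures, motor indices, timings, and the send_motor_command stub are illustrative assumptions for exposition, not the installation’s actual control code.

```python
import time
from dataclasses import dataclass

# Hypothetical encoding of a stored marionette gesture: each keyframe
# gives a target string position (0.0 = fully retracted, 1.0 = fully
# extended) per motor at a time offset, in seconds, from gesture start.

@dataclass
class Keyframe:
    t: float                   # seconds from gesture start
    targets: dict[int, float]  # motor index -> normalized string position

@dataclass
class Gesture:
    name: str
    keyframes: list[Keyframe]

# Example: a "quizzical look" that tilts the head and lifts one shoulder.
QUIZZICAL = Gesture("quizzical", [
    Keyframe(0.0, {0: 0.5, 3: 0.2}),  # neutral head, shoulder down
    Keyframe(0.6, {0: 0.7, 3: 0.4}),  # tilt head, raise shoulder
    Keyframe(1.5, {0: 0.5, 3: 0.2}),  # return to neutral
])

def send_motor_command(motor: int, position: float) -> None:
    """Placeholder for the real motor controller interface."""
    print(f"motor {motor} -> {position:.2f}")

def perform(gesture: Gesture) -> None:
    """Play back a gesture by issuing motor targets at each keyframe time."""
    start = time.monotonic()
    for kf in gesture.keyframes:
        delay = kf.t - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        for motor, position in kf.targets.items():
            send_motor_command(motor, position)

if __name__ == "__main__":
    perform(QUIZZICAL)
```

Storing gestures as data rather than code in this way lets the selection component treat the gesture repertoire as a simple library to choose from.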

After the selected gestures were implemented on the marionette, a preliminary evaluation study was performed to assess and refine them. The study used the Wizard of Oz technique, since the perceptual system that would sense human gestures and map them to marionette gesture responses had not yet been developed: a human operator decided which marionette gesture to execute based on participants’ gestures. The think-aloud method was used to gather further data on how human participants perceived the interaction was progressing, and the reasons behind their gestural choices.

For this preliminary study we recruited twelve students as participants. Each participant was asked to interact with the marionette spontaneously, without being given a specific task, and to think aloud while doing so. After each session, an interview was conducted to collect additional insights about the participant’s gestural behavior. Each session was video recorded to capture the gestures performed, supplemented by notes taken by the design team.

One of the biggest challenges that distinguishes Little Bill from other gesture-based interaction systems is that participants in this context are given no instructions or tasks. The most difficult moment is the “cold start”: participants initially have no conception of the scope of the marionette’s ability to perceive or respond to them. Seeing the marionette respond to their presence typically gives participants the confidence to initiate gesturing towards it. Participants were more willing to continue the dialogue once they noticed they had Little Bill’s “attention” (i.e. eye contact). Based on this feedback we designed the interaction such that Little Bill “makes the first move” and directs its attention to a new participant. To achieve this, we added an “approach” gesture that the marionette perceives when a participant walks towards it, and a corresponding “retreat” gesture so that walking away signals to the system that the participant has lost interest in Little Bill.

The full list of participant gestures elicited by the preliminary study was: waving, bending over, approaching, walking away, getting too close, and going behind the marionette. The last two gestures are detected from the angle and distance of the participant relative to the marionette. Results of the gesture elicitation study and the interviews revealed that lifting the marionette’s head to make eye contact, turning the marionette’s body to follow the participant, and the marionette raising its hands were the three gestures that provoked the strongest emotional responses among the testers.
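The positional detections lend themselves to simple geometry on the skeleton position reported by the Kinect. The sketch below assumes a coordinate frame centered on the marionette with +z pointing in its facing direction; the thresholds are illustrative values, not those used in the installation.

```python
import math

# Hypothetical detection of the two positional gestures ("getting too
# close" and "going behind the marionette") from a Kinect skeleton
# position (x, z) in meters, in the marionette's frame.

TOO_CLOSE_M = 0.6          # closer than this counts as "too close"
BEHIND_ANGLE_DEG = 120.0   # more than this off the facing axis is "behind"

def classify_position(x: float, z: float) -> str | None:
    """Return a positional gesture label, or None if neither applies."""
    distance = math.hypot(x, z)
    # Angle between the marionette's facing direction (+z) and the
    # participant's direction, in degrees.
    angle = math.degrees(math.atan2(abs(x), z))
    if distance < TOO_CLOSE_M:
        return "too_close"
    if angle > BEHIND_ANGLE_DEG:
        return "behind"
    return None

# A participant 0.4 m directly in front triggers "too close"; one 2 m
# directly behind (negative z) triggers "behind".
assert classify_position(0.0, 0.4) == "too_close"
assert classify_position(0.0, -2.0) == "behind"
```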

4.1 Gesture Implementation

The marionette gestures were designed to convey emotions of different kinds and were divided into the following five categories:

  • Complex gestures: A subset of the marionette gestures involve large body movements that are intended to convey that the marionette is experiencing a strong emotional response. These gestures are implemented as a quickly executed series of movements across many degrees of freedom. Examples of the marionette’s complex gestures include its surprised and scared gestures.

  • Subtle gestures: Other, smaller gestures are intended to encourage continued interaction with the marionette. For instance, simply lifting the head of the marionette gives an impression of eye contact and, in our experiments, engaged participants and made them more likely to continue interacting. Similarly, a series of movements that convey a “quizzical look” from the marionette while participants wave at it can be intriguing and encourages continued interaction.

  • Attentive gestures: These gestures are a direct response to participants’ movement. For example, the marionette turns so that it tracks a participant, or turns its head such that it faces them. As another example, when participants walk away from the marionette, the marionette might shake its head as an attentive gesture to get their attention back.

  • Living gestures: These gestures are designed to convey the impression that the marionette is alive, and involve movement that is not a direct response to perceived participant action. For example, the marionette possesses a motor behind its eyes that can execute a blinking gesture, which is performed at random times. Another example of this type of gesture is a “breathing” gesture, which moves the back of the marionette up and down very slowly such that it looks like it is breathing. These gestures prevent the marionette from being completely still (a minimal scheduling sketch follows this list).

  • Restorative gestures: After performing some gestures the marionette might not be in a natural pose, or may have lost track of its exact pose due to technical limitations. To compensate for this lack of information, a restorative gesture was designed to return the marionette to its initial position. One such gesture slightly lifts the marionette off the ground and returns it to its default position.
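As a minimal sketch of how the living gestures could be scheduled, the loop below triggers blinking and breathing at random intervals so the marionette is never completely still. The interval ranges and the perform stub are illustrative assumptions rather than the deployed timings.

```python
import random
import time

# Hypothetical scheduler for the "living" gestures: blinking and
# breathing fire at random intervals, independent of any perceived
# participant action.

def perform(gesture_name: str) -> None:
    """Placeholder for the gesture execution component."""
    print(f"{time.strftime('%H:%M:%S')} performing {gesture_name}")

def living_gesture_loop(duration_s: float = 20.0) -> None:
    now = time.monotonic()
    end = now + duration_s
    next_blink = now + random.uniform(2.0, 6.0)
    next_breath = now + random.uniform(4.0, 8.0)
    while now < end:
        if now >= next_blink:
            perform("blink")
            next_blink = now + random.uniform(2.0, 6.0)
        if now >= next_breath:
            perform("breathe")
            next_breath = now + random.uniform(4.0, 8.0)
        time.sleep(0.1)
        now = time.monotonic()

if __name__ == "__main__":
    living_gesture_loop()
```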

The cognitive model for the marionette has three components: participant gesture detection, marionette gesture selection, and marionette gesture execution. Participant gesture detection uses the Microsoft Kinect and its SDK to detect human gestures and send them to the marionette gesture selection component. The selection component selects the most relevant marionette gesture to execute, and sends the corresponding action to the gesture execution component.
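The three components can be pictured as a simple pipeline. The sketch below is a minimal, assumption-laden model: the detector is simulated, and a trivial lookup table stands in for the weighted selection described in the next section.

```python
import queue
import threading

# Sketch of the three-component pipeline: a simulated Kinect detector
# feeds human gestures to a selection component, which forwards
# marionette actions to an execution component. Gesture names are
# illustrative stand-ins.

human_gestures = queue.Queue()      # detection -> selection
marionette_actions = queue.Queue()  # selection -> execution

RESPONSES = {"approach": "lift_head", "wave": "quizzical", "retreat": "shake_head"}

def detection_loop() -> None:
    """Stand-in for Kinect SDK gesture detection."""
    for gesture in ["approach", "wave", "retreat"]:
        human_gestures.put(gesture)
    human_gestures.put("STOP")  # sentinel to shut the pipeline down

def selection_loop() -> None:
    """Choose a marionette gesture for each detected human gesture."""
    while True:
        gesture = human_gestures.get()
        if gesture == "STOP":
            marionette_actions.put("STOP")
            break
        marionette_actions.put(RESPONSES.get(gesture, "blink"))

def execution_loop() -> None:
    """Stand-in for motor-level gesture execution."""
    while True:
        action = marionette_actions.get()
        if action == "STOP":
            break
        print(f"executing {action}")

if __name__ == "__main__":
    threads = [threading.Thread(target=f)
               for f in (detection_loop, selection_loop, execution_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```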

The next challenge was how to model the selection of a marionette gesture as a response to a human gesture. We developed a set of guidelines based on observations of human movement, particularly during dialogue:

  • people are always moving;

  • different people respond differently, particularly to a repeated event;

  • people may respond by starting a conversation with another person;

  • people shift their attention to a different object even when there is no obvious event.

So that the marionette’s responses do not become predictable, each participant gesture is mapped to a set of possible marionette responses from which a single response is stochastically chosen. This one-to-many relationship is used because the goal of the interaction is not to generate an expected response, but to encourage and provoke continued interaction. This is in contrast to the typical interaction design goal of learnable and predictable interaction between user and interface. The perceived autonomy of this simple random behavior is also intended to provoke the human to perceive intelligence in the marionette: human social interaction is not predictable, and systems intending to provoke dialogic interaction should be similarly opaque.

To define a set of appropriate marionette response gestures to participant gestures, the design team envisioned the set of probable emotional states that could cause the participant to perform each gesture. Since the gesture selection process cannot interpret participants’ emotional states, the set of possible response actions was designed to cover all emotional states (such as surprised, shy or shocked) that were deemed probable causes. The cognitive model assigns a probability to each emotional state a participant might be in, and selects a random gesture weighted by this probability distribution. This allows the marionette to respond in a manner appropriate to the most likely emotional state of the participant. As an example, a person who is bent over near the marionette is probably displaying curiosity or interest in it, which puts the marionette in the surprised, shocked or shy state. Table 1 shows the mappings from each human gesture that the marionette can recognize to a list of possible marionette response gestures. Each gesture in Table 1 represents an emotional state and refers to one or more implementations on the marionette.

Table 1. The mapping from human to marionette gestures.
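A minimal sketch of this weighted one-to-many selection follows. The gesture names and weights below are hypothetical stand-ins chosen for exposition and do not reproduce the actual contents of Table 1.

```python
import random

# Each recognized human gesture maps to candidate marionette responses,
# weighted by the probability of the emotional state presumed to cause
# the human gesture. All names and weights here are illustrative.

RESPONSE_MAP = {
    # human gesture: [(marionette gesture, probability weight), ...]
    "bend_over": [("surprised", 0.5), ("shocked", 0.3), ("shy", 0.2)],
    "wave":      [("quizzical", 0.6), ("raise_hands", 0.4)],
    "too_close": [("shield_face", 0.7), ("surprised", 0.3)],
    "walk_away": [("shake_head", 0.5), ("slump", 0.5)],
}

def select_response(human_gesture: str) -> str:
    """Stochastically choose a marionette response, weighted by the
    probability of each candidate emotional state."""
    candidates = RESPONSE_MAP[human_gesture]
    gestures = [g for g, _ in candidates]
    weights = [w for _, w in candidates]
    return random.choices(gestures, weights=weights, k=1)[0]

# Repeated occurrences of the same human gesture yield varied responses.
print([select_response("bend_over") for _ in range(5)])
```

Because the choice is stochastic, the same human gesture does not always produce the same marionette response, which is exactly the unpredictability the design calls for.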

The marionette gesture selection component is also responsible for deciding which participant is the current focus of the marionette’s attention. A participant’s interestingness is based on continued engagement (measured by the amount of body movement) and the order in which people approached the marionette. The marionette attends to the participant it perceives to be the most interesting, and rotates to follow their position. From the gesture elicitation study and participant interviews, it was determined that eye contact was the most important feature to participants, and resulted in the highest level of engagement. If a person is behind the marionette and no one is present in front, the marionette rotates to face them, allowing gestural interaction to continue.
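One plausible realization of this attention policy is sketched below; the scoring function and its weights are our own illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical attention scoring: interestingness combines recent body
# movement (continued engagement) with arrival order, and the marionette
# attends to the highest scorer.

@dataclass
class Participant:
    id: int
    arrival_order: int  # 0 = first to approach the marionette
    movement: float     # accumulated body movement over a recent window

def interestingness(p: Participant) -> float:
    # Movement dominates; earlier arrivals get a small bonus.
    return p.movement - 0.1 * p.arrival_order

def focus_of_attention(participants: list[Participant]) -> Participant | None:
    if not participants:
        return None
    return max(participants, key=interestingness)

people = [Participant(1, 0, 0.4), Participant(2, 1, 1.3)]
target = focus_of_attention(people)
print(f"attending to participant {target.id}")  # the more active one
```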

The marionette’s gestures were divided into two categories based on responsiveness. The first category is a set of “regular” gestures that are selected in response to participants’ gestures; the second is a set of “idle” movements that are selected when no one has interacted with the marionette for a defined amount of time. If no participants are detected, the gesture selection component triggers an idle state, during which the marionette performs subtle gestures in an attempt to engage anyone present but not detectable by its perceptual system (due to the limited range and field of view of the Kinect sensors). Idle movements are short, subtle actions designed to invite interaction, and they ensure the marionette is never still for lengthy periods.
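The idle trigger can be modeled as a simple timeout on the last detected participant gesture, as in the sketch below; the 10-second threshold and the method names are illustrative assumptions.

```python
import time

# Hypothetical idle-state trigger: the selection component tracks the
# time of the last detected participant gesture and falls back to idle
# movements after a fixed timeout.

IDLE_TIMEOUT_S = 10.0

class GestureSelector:
    def __init__(self) -> None:
        self.last_detection = time.monotonic()

    def on_human_gesture(self, gesture: str) -> str:
        """Regular path: respond to a detected participant gesture."""
        self.last_detection = time.monotonic()
        return f"response_to_{gesture}"

    def tick(self) -> str | None:
        """Called periodically; returns a short, subtle idle movement
        once no one has been detected for IDLE_TIMEOUT_S seconds."""
        if time.monotonic() - self.last_detection > IDLE_TIMEOUT_S:
            return "idle_subtle_movement"
        return None

selector = GestureSelector()
print(selector.on_human_gesture("wave"))  # regular response
print(selector.tick())                    # None: not idle yet
```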

5 Evaluation of Human Response to Gesture Dialogue

Once the gestures for the marionette were selected, refined through the preliminary study, and implemented, we conducted a user study to evaluate the effectiveness of the cognitive model for gesture dialogue. This study included 13 participants (a different set of users from the preliminary study). Since the marionette is an art piece expected to be placed in an art museum, we performed the study in a gallery setting. Participants visited the marionette individually, with nobody but the experiment facilitator present. Figure 2 shows some participants interacting with the marionette.

Fig. 2. Participants interacting with the willful marionette

Participants received a short verbal introduction to the project, which explained the objective of the evaluation. The willful marionette was presented to the participants as shown in Fig. 1. Participants could see that the marionette was controlled electronically, but they were given no information about its processes for perception, reasoning and response. As in the initial user study, participants were asked to interact with the marionette while thinking aloud. Participants were explicitly not given a task to perform, as the purpose of our system is exploratory gestural dialogue. Participants were made aware that they were being video recorded. An interview was conducted after each session to collect additional insights from the participants about their experience, their expectations, the degree to which they felt that the marionette responded to their gestures, and any feedback or suggestions.

5.1 Evaluation Results

This section presents highlights and notable themes of the responses given during interviews performed in both the preliminary and the final user studies. By far the most common topic in the interviews was the participants’ interest in the marionette. All of the participants found the marionette highly provocative and interesting. The human-like features of the marionette such as blinking and breathing, its responsiveness, and the complexity of its gestures were all given as reasons for participants’ surprise, curiosity and interest. One of the participants said he hesitated in approaching the marionette due to the blinking of its eyes and the human-like nature of its movement. Our data, while small-scale, suggests that participants overwhelmingly associated the embodied form and clear responsiveness of the marionette with human behavior and assumed that it was intelligent far beyond its simple reasoning.

One of the common themes of participant responses concerned the first moments of their dialogue with the marionette. At first, most participants did not approach the marionette directly, keeping their distance and observing its idle actions. Several participants noted that at that point they did not believe the marionette would respond to them, assuming its movements were predetermined. However, when they saw the marionette’s response to their approach, they became aware of its interactive nature and began actively making various gestures. This initial surprise was mentioned as a cause of significant emotion, both positive and negative, by several users. Participants remarked that the “back and forth” resulting from this initial exploration of the marionette’s interactivity – their first gesture-based dialogue with an embodied cognitive agent – was unexpectedly engaging.

Another feature of the interaction that led participants to comment on the marionette’s intelligence was its lack of mimicry behaviors. Three participants expected that the marionette would mirror their own movements, i.e. if they raised an arm, the marionette would also raise an arm. These participants said the interaction was more interesting than they expected because this behavior was absent – a conscious decision on the part of the design team. The fact that the interaction was based on an exploratory gestural dialogue, rather than simple mimicry, helped engage these participants.

Four of the participants tried to talk to the marionette, assuming that if it was capable of human gestural interaction then it would also be capable of hearing and understanding speech. Two of these participants referred to Siri, the digital personal assistant in Apple’s iOS, and said they expected that the marionette could interact in a similar fashion.

Participants were asked to indicate which gestures they found most provocative. Three participants mentioned the blinking of the eyes during eye contact as highly provocative. Rotation of the marionette to track participants (especially when participants moved behind it) was rated as most provocative by two other participants. One participant said that arm-related gestures, such as the marionette raising its arm as if to shield its face when a participant approached, were highly provocative and caused a strong emotional response.

6 Discussion

The willful marionette is a contemporary interpretation of the marionette as a form of interactive installation based on an embodied cognitive agent. The system includes a cognitive model that perceives, reasons, and acts in order to engage humans in a gestural dialogue. By replacing the puppeteer and puppet with a cognitive agent and an animatronic marionette, the performance becomes more physically and socially engaging: users are participants in an interactive dialogue rather than an audience. The willful marionette’s control systems are deliberately exposed and clearly visible, which serves to further unnerve and fascinate users about its nature. The true unknown about the marionette is not how it is physically controlled, but the processes by which it decides to act in response to its environment. Even though its cognitive system is extremely simple, its physical embodiment in a human-like form, and the deceptively human-like behaviors it executes using that form, cause users to project onto it a significant capacity for higher-level thought. Much like the ELIZA program of the 1960s (Weizenbaum 1966), it shows that it is very easy to attribute intelligence to a system that interacts with you in the way you interact with other humans. As an art piece the willful marionette exposes a fundamental frailty of human social interaction: we can be captivated by simple randomness, so long as it is embodied to look and move like us! As an HCI research project the willful marionette demonstrates the possibilities of affective gestural interaction that seeks to sustain an engaging dialogue, rather than complete a task by the most effective route.

In summary, the willful marionette is an example of a simple interactive embodied cognitive system that draws participants into a gesture-based dialogue. Our evaluation of the cognitive model shows that people readily ascribe intelligence to the marionette because of the combination of its human-like form and its unexpected and provocative behavior. The core reasoning process for the marionette is a mapping algorithm that maps a detected gesture to a selection from a list of predefined gestures. This mapping was designed to produce unpredictable social interaction, and to leave ambiguous the question of the marionette’s capacity for higher thought. Based on the results of our interviews with the human participants, we believe that the key factors that caused participants to perceive machine intelligence are (1) the unpredictability of the perceive-reason-perform cognitive model, (2) the use of gesture dialogue as the mode of interaction, (3) the human-like features and gestures of the marionette, and (4) the proactive movements of the marionette: the idle gestures.

We developed a system that enables a novel gesture-to-gesture dialogue in order to explore how embodied cognitive agents with human-like features can affect a physical social context. Even though the resulting system’s behavior is based on a simple cognitive model, users were more than willing to ascribe higher intellect to its actions due to its embodied nature. This result has implications for both art and the design of future interactive systems.