1 Introduction

In popular personal videophone systems using PCs and tablets, listening and talking with downcast eyes is inevitable, because the camera is installed above the display and the gaze lines of the two speakers do not match. In a teleconference system for multiple persons, half mirrors and cameras were used to realize eye-contact conversation [1]. In another study, the picture plane was rotated to compensate for the gaze direction, and an improvement in subjective perception was reported based on votes by 52 subjects [2]. A gaze correction method [3] and a multi-viewpoint video merging method [4] have also been reported to improve eye-contact communication. However, none of these studies provides an objective evaluation.

Generally, in natural conversation, eye contact and face-to-face communication are observed frequently, and these human behaviors should be taken into account when evaluating a system. In e-learning applications, an eye mark recorder, which records the movement of the fixation point over the visual field, was applied to analyze the effectiveness of presentation methods [5].

In this paper, we define two objective measures, the Eye-Contact Conversation Ratio (ECCR) and the Face-to-Face Conversation Ratio (FFCR), and present experimental results obtained with an eye mark recorder [6] while the camera position was changed from above the display to its center. We also discuss what makes a conversation over a videophone or virtual conversational system natural.

2 Human Behaviors in Conversation

Mutual gaze during natural conversation is one of the important interactions [7]. In personal videophone systems, however, inconsistent gaze behavior, e.g., gazing at the partner's clothes or outside the display, is frequently observed.

Figure 1 shows the relationship between the gaze at the partner's eyes Geye(t), the gaze at the partner's face Gface(t), the talk by the subject Ts(t), the talk by the partner Tp(t), and the resulting behavioral states. Since each feature is represented as ON (1) or OFF (0), there are 16 behavioral states. The most important states are those in which Geye(t) = 1 or Gface(t) = 1 while Ts(t) = 1 or Tp(t) = 1. As shown in Fig. 1, we define the Eye-Contact Conversation (ECC) state as Geye(t) = 1 and (Ts(t) = 1 or Tp(t) = 1), and the Face-to-Face Conversation (FFC) state as Gface(t) = 1 and (Ts(t) = 1 or Tp(t) = 1). The former is marked by a black bar and the latter by a gray bar at the bottom of the figure.

Fig. 1. Relationship between 4 features and 2 states
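The state definitions above can be sketched in a few lines. The following Python fragment is illustrative only: it assumes the four features have been sampled into binary sequences and derives the per-sample ECC and FFC indicators.

```python
# Illustrative sketch: deriving the ECC and FFC states of Fig. 1 from the
# four binary features Geye, Gface, Ts, Tp, sampled at a fixed rate.
# Function and variable names are hypothetical, not from the paper.

def ecc_ffc_states(g_eye, g_face, t_s, t_p):
    """Return per-sample ECC and FFC indicators (1 = state active)."""
    ecc, ffc = [], []
    for e, f, ts, tp in zip(g_eye, g_face, t_s, t_p):
        talking = ts == 1 or tp == 1                  # someone is speaking
        ecc.append(1 if e == 1 and talking else 0)    # gaze at eyes during speech
        ffc.append(1 if f == 1 and talking else 0)    # gaze at face during speech
    return ecc, ffc
```

Because the eye area lies within the face area, Geye(t) = 1 implies Gface(t) = 1 in practice, so every ECC sample is also an FFC sample.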

2.1 Eye-Contact Conversation Ratio (ECCR)

In order to estimate eye-contact conversation objectively, we sum the ECC durations and calculate the ECC ratio by the following equation:

$$ ECCR = \frac{\sum_{m=1}^{M} T_{ECC}(m)}{\sum_{i=1}^{I} T_{s}(i) + \sum_{j=1}^{J} T_{p}(j)} \times 100 $$
(1)

where TECC(m) is the m-th duration of the ECC state, Ts(i) is the i-th duration of the subject's talk, and Tp(j) is the j-th duration of the partner's talk.

From the equation, ECCR represents the eye-contact conversation ratio over both talking and listening periods.

2.2 Face to Face Conversation Ratio (FFCR)

In the same way as for ECCR, we sum the FFC durations and calculate the FFC ratio by the following equation:

$$ FFCR = \frac{\sum_{n=1}^{N} T_{FFC}(n)}{\sum_{i=1}^{I} T_{s}(i) + \sum_{j=1}^{J} T_{p}(j)} \times 100 $$
(2)

where TFFC(n) is the n-th duration of the FFC state.

From the equation, FFCR represents the face-to-face conversation ratio over both talking and listening periods. As shown in Fig. 1, the FFC duration includes the ECC duration, because the eyes are part of the face.
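Since Eqs. (1) and (2) share the same denominator, both ratios can be computed by a single helper. The sketch below is illustrative; the durations in the comment are made up, not taken from the experiment.

```python
# Illustrative implementation of Eqs. (1) and (2): all durations in seconds.
# Pass the ECC durations for ECCR, or the FFC durations for FFCR.

def conversation_ratio(state_durations, talk_s, talk_p):
    """Ratio of summed state durations to total talk time, in percent."""
    total_talk = sum(talk_s) + sum(talk_p)   # denominator of Eqs. (1)/(2)
    return 100.0 * sum(state_durations) / total_talk

# e.g. ECC episodes of 3 s and 2 s during 10 s + 10 s of talk give 25 %
```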

3 Experimental System

In order to evaluate eye-contact and face-to-face conversation under different camera positions, we developed a videophone system in which the camera position can be changed. The gaze point is recorded by an eye mark recorder that uses the infrared reflection of the pupil and cornea, and whether the subject's gaze falls on the face, on the eyes, or elsewhere is decided by analyzing the recorded images. The conversations are also recorded and, after noise reduction, separated into the subject's talk and the partner's talk.

In this section, the developed videophone system and the flow of signal processing are described.

3.1 Videophone System

The developed videophone system is shown in Fig. 2. A half mirror is placed in front of the subject at a 45° angle to realize face-to-face conversation with the image of the partner. A horizontally flipped image is displayed on the monitor so that left and right are not reversed.
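The left-right compensation amounts to a horizontal flip of each camera frame before display, which cancels the mirror reversal introduced by the half mirror. A minimal NumPy sketch (not the actual ManyCam implementation used in the system) is:

```python
import numpy as np

# Illustrative left-right compensation: flipping the camera frame
# horizontally before display cancels the half mirror's reversal.

def flip_horizontal(frame):
    """frame: H x W x C image array; returns its mirror image."""
    return frame[:, ::-1]   # reverse the column (width) axis
```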

Fig. 2. Developed videophone system

The camera can be mounted at any height. In this experiment we use two positions: the center position and the above position, which simulates a PC camera. Two sets of the system are used in the experiment. The specification of each system is as follows:

  • Display size: 24.1 in. LCD

  • Videophone application: Skype

  • Left-right reversal: ManyCam

  • Camera: 640 × 480 pixels, 24-bit color, 30 fps

  • Audio: fs = 44.1 kHz, 16 bit

A scene of the experiment using the developed videophone system is shown in Fig. 3.

Fig. 3. Experimental setup

3.2 Gaze Point Estimation

The gaze points and audio of the subject, who wears the eye mark recorder (EMR-9) [6], are recorded.

The recorder measures the subject's sight angle from the infrared reflection image on the cornea and the pupil movement. The detection range is ±40° horizontally and ±20° vertically. The gaze points (left eye: +, right eye: □) and the parallax-corrected gaze point (○) are superimposed on the image (640 × 480 pixels) taken by the field-of-view camera installed at the brim of a cap, as shown in Fig. 4. The image was recorded during the conversation and analyzed together with the recorded voice after the experiment to determine the location of the parallax-corrected gaze point.

Fig. 4. Example of recorded image from EMR-9

Figure 4 shows examples of (a) an "Eye-Contact Conversation while talking" scene and (b) a "Face-to-Face Conversation while listening" scene.

The face area and eye area are determined manually from facial features such as skin color, eyebrows, eyes, nose, mouth, and chin, as shown in Fig. 5.

Fig. 5. Face area and eye area
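The region decision for each parallax-corrected gaze point can be illustrated by a simple point-in-rectangle test. The box coordinates below are hypothetical, since the actual areas were drawn by hand from the facial features.

```python
# Illustrative sketch of the region decision: axis-aligned rectangles for
# the face and eye areas, in pixel coordinates of the 640 x 480 field-of-
# view image. Boxes are given as (x0, y0, x1, y1); values are made up.

def classify_gaze(x, y, eye_box, face_box):
    """Return 'eye', 'face', or 'other' for a gaze point (x, y)."""
    def inside(box):
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1
    if inside(eye_box):      # check the eye area first: it lies inside the face
        return 'eye'
    if inside(face_box):
        return 'face'
    return 'other'
```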

3.3 Signal Processing Flow

Figure 6 shows the signal processing flow that produces the four signals, i.e., Geye(t), Gface(t), Ts(t), and Tp(t), and the ECC and FFC durations, TECC and TFFC, in talking and listening. In this study, Geye(t) and Gface(t) are extracted manually, while the talk by the subject Ts(t) and the talk by the partner Tp(t) are extracted automatically based on the audio signal power.

Fig. 6. Signal processing flow
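The automatic extraction of Ts(t) and Tp(t) from the audio signal power can be sketched as a frame-wise threshold test. The frame length and threshold below are illustrative, not the values used in the study.

```python
# Illustrative talk detection: mean signal power per frame compared
# against a threshold. Parameters are hypothetical examples.

def talk_signal(samples, frame_len, threshold):
    """Return a per-frame binary talk indicator from audio samples."""
    talk = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len   # mean signal power
        talk.append(1 if power > threshold else 0)
    return talk
```

Applying this to the subject's and the partner's channels separately, after noise reduction, yields the two binary talk signals used in Eqs. (1) and (2).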

4 Experimental Results and Discussion

Ten male subjects (ages 22–24) were divided into five groups and held free conversations for 6 min or more. After the first minute, the image including the gaze point shown in Fig. 4 and the speech were recorded for 5 min and analyzed.

4.1 ECCR and FFCR in Higher Camera Position

Both the subject and the partner in a PC-based videophone are expected to talk and listen with downcast eyes. This degrades the communication quality and should lead to low ECCR and FFCR. Table 1 shows the ECCR and the FFCR for the 5-min conversations. The total talking times of the subject and of the partner are also indicated in seconds.

Table 1. ECCR, FFCR, and talking time in higher camera position

From Table 1, the averaged ECCR of the five subjects is 29.0%, and its deviation is not large. On the other hand, the averaged FFCR reaches 74.7%, even though the partner talks or listens with downcast eyes. The five FFCRs depend on the subject and vary between 53.9% and 89.3%.

In order to inspect ECCR and FFCR in more detail, we separate them into the ratios during talking and during listening, summarized in Table 2. The suffixes "T" and "L" denote "talking" and "listening", respectively.

Table 2. ECCR and FFCR in talking and listening in higher camera position

Except for the ECCR of subject 5, the averaged ECCRT and FFCRT are lower than the averaged ECCRL and FFCRL, respectively. This means that almost all subjects watch the partner's eyes and face more while listening than while talking. It is also found that, while talking, the subjects watch the partner's face rather than the eyes.

4.2 ECCR and FFCR in Center Camera Position

Table 3 shows the ECCR, the FFCR, Total talking time of subject, and Total talking time of partner, and Table 4 shows the detail of ECCR and FFCR.

Table 3. ECCR, FFCR, and talking time in center camera position
Table 4. ECCR and FFCR in talking and listening in center camera position

By comparing Table 3 with Table 1, the following are found:

  (1) The averaged ECCR decreases when the camera is moved to the center. This trend holds for all subjects except subject 1.

  (2) The averaged FFCR increases when the camera is moved to the center. This trend holds for all subjects.

By comparing Table 4 with Table 2, the following are found:

  (3) The averaged ECCRT increases slightly when the camera is centered, but this is not a remarkable trend.

  (4) The averaged ECCRL decreases slightly when the camera is centered. This trend holds for all subjects except subject 5.

  (5) Both FFCRT and FFCRL increase when the camera is centered. This trend holds for all subjects.

5 Conclusion

In order to improve the naturalness of conversation over a videophone or virtual conversational system, we have proposed two objective measures, the ECCR and the FFCR, and developed a videophone system with a half mirror. By changing the camera position from above the display to its center, the FFCR increases from 74.7% to 88.0%. This means that face-to-face conversation is affected by the gaze of the partner. The conversation with mutual gaze clearly increased when the camera was centered, and the naturalness of the conversation was improved. However, because of the wide eye area and the lack of consideration of the partner's gaze, the ECCR in talking and listening does not change. To clarify the true eye-contact conversation ratio, the eye area and the partner's gaze should be considered in future work. In addition, measures of affect or emotion such as the PANAS (Positive and Negative Affect Schedule) should be studied.