
Computers & Graphics

Volume 48, May 2015, Pages 99-106

Special Section on SUI 2014
VideoHandles: Searching through action camera videos by replicating hand gestures

https://doi.org/10.1016/j.cag.2015.01.004

Highlights

  • Current video review techniques are unsuitable for action-camera video.

  • We explore example action-cam footage and highlight usage styles.

  • We propose two video navigation styles: prospective tagging and retrospective search.

  • We detail a technique to support action-camera video search by replicating gestures.

Abstract

We present a detailed exploration of VideoHandles, a novel interaction technique to support rapid review of wearable video camera data by re-performing gestures as a search query. The availability of wearable video capture devices has led to a significant increase in activity logging across a range of domains. However, searching through and reviewing footage for data curation can be a laborious and painstaking process. We showcase the use of gestures as search queries to support review and navigation of video data. By exploring example action-camera footage across a range of activities, we propose two video data navigation styles using gestures: prospective gesture tagging and retrospective gesture searching. This paper builds on our earlier VideoHandles work, reporting details of our interaction design and implementation, and presenting two additional evaluations. We demonstrate that VideoHandles is a viable interaction technique, returning promising results in its prototype form.

Introduction

Max, a marine zoologist, is performing a scuba dive to record some underwater footage. He mounts an action camera to his chest and starts recording. During the dive Max performs a variety of hand gestures for his buddy, indicating aquatic life of interest. On one occasion, he sees a trigger fish and performs a fish swimming gesture followed by a trigger mime. On another occasion, he sees a puffer fish and performs the gesture (fish swimming gesture followed by a two-handed mimicked inflation to indicate ’puffer’) to his dive buddy so that she can identify it too. In total, he captures 90 minutes of footage. Upon returning home, Max uploads the footage to his computer with the intention of revisiting some of the key moments. He performs a puffer fish gesture as a search query and VideoHandles produces the puffer fish footage as a top-ranking result among other results that include the fish swimming gesture – a key component across a range of diving gestures. After watching the puffer fish footage, Max notices the trigger fish footage among the returned results and decides to review that footage as well.

A wide variety of users, from amateurs to professionals, have adopted action cameras across a diverse range of activities, from mountain biking and scuba diving through to professional fieldwork. These cameras, such as the GoPro (www.gopro.com), are frequently mounted on head-gear or fixed to the chest and record throughout an activity, often for 1–2 hours, with little or no additional interaction. From these positions, and given a wide field of view (circa 170°), the cameras are able to capture the majority of the wearer's view, including any interactions or gestures they may be performing with their hands.

Where professionals may capture footage in order to maintain a clear record of their actions, others (e.g. sports enthusiasts) are more likely to capture footage of key, exciting moments. Although the motivations differ, all scenarios necessitate review in order to locate desired moments. The most widely adopted current method for video review is video scrubbing (dragging the playhead along the timeline). However, as videos increase in length this process becomes inefficient and inaccurate [1].

We previously presented VideoHandles [2], a novel video search technique to expedite the review of specific moments in wearable video camera data. Our technique exploits the action camera's wide field of view to capture the wearer's interactions and gestures. VideoHandles allows users to query their footage by repeating interactions and gestures they performed during capture. As in our described scenario, these reproduced gestures are matched to instances in the original footage (Fig. 1).

Based on observations of footage across a range of activities, we propose two video data navigation styles using gestures: prospective gesture tagging, where gestures are specifically performed to create a search marker, and retrospective gesture searching, where gestures are simply a part of the activity, recalled through muscle memory.

As an extension of our previous work [2], we present a more detailed exploration of the related work and our observations of a corpus of action-camera footage. Additionally, we provide further details of our prototype system design and present two additional lab-studies alongside our pilot in-the-wild study.

Section snippets

Related work

Our work builds on research in video navigation and search, and draws on research on gesture segmentation, gesture matching and memory.

An exploration of action camera footage

In order to gain a more detailed understanding of the typical usage of action and ‘self-capture’ cameras, we observed and analyzed footage captured from a range of activities. In total we collated more than 50 hours of footage, intended to represent a varied sample of the activities in which action cameras are used. Our activities included snowboarding, power boating, tennis, cycling, hiking, running, scuba diving, windsurfing and archaeological excavation. The footage was…

VideoHandles

VideoHandles is a video search technique based on the reproduction of gestures. Our technique enables users to remember, or specifically plan, gestures produced during recording and to reproduce these gestures as search criteria to relocate specific moments in footage. Our technique reduces the human time and effort required to review vast amounts of video data. Further to this, our technique supports wider exploration and comparison of footage by returning a range of results.
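The overall query flow can be summarised in a minimal sketch (here in Python). The callables extract_features and score are hypothetical placeholders for the tracking, segmentation and matching stages detailed in the prototype algorithm section; the sketch is illustrative only, not our exact implementation.

```python
# Illustrative sketch of the VideoHandles query flow, not our exact implementation.
# `extract_features` and `score` stand in for the tracking/segmentation and
# shape/motion matching stages described in the prototype algorithm section.

def search(footage_frames, query_frames, extract_features, score, top_k=5):
    """Rank windows of the original footage by similarity to a re-performed gesture."""
    window_len = len(query_frames)
    query_feats = [extract_features(f) for f in query_frames]

    ranked = []
    for start in range(len(footage_frames) - window_len + 1):
        window = footage_frames[start:start + window_len]
        window_feats = [extract_features(f) for f in window]
        ranked.append((start, score(query_feats, window_feats)))

    # Present the best-scoring candidate moments for the user to review.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

In use, the query would be a short clip of the re-performed gesture, and the returned frame indices would be surfaced to the user as candidate moments to review, supporting comparison between matches.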

Action cameras

Prototype algorithm

We developed a prototype system to explore the feasibility of our concept and its value from an HCI perspective. We used a combination of existing computer vision algorithms to track, segment, and shape- and motion-match gestures in different videos. The technical approach we adopt is just one of many possible approaches and any appropriate computer vision algorithm could be used. We detail our technical implementation here to better situate the results we present in our later studies.

We…
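As an indicative example only, the following Python/OpenCV sketch shows one plausible realisation of the per-frame stages named above: a broad skin-colour threshold segments a candidate hand region, and Hu-moment shape matching compares hand contours between frames. The specific colour bounds and the use of cv2.matchShapes are illustrative assumptions, not a description of our exact pipeline.

```python
import cv2
import numpy as np

# Broad (widest-fit) HSV skin-colour bounds -- illustrative values only.
SKIN_LO = np.array([0, 30, 60], dtype=np.uint8)
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

def hand_contour(frame_bgr):
    """Segment a candidate hand region and return its largest contour (or None)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, SKIN_LO, SKIN_HI)
    # Morphological opening/closing suppresses the pixel noise typical of
    # action-camera frames before contour extraction.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return max(contours, key=cv2.contourArea)

def shape_distance(contour_a, contour_b):
    """Hu-moment shape distance between two hand contours (lower is more similar)."""
    return cv2.matchShapes(contour_a, contour_b, cv2.CONTOURS_MATCH_I1, 0.0)
```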

Examining the performance of our algorithm

Having developed a prototype to enable further exploration of the VideoHandles technique, we were keen to explore its accuracy through a range of different studies. As previously discussed, self-captured video is typically noisy, subject to fast changes in motion and constant vibration, and captured under dynamic lighting. For this reason, a key challenge in the automatic processing of footage of this kind is the reliable segmentation of features of interest. In study 1, we explore the ability of our…
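Purely as an illustration of how a per-frame detection figure of this kind might be scored against hand-labelled ground truth (this is not our study protocol), a minimal helper could be:

```python
def detection_rate(predicted, ground_truth):
    """Fraction of ground-truth hand frames in which a hand was also detected.

    `predicted` and `ground_truth` are equal-length lists of booleans,
    one entry per frame (True = hand present / hand detected).
    """
    hits = sum(1 for p, g in zip(predicted, ground_truth) if g and p)
    total = sum(ground_truth)
    return hits / total if total else 0.0
```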

Pilot study of VideoHandles in the wild

To further explore our interaction technique and to begin to evaluate our prototype, we conducted a pilot study of VideoHandles in realistic use. Two participants wore GoPro action cameras on chest mounts whilst cycling, recording 45 minutes of footage on average. The participants were asked to perform a gesture of their choosing to indicate every time they saw a red car, when they felt energetic, and when they felt tired. Examples of participant 1's (P1) gestures can be seen in Fig. 5.

After…

Discussion

Our results showcase the promise of our prototype algorithm and highlight the potential of our interaction technique as a whole. We demonstrate 80% hand detection and 75.5% hand matching in our lab study, and 89% gesture matching across 28 gestures performed in the wild. Our exploration and evaluation of our approach have highlighted a number of interesting features of our design.

Firstly, the majority of action-cam devices have a wide field of view (typically between 160° and 170°). While this can…

Conclusion

VideoHandles is a novel search interaction technique for action camera footage which allows users to search through footage by repeating actions performed during the original recording. VideoHandles allows real-time tagging and categorizing of data, thus reducing time spent on post-processing, whilst facilitating wider exploration of recorded footage by supporting comparison between search matches. Our technique also supports a range of usage methods, allowing for both retrospective searching and prospective gesture tagging.

Future work

Our work here has been an early exploration of the feasibility of using gestures and actions as search criteria for video data. There is significant opportunity for future work in this space.

Our prototype system relies heavily upon, and is constrained by, the skin-color segmentation of the frame. Our current implementation uses a best-guess, widest-fit skin color model for matching, designed to provide acceptably accurate results across the broadest range of possible users. In a final…
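To make the notion of a widest-fit model concrete, the sketch below combines broad HSV and YCrCb chrominance windows, a common user-independent skin-detection heuristic. The bounds shown are illustrative assumptions rather than the values used in our prototype.

```python
import cv2
import numpy as np

def widest_fit_skin_mask(frame_bgr):
    """Broad, user-independent skin mask; bounds are illustrative only."""
    # Rule 1: wide hue/saturation window in HSV.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hsv_mask = cv2.inRange(hsv, np.array([0, 30, 60], np.uint8),
                                np.array([25, 180, 255], np.uint8))
    # Rule 2: broad chrominance window in YCrCb (a common skin-detection heuristic).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb_mask = cv2.inRange(ycrcb, np.array([0, 133, 77], np.uint8),
                                    np.array([255, 173, 127], np.uint8))
    # Accept pixels that satisfy both rules to limit false positives.
    return cv2.bitwise_and(hsv_mask, ycrcb_mask)
```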

References (23)

  • Cook SW, et al. Gesturing makes memories that last. J Mem Lang (2010)
  • Matejka J, Grossman T, Fitzmaurice G. Swift: reducing the effects of latency in online video scrubbing. In: Proceedings...
  • Knibbe J, Seah SA, Fraser M. VideoHandles: replicating gestures to search through action-camera video. In: Proceedings...
  • Hurst W. Interactive audio-visual video browsing. In: Proceedings of the 14th annual ACM international conference on...
  • Jackson D, Nicholson J, Stoeckigt G, Wrobel R, Thieme A, Olivier P. Panopticon: a parallel video overview system. In:...
  • Matejka J, Grossman T, Fitzmaurice G. Swifter: improved online video scrubbing. In: Proceedings of the SIGCHI...
  • Tian X, Yang L, Wang J, Yang Y, Wu X, Hua XS. Bayesian video search reranking. In: Proceedings of the 16th ACM...
  • Shim JC, Dorai C, Bolle R. Automatic text extraction from video for content-based annotation and retrieval. In:...
  • Halvey M, Jose JM. The role of expertise in aiding video search. In: Proceedings of CIVR '09. New York, NY, USA: ACM;...
  • Yuan J, Tian Q, Ranganath S. Fast and robust search method for short video clips from large video collection. In:...
  • Zhong D, Chang SF. Spatio-temporal video search using the object based video representation. In: Proceedings of...