doi:10.1016/j.imavis.2005.07.026
Copyright © 2006 Elsevier B.V. All rights reserved.
Conceptual representations between video signals and natural language descriptions
M. Arens, R. Gerber and H.-H. Nagel
Institut für Algorithmen und Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), 76128 Karlsruhe, Germany
Received 13 July 2004;
revised 17 June 2005;
accepted 21 July 2005.
Available online 24 April 2006.
Abstract
An artificial cognitive vision system associates video signals with conceptual descriptions of the depicted time-varying scene. This linkage is mediated by knowledge representation formalisms. An experimental implementation of such an approach yielded initial results for the conceptual description of videos recorded at innercity traffic scenes, see [M. Haag, H.-H. Nagel, Incremental recognition of traffic situations from video image sequences, Image and Vision Computing 18 (2) (2000) 137–153]. Accumulating experience with this system approach and its extension for the generation of natural language texts from videos caused us to redesign the overall computer vision system as well as the knowledge representation formalisms utilised within that system.
Keywords: Cognitive vision; Knowledge representation
Fig. 1. Sketch of the architecture of the overall computer vision system discussed in [20].
Fig. 2. Original situation graph tree from [20] for crossing innercity intersections. Further explanations in the text.
Fig. 3. Top: One frame of the Nibelungen-Platz image sequence with overlaid tracking results for objects 10, 11, 12, 15, 18, and 20 at time point no. 300. Bottom: Sequence of situation nodes visited during tracking. fobj_xx is an individual constant referring to a particular lane (taken from [20]).
Fig. 4. New layer-architecture of a cognitive vision system (adapted from [3]).
Fig. 5. Overview picture of new SGT describing the behaviour of vehicles at intersections. See Figs. 6-8 for detail views.
Fig. 6. Two top-most levels of new SGT. Rectangles show situation schemes with identifiers (e.g. sit_ED_SIT1), state predicates (e.g. on_iseg(Agent, Lseg)), and action predicates (e.g. note(sit_finding_possible_paths(Agent, LobjList))) separated by lines. While the state predicates of a situation scheme have to be satisfied during traversal to instantiate this scheme, the action predicates of a scheme are executed by the traversal algorithm whenever the situation scheme has been instantiated. Here, the note-predicates cause the traversal-algorithm to print out situation-dependent messages which then serve as input for the generation of natural language text. Thin arrows indicate prediction edges, while small circles in the upper right corner of situation schemes show prediction edges from and to a single scheme. Bold, rounded rectangles enclose situation (sub-)graphs. Bold arrows stand for specialisation edges. Small filled rectangles below situation schemes denote sub-graphs not yet shown in this figure.
Fig. 7. Second level of new SGT illustrating specialisations of sit_driving_to_intersection (sit_ED_SIT27 in Fig. 6).
Fig. 8. Second level of new SGT illustrating specialisations of sit_giving_way (sit_ED_SIT28 in Fig. 6).
Fig. 9. Top: Same image frame as in Fig. 3. Bottom: Sequence of situation nodes visited during traversal of new SGT. lobj_x is an individual constant referring to a particular lane.
Table 1. Types of knowledge which are provided to the quantitative layers (QL), the conceptual layers (CL), and the natural language layers (NL) (adapted from [20]).
