VIsual TRAnslator: Linking perceptions and natural language descriptions

Herzog, Gerd; Wazinski, Peter

doi:10.1007/BF00849073

Published: March 1994

Artificial Intelligence Review, Volume 8, pages 175–187 (1994)

Abstract

Despite the fact that image understanding and natural language processing constitute two major areas of AI, there have only been a few attempts toward the integration of computer vision and the generation of natural language expressions for the description of image sequences. In this contribution we will report on practical experience gained in the project Vitra (VIsual TRAnslator) concerning the design and construction of integrated knowledge-based systems capable of translating visual information into natural language descriptions. In Vitra, different domains, like traffic scenes and short sequences from soccer matches, have been investigated.

Our approach towards simultaneous scene description emphasizes concurrent image sequence evaluation and natural language processing, carried out on an incremental basis, an important prerequisite for real-time performance. One major achievement of our cooperation with the vision group at the Fraunhofer Institute (IITB, Karlsruhe) is the automatic generation of natural language descriptions for recognized trajectories of objects in real world image sequences. In this survey, the different processes pertaining to high-level scene analysis and natural language generation will be discussed.

Author information

Authors and Affiliations

SFB 314, Project VITRA, Universität des Saarlandes, D-66041, Saarbrücken
Gerd Herzog & Peter Wazinski

About this article

Cite this article

Herzog, G., Wazinski, P. VIsual TRAnslator: Linking perceptions and natural language descriptions. Artif Intell Rev 8, 175–187 (1994). https://doi.org/10.1007/BF00849073

Issue Date: March 1994
DOI: https://doi.org/10.1007/BF00849073
