COSMOROE: a cross-media relations framework for modelling multimedia dialectics

  • Regular Paper
  • Published: 2008
  • Journal: Multimedia Systems

Abstract

Though everyday interaction is predominantly multimodal, a purpose-developed framework for describing the semantic interplay between verbal and non-verbal communication is still lacking. This gap not only reflects our limited understanding of multimodal human behaviour, but also weakens any attempt to model such behaviour computationally. In this article, we present COSMOROE, a corpus-based framework for describing semantic interrelations between images, language and body movements. We argue that by viewing such relations from a message-formation perspective, rather than a communicative-goal one, one can develop a framework that combines descriptive power with computational applicability. We test COSMOROE against these criteria by using it to annotate a corpus of TV travel programmes; we present all particulars of the annotation process and conclude with a discussion of the usability and scope of such annotated corpora.
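To make the idea of annotating cross-media semantic relations concrete, the sketch below shows one possible data structure for a single annotation that links a stretch of transcribed speech to an image region or body movement within a video. It is a minimal illustration only: the field names, the relation labels and the example values are assumptions made for this sketch, not the actual COSMOROE tag set or the annotation format used in the article.

    # Hypothetical sketch of a cross-media relation annotation record.
    # Names, relation labels and values are illustrative assumptions,
    # not the COSMOROE scheme itself.
    from dataclasses import dataclass
    from enum import Enum


    class Modality(Enum):
        SPEECH = "speech"          # transcribed utterance span
        IMAGE_REGION = "image"     # region within a video frame range
        BODY_MOVEMENT = "gesture"  # gesture or other body movement


    @dataclass
    class MediaSegment:
        modality: Modality
        start_ms: int              # segment start within the video, in milliseconds
        end_ms: int                # segment end
        label: str                 # transcribed words, or a region/gesture tag


    @dataclass
    class CrossMediaRelation:
        relation: str              # illustrative label, e.g. "equivalence"
        verbal: MediaSegment       # the verbal side of the relation
        non_verbal: MediaSegment   # the image region or body movement it relates to
        comment: str = ""          # free-text annotator note


    # Usage: pairing the spoken phrase "the harbour" with a frame region.
    example = CrossMediaRelation(
        relation="equivalence",
        verbal=MediaSegment(Modality.SPEECH, 12_300, 12_900, "the harbour"),
        non_verbal=MediaSegment(Modality.IMAGE_REGION, 12_000, 14_000, "harbour-region"),
    )

A record of this kind keeps each annotation self-contained (both segments carry their own timing and modality), which is one plausible way to store the output of an annotation pass over a video corpus.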



Author information

Correspondence to Katerina Pastra.

Additional information

Communicated by B. Bailey.


About this article

Cite this article

Pastra, K. COSMOROE: a cross-media relations framework for modelling multimedia dialectics. Multimedia Systems 14, 299–323 (2008). https://doi.org/10.1007/s00530-008-0142-0
