Abstract
We present a probabilistic model for the retrieval of multimodal
documents. The model is based on Bayesian decision theory and
combines models for text-based search with models for visual
search. The textual model is based on the language modelling
approach to text retrieval, and the visual information is modelled
as a mixture of Gaussian densities. Both models have proved
successful on various standard retrieval tasks. We evaluate the
multimodal model on the search task of TREC′s video track.
We found that the disclosure of video material based on visual
information only is still too difficult. Even with purely visual
information needs, text-based retrieval still outperforms visual
approaches. The probabilistic model is useful for text, visual,
and multimedia retrieval. Unfortunately, simplifying assumptions
that reduce its computational complexity degrade retrieval
effectiveness. Regarding the question whether the model can
effectively combine information from different modalities, we
conclude that whenever both modalities yield reasonable scores, a
combined run outperforms the individual runs.