
“Show Me the Cup”: Reference with Continuous Representations

Conference paper in: Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)


Abstract

One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates a failure if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual, and when they are characterized by a combination of visual and linguistic properties.


Notes

  1. We ignore the thorny philosophical issues of reference, such as its relationship to reality. For an overview and references (no pun intended), see [3].

  2. For neural network design and training see, e.g., [6].

  3. We do not enter the determiner in the query, since it does not vary across data points: our setup is equivalent to always having “the” in the input. The network learns the intended semantics through training.

  4. http://imagenet.stanford.edu/.

  5. We use the MatConvNet toolkit, http://www.vlfeat.org/matconvnet/.

References

  1. Russell, B.: On denoting. Mind 14, 479–493 (1905)

  2. Harnad, S.: The symbol grounding problem. Physica D 42, 335–346 (1990)

  3. Reimer, M., Michaelson, E.: Reference. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2014 edn. (2014)

  4. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)

  5. Bos, J., Clark, S., Steedman, M., Curran, J.R., Hockenmaier, J.: Wide-coverage semantic representations from a CCG parser. In: Proceedings of COLING, Geneva, Switzerland, pp. 1240–1246 (2004)

  6. Nielsen, M.: Neural Networks and Deep Learning. Determination Press, New York (2015). http://neuralnetworksanddeeplearning.com/

  7. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS, Lake Tahoe, NV, pp. 2121–2129 (2013)

  8. Lazaridou, A., Dinu, G., Baroni, M.: Hubness and pollution: delving into cross-space mapping for zero-shot learning. In: Proceedings of ACL, Beijing, China, pp. 270–280 (2015)

  9. Weston, J., Bengio, S., Usunier, N.: WSABIE: scaling up to large vocabulary image annotation. In: Proceedings of IJCAI, Barcelona, Spain, pp. 2764–2770 (2011)

  10. Lazaridou, A., Pham, N., Baroni, M.: Combining language and vision with a multimodal skip-gram model. In: Proceedings of NAACL, Denver, CO, pp. 153–163 (2015)

  11. Baroni, M., Lenci, A.: Distributional memory: a general framework for corpus-based semantics. Comput. Linguist. 36, 673–721 (2010)

  12. Brysbaert, M., Warriner, A.B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 46, 904–911 (2014)

  13. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of ACL, Baltimore, MD, pp. 238–247 (2014)

  14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  15. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  16. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML, Lille, France, pp. 2048–2057 (2015)

  17. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Proceedings of NIPS, Montreal, Canada, pp. 2692–2700 (2015)

  18. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks (2015). http://arxiv.org/abs/1503.08895

  19. Weston, J., Chopra, S., Bordes, A.: Memory networks. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  20. Gorniak, P., Roy, D.: Grounded semantic composition for visual scenes. J. Artif. Intell. Res. 21, 429–470 (2004)

  21. Larsson, S.: Formal semantics for perceptual classification. J. Logic Comput. 25, 335–369 (2015)

  22. Matuszek, C., Bo, L., Zettlemoyer, L., Fox, D.: Learning from unscripted deictic gesture and language for human-robot interactions. In: Proceedings of AAAI, Quebec City, Canada, pp. 2556–2563 (2014)

  23. Steels, L., Belpaeme, T.: Coordinating perceptually grounded categories through language: a case study for colour. Behav. Brain Sci. 28, 469–529 (2005)

  24. Kennington, C., Schlangen, D.: Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In: Proceedings of ACL, Beijing, China, pp. 292–301 (2015)

  25. Krahmer, E., van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38 (2012)

  26. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of EMNLP, Doha, Qatar, pp. 787–798 (2014)

  27. Tily, H., Piantadosi, S.: Refer efficiently: use less informative expressions for more predictable meanings. In: Proceedings of the CogSci Workshop on the Production of Referring Expressions, Amsterdam, The Netherlands (2009)

  28. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of CVPR, Las Vegas, NV (2016, in press)

  29. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proc. Natl. Acad. Sci. 112, 3618–3623 (2015)

  30. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Proceedings of NIPS, Montreal, Canada, pp. 1682–1690 (2014)

  31. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Proceedings of NIPS, Montreal, Canada (2015). https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015

  32. Baroni, M.: Grounding distributional semantics in the visual world. Lang. Linguist. Compass 10, 3–13 (2016)

  33. Abbott, B.: Reference. Oxford University Press, Oxford (2010)

  34. Datta, R., Joshi, D., Li, J., Wang, J.: Image retrieval: ideas, influences, and trends of the new age. ACM Comput. Surv. 40, 1–60 (2008)


Acknowledgments

We are grateful to Elia Bruni for the CNN baseline idea, and to Angeliki Lazaridou for providing us with the visual vectors used in the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe) and ERC grant agreement No 715154 (AMORE); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); DFG (SFB 732, Project D10); and Spanish MINECO (grant FFI2013-41301-P). This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.

Author information

Correspondence to Marco Baroni.

Appendices

A Data Creation for the Object-Only Dataset (Experiment 1)

The process to generate an object sequence is shown in Algorithm 1. We start with an empty sequence and sample the length of the sequence uniformly at random from the permitted sequence lengths (l. 2). We fill the sequence with objects and images sampled uniformly at random (l. 4/5). We assume, without loss of generality, that the object that we will query for, q, is the first one (l. 6). Then we sample whether the current sequence should be an anomaly (l. 7). If it should be a missing-anomaly (i.e., no matches for the query), we overwrite the target object and image with a new random draw from the pool (l. 9/10). If we decide to turn it into a multiple-anomaly (i.e., with multiple matches for the query), we randomly select another position in the sequence and overwrite it with the query object and a new image (l. 12/13). Finally, we shuffle the sequence so that the queried object is assigned a random position (l. 14).

Algorithm 1. Sampling procedure for the Object-Only dataset.
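
For concreteness, the snippet below is a minimal Python sketch of these steps. It follows the description above rather than the authors' implementation; the pool data structure, the anomaly probabilities, and all names are illustrative assumptions.

```python
import random

# Sketch of the Algorithm 1 sampling procedure (Appendix A), assuming a pool
# that maps each object label to a list of image identifiers. The anomaly
# probabilities are placeholders; permitted lengths are assumed to be >= 2.

def sample_object_only_sequence(pool, permitted_lengths,
                                p_missing=0.25, p_multiple=0.25, rng=random):
    objects = list(pool)
    length = rng.choice(permitted_lengths)              # l. 2: sample sequence length
    seq = [(o, rng.choice(pool[o]))                     # l. 4/5: random objects and images
           for o in (rng.choice(objects) for _ in range(length))]
    query = seq[0][0]                                   # l. 6: query = first object
    anomaly = "none"
    r = rng.random()                                    # l. 7: decide whether to inject an anomaly
    if r < p_missing:                                   # missing-anomaly: no match for the query
        new_obj = rng.choice([o for o in objects if o != query])
        seq[0] = (new_obj, rng.choice(pool[new_obj]))   # l. 9/10: overwrite the target
        anomaly = "missing"                             # (real data creation presumably also checks
                                                        # that the query does not reappear by chance)
    elif r < p_missing + p_multiple:                    # multiple-anomaly: several matches
        pos = rng.randrange(1, length)                  # l. 12/13: duplicate the query elsewhere
        seq[pos] = (query, rng.choice(pool[query]))
        anomaly = "multiple"
    rng.shuffle(seq)                                    # l. 14: random position for the target
    return query, seq, anomaly
```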

B Data Creation for the Object+Attribute Dataset (Experiment 2)

Figure 5 shows the intuition for sampling the Object+Attribute dataset. Arrows indicate compatibility constraints in sampling. We start from the query pair (object 1 – attribute 1). Then we sample two more attributes that are both compatible with object 1. Finally, we sample two more objects, each compatible both with the original attribute 1 and with one of the two new attributes.

Algorithm 2 defines the sampling procedure formally. We sample the first triple randomly (l. 2). Then we sample two compatible attributes for this object (l. 3), and one more object for each attribute (l. 4). This yields a set of six confounders (l. 5–10). After sampling the length of the final sequence l (l. 11), we build the sequence from the first triple and \(l-1\) confounders (l. 12–13), with the first triple as query (l. 14). The treatment of the anomalies is exactly as before.
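
Read this way, the procedure can be sketched in a few lines of Python. The snippet below follows the description above, not the authors' code: the compatibility-lookup structures, the exact set of six confounder pairs, and all names are assumptions, and the anomaly step is omitted because it is identical to Algorithm 1.

```python
import random

# Sketch of the Algorithm 2 sampling procedure (Appendix B), under assumed
# lookups: pool[obj] lists images of obj, attrs_of[obj] is the set of
# attributes compatible with obj, objs_of[attr] the set of objects
# compatible with attr.

def sample_object_attribute_sequence(pool, attrs_of, objs_of,
                                     permitted_lengths, rng=random):
    o1 = rng.choice(list(pool))                                  # l. 2: first triple
    a1 = rng.choice(sorted(attrs_of[o1]))
    target = (o1, a1, rng.choice(pool[o1]))

    a2, a3 = rng.sample(sorted(attrs_of[o1] - {a1}), 2)          # l. 3: two more attributes for o1
    o2 = rng.choice(sorted((objs_of[a1] & objs_of[a2]) - {o1}))  # l. 4: one more object per attribute
    o3 = rng.choice(sorted((objs_of[a1] & objs_of[a3]) - {o1}))

    confounder_pairs = [(o1, a2), (o1, a3),                      # l. 5-10: six confounders that share
                        (o2, a1), (o2, a2),                      # at most one element with the target
                        (o3, a1), (o3, a3)]
    confounders = [(o, a, rng.choice(pool[o])) for o, a in confounder_pairs]

    length = rng.choice(permitted_lengths)                       # l. 11: final sequence length
    seq = [target] + rng.sample(confounders, length - 1)         # l. 12-13: target plus l-1 confounders
    query = (o1, a1)                                             # l. 14: query = first triple
    return query, target, seq   # anomaly injection and shuffling proceed as in Algorithm 1
```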

Fig. 5. Sampling intuition for Object+Attribute.

Table 2. Statistics on the Object-Only and Object+Attribute datasets. O: object, A: attribute, I: image.

C Statistics on the Datasets

Table 2 shows statistics on the two datasets. The first line covers the Object-Only dataset. Objects occur on average 90 times in the train portion of Object-Only, specific images only twice; the numbers for the test set are commensurately lower. While all objects in the test set are seen during training, 23% of the images are not. Because the data are created by random sampling, a small number of sequences are repeated (5 sequences occur twice in the training set, 1 occurs four times) or shared between the training and validation sets (1 sequence). All other sequences occur just once.

The second line covers the Object+Attribute dataset. The average frequencies for objects and object images mirror those in Object-Only quite closely. The new columns on object-attribute (O+A) and object-attribute-image (O+A+I) combinations show that individual object-attribute combinations occur relatively infrequently (each object is paired with many attributes), but that the space of combinations is considerably restricted (almost no combinations are new in the test set). The full entity representations (object-attribute-image triples), however, are very infrequent (average frequency just above 1), and more than 80% of the test triples are unseen during training. A single sequence occurs twice in the test set, all others once; one sequence is shared between train and test.
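
Overlap figures of this kind are straightforward to recompute from the raw splits. The sketch below shows one way to do so, assuming a simple list-of-sequences representation in which each item is a tuple with the object label first and the image identifier last; the function and variable names are illustrative assumptions.

```python
from collections import Counter

# Sketch: type-level frequency and train/test overlap statistics for a split,
# assuming each split is a list of sequences of (object, ..., image) tuples.

def split_statistics(train_split, test_split):
    def counts(split, key):
        return Counter(key(item) for seq in split for item in seq)

    train_objects = counts(train_split, lambda item: item[0])
    train_images = counts(train_split, lambda item: item[-1])
    test_images = counts(test_split, lambda item: item[-1])

    avg_object_freq = sum(train_objects.values()) / len(train_objects)  # e.g. ~90 for Object-Only train
    unseen = sum(1 for img in test_images if img not in train_images)
    pct_unseen_images = 100.0 * unseen / len(test_images)               # e.g. ~23% for Object-Only test
    return avg_object_freq, pct_unseen_images
```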

D Hyperparameter Tuning

We tuned the following hyperparameters on the Object-Only validation set and re-used them for Object+Attribute without further tuning (except for the Pipeline heuristics’ thresholds). Chosen values are given in parentheses.

  • PoP: multimodal embedding size (300), anomaly sensor size (100), nonlinearities \(\psi \) (relu) and \(\phi \) (sigmoid), learning rate (0.09), epoch count (14).

  • TRPoP: same settings, except epoch count (36).

  • Pipeline: multimodal embedding size (300), margin size (0.5), learning rate (0.09), maximum similarity threshold (0.1 for Object-Only, 0.4 for Object+Attribute), top-two similarity difference threshold (0.05 and 0.07).

Momentum was set to 0.09 and learning rate decay to 1E-4 for all models, based on informal preliminary experimentation.
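
For reference, the tuned values above can be collected into a single configuration; the sketch below does so with illustrative key names (they do not come from the authors' MatConvNet code).

```python
# Tuned hyperparameters from Appendix D, gathered into configuration
# dictionaries. Key names are illustrative assumptions.

POP_CONFIG = {
    "multimodal_embedding_size": 300,
    "anomaly_sensor_size": 100,
    "psi_nonlinearity": "relu",
    "phi_nonlinearity": "sigmoid",
    "learning_rate": 0.09,
    "epochs": 14,
    "momentum": 0.09,
    "learning_rate_decay": 1e-4,
}

TRPOP_CONFIG = dict(POP_CONFIG, epochs=36)  # same settings, more epochs

PIPELINE_CONFIG = {
    "multimodal_embedding_size": 300,
    "margin": 0.5,
    "learning_rate": 0.09,
    "momentum": 0.09,
    "learning_rate_decay": 1e-4,
    # heuristic thresholds, retuned per dataset
    "max_similarity_threshold": {"object_only": 0.1, "object_attribute": 0.4},
    "top2_similarity_difference": {"object_only": 0.05, "object_attribute": 0.07},
}
```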

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Baroni, M., Boleda, G., Padó, S. (2018). “Show Me the Cup”: Reference with Continuous Representations. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science(), vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-77113-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77112-0

  • Online ISBN: 978-3-319-77113-7

  • eBook Packages: Computer Science, Computer Science (R0)
