
“Show Me the Cup”: Reference with Continuous Representations

Conference paper in: Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)


Abstract

One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates a failure if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual, and when they are characterized by a combination of visual and linguistic properties.


Notes

  1. We ignore the thorny philosophical issues of reference, such as its relationship to reality. For an overview and references (no pun intended), see [3].

  2. For neural network design and training see, e.g., [6].

  3. We do not enter the determiner in the query, since it does not vary across data points: our setup is equivalent to always having “the” in the input. The network learns the intended semantics through training.

  4. http://imagenet.stanford.edu/.

  5. We use the MatConvNet toolkit, http://www.vlfeat.org/matconvnet/.

References

  1. Russell, B.: On denoting. Mind 14, 479–493 (1905)

  2. Harnad, S.: The symbol grounding problem. Physica D 42, 335–346 (1990)

  3. Reimer, M., Michaelson, E.: Reference. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2014 edn. (2014)

  4. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)

  5. Bos, J., Clark, S., Steedman, M., Curran, J.R., Hockenmaier, J.: Wide-coverage semantic representations from a CCG parser. In: Proceedings of COLING, Geneva, Switzerland, pp. 1240–1246 (2004)

  6. Nielsen, M.: Neural Networks and Deep Learning. Determination Press, New York (2015). http://neuralnetworksanddeeplearning.com/

  7. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS, Lake Tahoe, NV, pp. 2121–2129 (2013)

  8. Lazaridou, A., Dinu, G., Baroni, M.: Hubness and pollution: delving into cross-space mapping for zero-shot learning. In: Proceedings of ACL, Beijing, China, pp. 270–280 (2015)

  9. Weston, J., Bengio, S., Usunier, N.: WSABIE: scaling up to large vocabulary image annotation. In: Proceedings of IJCAI, Barcelona, Spain, pp. 2764–2770 (2011)

  10. Lazaridou, A., Pham, N., Baroni, M.: Combining language and vision with a multimodal skip-gram model. In: Proceedings of NAACL, Denver, CO, pp. 153–163 (2015)

  11. Baroni, M., Lenci, A.: Distributional memory: a general framework for corpus-based semantics. Comput. Linguist. 36, 673–721 (2010)

  12. Brysbaert, M., Warriner, A.B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 46, 904–911 (2014)

  13. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of ACL, Baltimore, MD, pp. 238–247 (2014)

  14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  15. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  16. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML, Lille, France, pp. 2048–2057 (2015)

  17. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Proceedings of NIPS, Montreal, Canada, pp. 2692–2700 (2015)

  18. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks (2015). http://arxiv.org/abs/1503.08895

  19. Weston, J., Chopra, S., Bordes, A.: Memory networks. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main

  20. Gorniak, P., Roy, D.: Grounded semantic composition for visual scenes. J. Artif. Intell. Res. 21, 429–470 (2004)

  21. Larsson, S.: Formal semantics for perceptual classification. J. Logic Comput. 25, 335–369 (2015)

  22. Matuszek, C., Bo, L., Zettlemoyer, L., Fox, D.: Learning from unscripted deictic gesture and language for human-robot interactions. In: Proceedings of AAAI, Quebec City, Canada, pp. 2556–2563 (2014)

  23. Steels, L., Belpaeme, T.: Coordinating perceptually grounded categories through language: a case study for colour. Behav. Brain Sci. 28, 469–529 (2005)

  24. Kennington, C., Schlangen, D.: Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In: Proceedings of ACL, Beijing, China, pp. 292–301 (2015)

  25. Krahmer, E., van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38 (2012)

  26. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of EMNLP, Doha, Qatar, pp. 787–798 (2014)

  27. Tily, H., Piantadosi, S.: Refer efficiently: use less informative expressions for more predictable meanings. In: Proceedings of the CogSci Workshop on the Production of Referring Expressions, Amsterdam, The Netherlands (2009)

  28. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of CVPR, Las Vegas, NV (2016, in press)

  29. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proc. Natl. Acad. Sci. 112, 3618–3623 (2015)

  30. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Proceedings of NIPS, Montreal, Canada, pp. 1682–1690 (2014)

  31. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Proceedings of NIPS, Montreal, Canada (2015). https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015

  32. Baroni, M.: Grounding distributional semantics in the visual world. Lang. Linguist. Compass 10, 3–13 (2016)

  33. Abbott, B.: Reference. Oxford University Press, Oxford (2010)

  34. Datta, R., Joshi, D., Li, J., Wang, J.: Image retrieval: ideas, influences, and trends of the new age. ACM Comput. Surv. 40, 1–60 (2008)


Acknowledgments

We are grateful to Elia Bruni for the CNN baseline idea, and to Angeliki Lazaridou for providing us with the visual vectors used in the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe) and ERC grant agreement No 715154 (AMORE); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); DFG (SFB 732, Project D10); and Spanish MINECO (grant FFI2013-41301-P). This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.

Author information

Correspondence to Marco Baroni.

Appendices

A Data Creation for the Object-Only Dataset (Experiment 1)

The process to generate an object sequence is shown in Algorithm 1. We start with an empty sequence and sample the length of the sequence uniformly at random from the permitted sequence lengths (l. 2). We fill the sequence with objects and images sampled uniformly at random (l. 4/5). We assume, without loss of generality, that the object that we will query for, q, is the first one (l. 6). Then we sample whether the current sequence should be an anomaly (l. 7). If it should be a missing-anomaly (i.e., no matches for the query), we overwrite the target object and image with a new random draw from the pool (l. 9/10). If we decide to turn it into a multiple-anomaly (i.e., with multiple matches for the query), we randomly select another position in the sequence and overwrite it with the query object and a new image (l. 12/13). Finally, we shuffle the sequence so that the queried object is assigned a random position (l. 14).

Algorithm 1. Sampling procedure for the Object-Only dataset.
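
For concreteness, the snippet below is a minimal Python sketch of these steps. It follows the description above rather than the authors' implementation; the pool data structure, the anomaly probabilities, and all names are illustrative assumptions.

```python
import random

# Sketch of the Algorithm 1 sampling procedure (Appendix A), assuming a pool
# that maps each object label to a list of image identifiers. The anomaly
# probabilities are placeholders; permitted lengths are assumed to be >= 2.

def sample_object_only_sequence(pool, permitted_lengths,
                                p_missing=0.25, p_multiple=0.25, rng=random):
    objects = list(pool)
    length = rng.choice(permitted_lengths)              # l. 2: sample sequence length
    seq = [(o, rng.choice(pool[o]))                     # l. 4/5: random objects and images
           for o in (rng.choice(objects) for _ in range(length))]
    query = seq[0][0]                                   # l. 6: query = first object
    anomaly = "none"
    r = rng.random()                                    # l. 7: decide whether to inject an anomaly
    if r < p_missing:                                   # missing-anomaly: no match for the query
        new_obj = rng.choice([o for o in objects if o != query])
        seq[0] = (new_obj, rng.choice(pool[new_obj]))   # l. 9/10: overwrite the target
        anomaly = "missing"                             # (real data creation presumably also checks
                                                        # that the query does not reappear by chance)
    elif r < p_missing + p_multiple:                    # multiple-anomaly: several matches
        pos = rng.randrange(1, length)                  # l. 12/13: duplicate the query elsewhere
        seq[pos] = (query, rng.choice(pool[query]))
        anomaly = "multiple"
    rng.shuffle(seq)                                    # l. 14: random position for the target
    return query, seq, anomaly
```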

B Data Creation for the Object+Attribute Dataset (Experiment 2)

Figure 5 shows the intuition for sampling the Object+Attribute dataset. Arrows indicate compatibility constraints in sampling. We start from the query pair (object 1 – attribute 1). Then we sample two more attributes that are both compatible with object 1. Finally, we sample two more objects, each compatible both with the original attribute 1 and with one of the two new attributes.

Algorithm 2 defines the sampling procedure formally. We sample the first triple randomly (l. 2). Then we sample two compatible attributes for this object (l. 3), and one more object for each attribute (l. 4). This yields a set of six confounders (l. 5–10). After sampling the length of the final sequence l (l. 11), we build the sequence from the first triple and \(l-1\) confounders (l. 12–13), with the first triple as query (l. 14). The treatment of the anomalies is exactly as before.
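
Read this way, the procedure can be sketched in a few lines of Python. The snippet below follows the description above, not the authors' code: the compatibility-lookup structures, the exact set of six confounder pairs, and all names are assumptions, and the anomaly step is omitted because it is identical to Algorithm 1.

```python
import random

# Sketch of the Algorithm 2 sampling procedure (Appendix B), under assumed
# lookups: pool[obj] lists images of obj, attrs_of[obj] is the set of
# attributes compatible with obj, objs_of[attr] the set of objects
# compatible with attr.

def sample_object_attribute_sequence(pool, attrs_of, objs_of,
                                     permitted_lengths, rng=random):
    o1 = rng.choice(list(pool))                                  # l. 2: first triple
    a1 = rng.choice(sorted(attrs_of[o1]))
    target = (o1, a1, rng.choice(pool[o1]))

    a2, a3 = rng.sample(sorted(attrs_of[o1] - {a1}), 2)          # l. 3: two more attributes for o1
    o2 = rng.choice(sorted((objs_of[a1] & objs_of[a2]) - {o1}))  # l. 4: one more object per attribute
    o3 = rng.choice(sorted((objs_of[a1] & objs_of[a3]) - {o1}))

    confounder_pairs = [(o1, a2), (o1, a3),                      # l. 5-10: six confounders that share
                        (o2, a1), (o2, a2),                      # at most one element with the target
                        (o3, a1), (o3, a3)]
    confounders = [(o, a, rng.choice(pool[o])) for o, a in confounder_pairs]

    length = rng.choice(permitted_lengths)                       # l. 11: final sequence length
    seq = [target] + rng.sample(confounders, length - 1)         # l. 12-13: target plus l-1 confounders
    query = (o1, a1)                                             # l. 14: query = first triple
    return query, target, seq   # anomaly injection and shuffling proceed as in Algorithm 1
```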

Fig. 5. Sampling intuition for Object+Attribute.

Table 2. Statistics on the Object-Only and Object+Attribute datasets. O: object, A: attribute, I: image.

C Statistics on the Datasets

Table 2 shows statistics on the two datasets. The first line covers the Object-Only dataset. Objects occur on average 90 times in the train portion of Object-Only, specific images only twice; the numbers for the test set are commensurately lower. While all objects in the test set are seen during training, 23% of the images are not. Because the data are created by random sampling, a small number of sequences are repeated (5 sequences occur twice in the training set, 1 occurs four times) or shared between the training and validation sets (1 sequence). All other sequences occur just once.

The second line covers the Object+Attribute dataset. The average frequencies for objects and object images mirror those in Object-Only quite closely. The new columns on object-attribute (O+A) and object-attribute-image (O+A+I) combinations show that individual object-attribute combinations occur relatively infrequently (each object is paired with many attributes), but that the space of combinations is considerably restricted (almost no combinations are new in the test set). The full entity representations (object-attribute-image triples), however, are very infrequent (average frequency just above 1), and more than 80% of the test triples are unseen during training. A single sequence occurs twice in the test set, all others once; one sequence is shared between train and test.
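
Overlap figures of this kind are straightforward to recompute from the raw splits. The sketch below shows one way to do so, assuming a simple list-of-sequences representation in which each item is a tuple with the object label first and the image identifier last; the function and variable names are illustrative assumptions.

```python
from collections import Counter

# Sketch: type-level frequency and train/test overlap statistics for a split,
# assuming each split is a list of sequences of (object, ..., image) tuples.

def split_statistics(train_split, test_split):
    def counts(split, key):
        return Counter(key(item) for seq in split for item in seq)

    train_objects = counts(train_split, lambda item: item[0])
    train_images = counts(train_split, lambda item: item[-1])
    test_images = counts(test_split, lambda item: item[-1])

    avg_object_freq = sum(train_objects.values()) / len(train_objects)  # e.g. ~90 for Object-Only train
    unseen = sum(1 for img in test_images if img not in train_images)
    pct_unseen_images = 100.0 * unseen / len(test_images)               # e.g. ~23% for Object-Only test
    return avg_object_freq, pct_unseen_images
```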

D Hyperparameter Tuning

We tuned the following hyperparameters on the Object-Only validation set and re-used them for Object+Attribute without further tuning (except for the Pipeline heuristics’ thresholds). Chosen values are given in parentheses.

  • PoP: multimodal embedding size (300), anomaly sensor size (100), nonlinearities \(\psi \) (relu) and \(\phi \) (sigmoid), learning rate (0.09), epoch count (14).

  • TRPoP: same settings, except epoch count (36).

  • Pipeline: multimodal embedding size (300), margin size (0.5), learning rate (0.09), maximum similarity threshold (0.1 for Object-Only, 0.4 for Object+Attribute), top-two similarity difference threshold (0.05 and 0.07).

Momentum was set to 0.09 and learning rate decay to 1E-4 for all models, based on informal preliminary experimentation.
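
For reference, the tuned values above can be collected into a single configuration; the sketch below does so with illustrative key names (they do not come from the authors' MatConvNet code).

```python
# Tuned hyperparameters from Appendix D, gathered into configuration
# dictionaries. Key names are illustrative assumptions.

POP_CONFIG = {
    "multimodal_embedding_size": 300,
    "anomaly_sensor_size": 100,
    "psi_nonlinearity": "relu",
    "phi_nonlinearity": "sigmoid",
    "learning_rate": 0.09,
    "epochs": 14,
    "momentum": 0.09,
    "learning_rate_decay": 1e-4,
}

TRPOP_CONFIG = dict(POP_CONFIG, epochs=36)  # same settings, more epochs

PIPELINE_CONFIG = {
    "multimodal_embedding_size": 300,
    "margin": 0.5,
    "learning_rate": 0.09,
    "momentum": 0.09,
    "learning_rate_decay": 1e-4,
    # heuristic thresholds, retuned per dataset
    "max_similarity_threshold": {"object_only": 0.1, "object_attribute": 0.4},
    "top2_similarity_difference": {"object_only": 0.05, "object_attribute": 0.07},
}
```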

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Baroni, M., Boleda, G., Padó, S. (2018). “Show Me the Cup”: Reference with Continuous Representations. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science(), vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-77113-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77112-0

  • Online ISBN: 978-3-319-77113-7

  • eBook Packages: Computer Science, Computer Science (R0)
