
The Focus–Aspect–Value model for predicting subjective visual attributes

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Predicting subjective visual interpretation is important for several prominent tasks in computer vision, including multimedia retrieval. Many approaches reduce this problem to predicting adjective or attribute labels from images, neglecting attribute semantics and processing the image only holistically. Furthermore, there is a lack of relevant datasets that offer fine-grained subjective labels at a scale sufficient for machine learning. In this paper, we present the Focus–Aspect–Value (FAV) model, which breaks down the process of subjective image interpretation into three steps, and describe a dataset built according to this model. We train and evaluate several deep learning methods on this dataset, extending the experiments of the paper that originally introduced FAV by adding a new evaluation metric, improving the concatenation approach, and adding Multiplicative Fusion as a further method. In our experiments, Tensor Fusion is among the best-performing methods across all measures and outperforms the default way of fusing information (concatenation). In addition, we find that the way information is combined in a neural network not only affects prediction performance but can also drastically change other properties of the model.
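To make the model's three-step decomposition concrete, here is a minimal sketch of what a single FAV annotation could look like as a Python data structure. The field names follow the model's three components; the example triple (focus "car", aspect "age", value "old") is an illustrative assumption, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class FAVAnnotation:
    focus: str   # what in the image the interpretation refers to, e.g. "car"
    aspect: str  # the semantic dimension being judged, e.g. "age"
    value: str   # the subjective judgment on that dimension, e.g. "old"

# Hypothetical annotation for an image of an old car (assumed example).
annotation = FAVAnnotation(focus="car", aspect="age", value="old")
```

The fusion strategies mentioned above can likewise be illustrated. The sketch below contrasts concatenation (the default), Multiplicative Fusion (an element-wise product), and Tensor Fusion (an outer product of the feature vectors, each padded with a constant 1 so the unimodal features are preserved). It uses plain NumPy on random vectors and reflects common formulations of these operations, not necessarily the paper's exact implementation:

```python
import numpy as np

def concat_fusion(a, b):
    """Default fusion: concatenate the two feature vectors."""
    return np.concatenate([a, b])

def multiplicative_fusion(a, b):
    """Element-wise product; both vectors must share the same dimension."""
    return a * b

def tensor_fusion(a, b):
    """Outer product of the two vectors, each extended by a constant 1 so
    the result also contains both unimodal vectors, then flattened."""
    a1 = np.concatenate([a, [1.0]])
    b1 = np.concatenate([b, [1.0]])
    return np.outer(a1, b1).ravel()

# Fuse a hypothetical 4-d image feature with a 4-d context embedding.
img_feat = np.random.rand(4)
ctx_emb = np.random.rand(4)
print(concat_fusion(img_feat, ctx_emb).shape)          # (8,)
print(multiplicative_fusion(img_feat, ctx_emb).shape)  # (4,)
print(tensor_fusion(img_feat, ctx_emb).shape)          # (25,)
```

Note that Tensor Fusion's output grows with the product of the input dimensions, (d_a + 1) * (d_b + 1), so the choice of fusion changes not only predictive performance but also the size and structure of the fused representation.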


Notes

  1. Our dataset can be downloaded at http://madm.dfki.de/downloads.

  2. Unless testing is done on nouns that are held out completely during training. However, this would require additional knowledge about how such a noun relates to the nouns available for training.


Acknowledgements

This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program.

Author information

Corresponding authors

Correspondence to Philipp Blandfort or Tushar Karayil.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest other than those stated in the acknowledgements.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Blandfort, P., Karayil, T., Hees, J. et al. The Focus–Aspect–Value model for predicting subjective visual attributes. Int J Multimed Info Retr 9, 47–60 (2020). https://doi.org/10.1007/s13735-019-00188-5

