Abstract
Predicting subjective visual interpretation is important for several prominent tasks in computer vision, including multimedia retrieval. Many approaches reduce this problem to predicting adjective or attribute labels from images, neglecting attribute semantics and processing the image only holistically. Furthermore, there is a lack of datasets with fine-grained subjective labels at a scale sufficient for machine learning. In this paper, we present the Focus–Aspect–Value (FAV) model, which breaks the process of subjective image interpretation down into three steps, and we describe a dataset built on this model. We train and evaluate several deep learning methods on this dataset, extending the experiments of the paper that originally introduced FAV by adding a new evaluation metric, improving the concatenation approach, and adding Multiplicative Fusion as a further method. In our experiments, Tensor Fusion is among the best-performing methods across all measures and outperforms the default way of fusing information (concatenation). In addition, we find that the way information is combined in neural networks not only affects prediction performance but can also drastically change other properties of the model.
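The three fusion operators compared above can be sketched as follows. This is a minimal NumPy illustration of the common formulations (tensor fusion here follows the usual outer-product construction with an appended constant 1, so that unimodal terms are preserved alongside pairwise interactions); it is an assumption-laden sketch, not the paper's exact architecture.

```python
import numpy as np

def concat_fusion(a, b):
    """Default fusion: concatenate the two feature vectors."""
    return np.concatenate([a, b])

def multiplicative_fusion(a, b):
    """Element-wise product; requires vectors of equal length."""
    return a * b

def tensor_fusion(a, b):
    """Outer product of the vectors, each padded with a constant 1
    so unimodal terms survive next to the interaction terms."""
    a1 = np.append(a, 1.0)
    b1 = np.append(b, 1.0)
    return np.outer(a1, b1).ravel()

# Toy feature vectors standing in for two information sources
# (e.g., image features and aspect/noun embeddings):
img_feat = np.arange(3, dtype=float)
txt_feat = np.arange(4, dtype=float)

print(concat_fusion(img_feat, txt_feat).shape)          # (7,)
print(tensor_fusion(img_feat, txt_feat).shape)          # (20,) = (3+1)*(4+1)
print(multiplicative_fusion(img_feat, img_feat).shape)  # (3,)
```

Note the trade-off the sketch makes visible: concatenation grows linearly in the input dimensions, while tensor fusion grows multiplicatively, which is part of why the choice of fusion operator can change model properties beyond raw prediction performance.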
Notes
Our dataset can be downloaded at http://madm.dfki.de/downloads.
Unless testing is done on nouns that are left out of training entirely. However, this would require additional knowledge about how such a noun relates to the nouns available for training.
Acknowledgements
This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest other than those stated in the acknowledgements.
Cite this article
Blandfort, P., Karayil, T., Hees, J. et al. The Focus–Aspect–Value model for predicting subjective visual attributes. Int J Multimed Info Retr 9, 47–60 (2020). https://doi.org/10.1007/s13735-019-00188-5