Abstract
Predicting subjective visual interpretation is important for several prominent tasks in computer vision, including multimedia retrieval. Many approaches reduce this problem to predicting adjective or attribute labels from images, neglecting attribute semantics and processing the image only holistically. Furthermore, there is a lack of datasets with fine-grained subjective labels at a scale sufficient for machine learning. In this paper, we present the Focus–Aspect–Value (FAV) model, which breaks the process of subjective image interpretation down into three steps, and we describe a dataset built on this model. We train and evaluate several deep learning methods on this dataset, extending the experiments of the paper that originally introduced FAV by adding a new evaluation metric, improving the concatenation approach, and adding Multiplicative Fusion as a further method. In our experiments, Tensor Fusion is among the best-performing methods across all measures and outperforms the default way of fusing information (concatenation). In addition, we find that the way information is combined in neural networks not only affects prediction performance but can also drastically change other properties of the model.
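The three fusion operators compared above can be sketched as follows. This is a minimal NumPy illustration of the common formulations (tensor fusion here follows the usual outer-product construction with an appended constant 1, so that unimodal terms are preserved alongside pairwise interactions); it is an assumption-laden sketch, not the paper's exact architecture.

```python
import numpy as np

def concat_fusion(a, b):
    """Default fusion: concatenate the two feature vectors."""
    return np.concatenate([a, b])

def multiplicative_fusion(a, b):
    """Element-wise product; requires vectors of equal length."""
    return a * b

def tensor_fusion(a, b):
    """Outer product of the vectors, each padded with a constant 1
    so unimodal terms survive next to the interaction terms."""
    a1 = np.append(a, 1.0)
    b1 = np.append(b, 1.0)
    return np.outer(a1, b1).ravel()

# Toy feature vectors standing in for two information sources
# (e.g., image features and aspect/noun embeddings):
img_feat = np.arange(3, dtype=float)
txt_feat = np.arange(4, dtype=float)

print(concat_fusion(img_feat, txt_feat).shape)          # (7,)
print(tensor_fusion(img_feat, txt_feat).shape)          # (20,) = (3+1)*(4+1)
print(multiplicative_fusion(img_feat, img_feat).shape)  # (3,)
```

Note the trade-off the sketch makes visible: concatenation grows linearly in the input dimensions, while tensor fusion grows multiplicatively, which is part of why the choice of fusion operator can change model properties beyond raw prediction performance.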
Notes
Our dataset can be downloaded at http://madm.dfki.de/downloads.
Unless testing is done on nouns that are left out of training entirely. However, this would require additional knowledge about how such a noun relates to the nouns available for training.
Acknowledgements
This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest other than those stated in the acknowledgements.
Cite this article
Blandfort, P., Karayil, T., Hees, J. et al. The Focus–Aspect–Value model for predicting subjective visual attributes. Int J Multimed Info Retr 9, 47–60 (2020). https://doi.org/10.1007/s13735-019-00188-5