Hierarchical aesthetic quality assessment using deep convolutional neural networks

https://doi.org/10.1016/j.image.2016.05.004

Highlights

  • Propose an aesthetic quality assessment framework by dividing images into 3 categories.

  • Three specific CNNs and an A&C CNN are constructed for aesthetic feature learning.

  • Different models for each CNN are developed to predict aesthetic class and score.

Abstract

Aesthetic image analysis has attracted much attention in recent years. However, assessing the aesthetic quality of an image and assigning it an aesthetic score are challenging problems. In this paper, we propose a novel framework for assessing the aesthetic quality of images. Firstly, we divide the images into three categories: “scene”, “object” and “texture”. Each category has an associated convolutional neural network (CNN) which learns the aesthetic features for the category in question. The object CNN is trained using whole images together with a salient region extracted from each image. The texture CNN is trained using small regions cropped from the original images. Furthermore, an A&C CNN is developed to simultaneously assess the aesthetic quality and identify the category of all images. For each CNN, classification and regression models are developed separately to predict the aesthetic class (high or low) and to assign an aesthetic score. Experimental results on a recently published large-scale dataset show that the proposed method outperforms the state-of-the-art methods for each category.

Introduction

Aesthetic image analysis has attracted increasing attention recently in the computer vision community [1], [2], [3]. Automated models for assessing aesthetic image quality are useful in many applications, e.g., image retrieval, photo management, photo enhancement, and photography [4], [5]. It is also interesting to investigate the high-level perception of visual aesthetics. In the last decade, studies have shown that data-driven approaches [2], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] can be used to assess the aesthetic quality of images, although such assessments are difficult even for humans. In early works, many handcrafted features were proposed based on intuitions about how people perceive the aesthetic quality of images. These features include color [5], [6], the rule of thirds [6], content [2], [7], and composition [8]. Later, generic image descriptors such as Bag-of-Visual-Words (BOV) and Fisher Vectors (FV) were used to assess aesthetic quality; in [9] it is shown that these generic descriptors can outperform the traditional handcrafted features. More recently, deep convolutional neural networks (CNNs) have been successfully applied to aesthetic quality assessment [16], [17]. Because CNNs can extract powerful features, we use them in this paper to learn features for aesthetic quality assessment.

Most existing methods for assessing the aesthetic quality of images [5], [6], [7], [8], [9], [12], [16], [17], [18] treat all images equally, without taking into account the diversity in image content or type. However, Oliva et al. [19] discriminate “scene” from “object” and “texture”, and design a GIST descriptor for scene recognition. Given that scene recognition, object recognition, and texture recognition are studied separately, the three categories should also be treated differently for aesthetic quality assessment. In this paper, we classify all images into three categories, namely “scene”, “object” and “texture”. Fig. 1 shows example images for the three categories, which illustrate their different spatial layouts and fixation points. “Scene” images are composed of numerous objects, textures and colored regions, arranged in a variety of spatial layouts [19], [20]. All the elements in a scene may influence human aesthetic judgments, in ways which have been studied by psychologists [21], [22]. “Object” images generally contain a large salient object, which attracts the attention of a human viewer and may be a key factor in the assessment of visual aesthetics [7], [23]. “Texture” images have certain statistical properties and may contain repeating structures [24], [25], [26]. Humans may have different criteria for assessing the aesthetics of images in the three categories.

The adoption of different photographic styles for the three categories emphasizes their differences. For example, professional photographers often reduce the depth of field (DOF) when shooting a single object, creating close-up photographs for the category “object” in which the foreground is sharp and the background is blurred [1]. In contrast, for images in the category “scene”, landscapes shot with a narrow DOF are not considered pleasing; instead, photographers prefer to have the foreground, middle ground, and background all in focus [1]. It is therefore likely that humans apply different aesthetic criteria to the three categories. In this paper, different convolutional neural networks are proposed to learn the features required to make aesthetic judgements about images in the “scene”, “object” and “texture” categories.

Aesthetic quality assessment can be formulated as a classification problem or a regression problem. Aesthetic quality is a subjective attribute of images and lacks a precise definition. In most previous work, image datasets are obtained from online photo-sharing communities and rated by members of those communities. The average user rating is usually taken as a measure of the aesthetic quality of an image and is used to label it. Typically, aesthetic quality assessment is reduced to a classification problem by thresholding the average score to create a high-quality class and a low-quality class [4], [5], [6], [7], [8], [9], [27]; images falling between the two classes are discarded. Only a few related works [6], [28], [17] formulate the task as a regression problem in order to compute an aesthetic score. Ideally, visual aesthetic quality assessment should be formulated as a regression problem so that the results can be compared with ratings made by the human visual system [27]. In this work, a classification model and a regression model are both developed for each of the three categories “scene”, “object” and “texture”.
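The labeling procedure described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the threshold of 5.0 (on AVA's 1–10 rating scale) and the ambiguity band `delta` are assumed values for illustration only.

```python
# Hypothetical sketch: deriving classification labels and regression targets
# from mean user ratings. Images whose mean score falls within `delta` of the
# threshold are discarded from the classification set; regression keeps all.

def make_labels(mean_scores, threshold=5.0, delta=1.0):
    """Return (classification labels, regression targets) from mean ratings."""
    class_labels = {}   # image id -> 1 (high quality) or 0 (low quality)
    reg_targets = {}    # image id -> raw mean score (regression target)
    for img_id, score in mean_scores.items():
        reg_targets[img_id] = score
        if score >= threshold + delta:
            class_labels[img_id] = 1
        elif score <= threshold - delta:
            class_labels[img_id] = 0
        # scores strictly between the two cut-offs are discarded
    return class_labels, reg_targets
```

With `make_labels({"a": 7.2, "b": 4.9, "c": 3.1})`, image "b" is kept only as a regression target, since its mean score lies inside the ambiguous band around the threshold.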

Based on the considerations mentioned above and on our previous work [17], we propose a novel framework for visual aesthetic quality assessment. Firstly, each image is assigned to one of the three categories “scene”, “object” and “texture”. Then, for each category, a specific convolutional neural network is constructed to learn aesthetic features automatically and to assess the aesthetic quality of an image. The aesthetic quality is described using a class (high or low) and a numerical score. In addition, a single CNN is developed to perform aesthetic quality assessment and category recognition simultaneously for all images. In contrast with the three category-specific CNNs, this single CNN is simple and considers the aesthetic labels and the image categories at the same time. Experimental results on the recently published large-scale AVA dataset [29] demonstrate the effectiveness of our framework: both our classification and regression methods outperform the state-of-the-art methods for each category, and our regression models achieve results comparable to those of our classification models.
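The control flow of the category-specific branch of this framework can be sketched as follows. The `categorize` function and the model objects are placeholders standing in for the trained CNNs, not the authors' implementation.

```python
# Minimal sketch of the proposed pipeline: route an image to the CNN for its
# category, then report an aesthetic class and a numerical score.

def assess(image, categorize, models):
    """Dispatch `image` to a category-specific model.

    `categorize` maps an image to "scene", "object" or "texture";
    `models` maps each category name to an object exposing
    `classify` (returns "high"/"low") and `score` (returns a number).
    """
    category = categorize(image)
    model = models[category]
    return {
        "category": category,
        "class": model.classify(image),   # aesthetic class: "high" or "low"
        "score": model.score(image),      # numerical aesthetic score
    }
```

In the paper's framework the classifier and regressor are trained separately per category, which is why the sketch exposes them as two methods rather than one combined output.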

The main contributions of our proposed method are summarized as follows.

  • Inspired by the different ways in which humans make aesthetic judgements and by the adoption of particular photographic techniques depending on the nature of the images, we propose a novel framework for visual aesthetic quality assessment by dividing images into three categories: “scene”, “object” and “texture”.

  • Three specific CNNs, namely Scene CNN, Object CNN and Texture CNN, are constructed. The CNNs learn aesthetic features automatically. Moreover, a single CNN, namely A&C CNN, is also developed to learn effective features simultaneously for two targets: the aesthetic quality assessment and the category recognition.

  • Each CNN both classifies an image into an aesthetic class (high or low) and uses regression to assign the image a numerical score for its aesthetic quality.
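The two-target idea behind the A&C CNN can be illustrated with its output stage: one shared feature vector feeds two separate heads, one for aesthetic class (high/low) and one for category (scene/object/texture). The feature dimension, weights, and random features below are invented for illustration; the convolutional layers of the real network are omitted.

```python
import numpy as np

# Illustrative sketch of a two-head output stage on shared features
# (assumed 256-dimensional; all parameters are random stand-ins).
rng = np.random.default_rng(0)
features = rng.standard_normal(256)                 # shared CNN features

W_aes, b_aes = rng.standard_normal((2, 256)), np.zeros(2)   # aesthetic head
W_cat, b_cat = rng.standard_normal((3, 256)), np.zeros(3)   # category head

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_aesthetic = softmax(W_aes @ features + b_aes)  # P(low), P(high)
p_category = softmax(W_cat @ features + b_cat)   # P(scene), P(object), P(texture)
```

Training such a network would minimize the sum of the two heads' losses, so the shared features are shaped by both the aesthetic labels and the category labels at once, which is the intuition the A&C CNN exploits.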

The rest of this paper is organized as follows. In Section 2, the related works are summarized. The methods for aesthetic quality assessment are described in detail in Section 3. Section 4 describes the experimental setup and results. Finally we conclude the paper in Section 5.

Section snippets

Related work

Most previous works [6], [5], [8], [9], [18] on aesthetic image analysis focus on the challenging problem of designing appropriate features. Typically, handcrafted features are proposed based on intuitions about human perception of the aesthetic quality of images. For example, Datta et al. [6] design certain visual features such as colorfulness, the rule of thirds, and low depth of field indicators, to discriminate between aesthetically pleasing and displeasing images. Dhar et al. [8] extract

Our approach

The proposed framework is illustrated in Fig. 2. Firstly, we divide all images into three categories: “scene”, “object” and “texture”. Then, for each category, a specific convolutional neural network is trained to learn aesthetic features automatically, as shown in Fig. 2(a). Alternatively, the framework can be realized with a single CNN (shown in Fig. 2(b)), which simultaneously assesses the aesthetic quality and identifies the category for all images. The classification and regression

Experimental results

In this section, we evaluate the proposed framework and other state-of-the-art methods on the recently published AVA dataset [29]. The experimental results show that our networks for each category can automatically assess the aesthetic quality (class and score) of images and can outperform existing state-of-the-art methods.

Conclusion

In this paper, we propose a novel framework for visual aesthetic quality assessment by dividing images into three categories: “scene”, “object” and “texture”. Considering the categories' differences in composition, spatial layout, fixation points, photographic style, etc., three specific CNNs (Scene CNN, Object CNN and Texture CNN) and a simple A&C CNN are designed to learn aesthetic features automatically. In detail, the Scene CNN has as input the global view; the

Acknowledgment

This work is funded by the National Basic Research Program of China (Grant no. 2012CB316302), the National Natural Science Foundation of China (Grant no. 61322209), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant no. XDB02050000).

References (49)

  • S. Dhar, V. Ordonez, T.L. Berg, High level describable attributes for predicting aesthetics and interestingness, in:...
  • L. Marchesotti, F. Perronnin, D. Larlus, G. Csurka, Assessing the aesthetic quality of photographs using generic image...
  • Y. Niu et al., What makes a professional video? A computational aesthetics approach, IEEE Trans. Circuits Syst. Video Technol. (2012)
  • H.-H. Yeh et al., Video aesthetic quality assessment by temporal integration of photo- and motion-based features, IEEE Trans. Multim. (2013)
  • Y. Wang, Q. Dai, R. Feng, Y.-G. Jiang, Beauty is here: evaluating aesthetics in videos using multimodal features and...
  • L. Zhang et al., Fusion of multichannel local and global structural cues for photo aesthetics evaluation, IEEE Trans. Image Process. (2014)
  • L. Zhang, Y. Gao, C. Zhang, H. Zhang, Q. Tian, R. Zimmermann, Perception-guided multimodal feature fusion for photo...
  • X. Lu, Z. Lin, H. Jin, J. Yang, J.Z. Wang, Rapid: rating pictorial aesthetics using deep learning, in: Proceedings of...
  • Y. Kao, C. Wang, K. Huang, Visual aesthetic quality assessment with a regression model, in: IEEE International...
  • W. Luo, X. Wang, X. Tang, Content-based photo quality assessment, in: IEEE International Conference on Computer Vision,...
  • A. Oliva et al., Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001)
  • A. Oliva, M.L. Mack, M. Shrestha, A. Peeper, Identifying the perceptual dimensions of visual complexity of scenes, in:...
  • O. Axelsson, Towards a psychology of photography: dimensions underlying aesthetic appeal of photographs, Percept. Motor Skills (2007)
  • C.E. Nothelfer et al., The role of spatial composition in preference for color pairs, J. Vis. (2009)