
CLIP knows image aesthetics

Please always quote using this URN: urn:nbn:de:bvb:20-opus-297150
Most Image Aesthetic Assessment (IAA) methods use a pretrained ImageNet classification model as a base to fine-tune. We hypothesize that content classification is not an optimal pretraining task for IAA, since it discourages the extraction of features that are useful for IAA, e.g., composition, lighting, or style. On the other hand, we argue that the Contrastive Language-Image Pretraining (CLIP) model is a better base for IAA models, since it has been trained using natural language supervision. Due to the rich nature of language, CLIP needs to learn a broad range of image features that correlate with sentences describing the image content, composition, environments, and even subjective feelings about the image. While it has been shown that CLIP extracts features useful for content classification tasks, its suitability for tasks that require the extraction of style-based features like IAA has not yet been shown. We test our hypothesis in a three-step study, investigating the usefulness of features extracted by CLIP compared to features obtained from the last layer of a comparable ImageNet classification model. Each step is more computationally expensive than the previous one. First, we engineer natural language prompts that let CLIP assess an image's aesthetics without adjusting any weights in the model. To overcome the challenge that CLIP's prompting is only applicable to classification tasks, we propose a simple but effective strategy to convert multiple prompts to a continuous scalar, as required when predicting an image's mean aesthetic score. Second, we train a linear regression on the AVA dataset using image features obtained by CLIP's image encoder. The resulting model outperforms a linear regression trained on features from an ImageNet classification model, and it shows competitive performance with fully fine-tuned networks based on ImageNet while training only a single layer. Finally, by fine-tuning CLIP's image encoder on the AVA dataset, we show that CLIP needs only a fraction of the training epochs to converge, while also performing better than a fine-tuned ImageNet model. Overall, our experiments suggest that CLIP is better suited as a base model for IAA methods than ImageNet pretrained networks.
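The prompt-to-scalar strategy from the first step can be sketched in a few lines. This is a minimal illustration rather than the paper's exact method: the prompt wordings, the 1-10 score anchors, and the softmax-weighted average are assumptions, and the sketch uses OpenAI's `clip` package.

```python
# Minimal sketch: zero-shot aesthetic scoring with CLIP prompts.
# Prompts and score anchors below are hypothetical, not the paper's.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a bad photo", "a mediocre photo", "a good photo", "an excellent photo"]
anchors = torch.tensor([2.0, 4.0, 7.0, 9.0], device=device)  # assumed 1-10 score scale

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(clip.tokenize(prompts).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Softmax over image-prompt similarities gives classification-style
    # weights; their weighted sum over the anchors is a continuous score.
    weights = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    score = (weights * anchors).sum(dim=-1)

print(f"Predicted mean aesthetic score: {score.item():.2f}")
```

The softmax turns the prompt similarities into a probability distribution, so the weighted sum of anchors behaves like an expected score rather than a hard class decision, which is one plausible way to obtain the continuous scalar the abstract describes.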
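The second step, a linear regression on frozen CLIP image features, could look roughly as follows. The file paths and scores are placeholders, and the use of scikit-learn's LinearRegression is an assumption; the abstract only states that a single linear layer is trained on features from CLIP's image encoder.

```python
# Minimal sketch: linear probe on frozen CLIP image features.
# Paths and scores are placeholders; plug in an AVA loader instead.
import numpy as np
import torch
import clip
from PIL import Image
from sklearn.linear_model import LinearRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_features(paths):
    """Encode image paths into L2-normalized CLIP features."""
    feats = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p)).unsqueeze(0).to(device)
            f = model.encode_image(x)
            feats.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy())
    return np.concatenate(feats).astype(np.float64)

train_paths = ["ava_0001.jpg", "ava_0002.jpg"]  # placeholder AVA images
train_scores = np.array([5.4, 6.1])             # placeholder mean scores

reg = LinearRegression().fit(clip_features(train_paths), train_scores)
print(reg.predict(clip_features(["new_image.jpg"])))
```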
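For the third step, here is a minimal fine-tuning sketch, assuming a single linear regression head on top of CLIP's image encoder and an MSE objective; the head, optimizer, and learning rate are assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning CLIP's image encoder for score regression.
# Linear head, AdamW, and lr are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # train in fp32 for numerical stability

head = nn.Linear(model.visual.output_dim, 1).to(device)
optimizer = torch.optim.AdamW(
    list(model.visual.parameters()) + list(head.parameters()), lr=1e-5
)
criterion = nn.MSELoss()

def train_epoch(loader):
    """One pass over (image batch, mean score) pairs, e.g. from AVA."""
    for images, scores in loader:
        images, scores = images.to(device), scores.float().to(device)
        preds = head(model.encode_image(images)).squeeze(-1)
        loss = criterion(preds, scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```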

Metadata
Author: Simon Hentschel, Konstantin Kobs, Andreas Hotho
URN: urn:nbn:de:bvb:20-opus-297150
Document Type: Journal article
Faculties: Fakultät für Mathematik und Informatik / Institut für Informatik
Language: English
Parent Title (English): Frontiers in Artificial Intelligence
ISSN: 2624-8212
Year of Completion: 2022
Volume: 5
Article Number: 976235
Source: Frontiers in Artificial Intelligence (2022) 5:976235. doi: 10.3389/frai.2022.976235
DOI: https://doi.org/10.3389/frai.2022.976235
Dewey Decimal Classification: 0 Computer science, information and general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Tag: AVA; CLIP; Image Aesthetic Assessment; language-image pre-training; prompt engineering; text supervision
Release Date: 2023/04/19
Date of first Publication: 2022/11/25
Open Access Publication Fund / funding period 2022
Licence: CC BY: Creative Commons Attribution 4.0 International