
CLIP knows image aesthetics

Please always cite this URN: urn:nbn:de:bvb:20-opus-297150
Most Image Aesthetic Assessment (IAA) methods use a pretrained ImageNet classification model as a base to fine-tune. We hypothesize that content classification is not an optimal pretraining task for IAA, since the task discourages the extraction of features that are useful for IAA, e.g., composition, lighting, or style. On the other hand, we argue that the Contrastive Language-Image Pretraining (CLIP) model is a better base for IAA models, since it has been trained using natural language supervision. Due to the rich nature of language, CLIP needs to learn a broad range of image features that correlate with sentences describing the image content, composition, environments, and even subjective feelings about the image. While it has been shown that CLIP extracts features useful for content classification tasks, its suitability for tasks that require the extraction of style-based features like IAA has not yet been shown. We test our hypothesis by conducting a three-step study, investigating the usefulness of features extracted by CLIP compared to features obtained from the last layer of a comparable ImageNet classification model. Each step is more computationally expensive than the previous one. First, we engineer natural language prompts that let CLIP assess an image's aesthetics without adjusting any weights in the model. To overcome the challenge that prompting CLIP is only applicable to classification tasks, we propose a simple but effective strategy to convert multiple prompts to a continuous scalar, as required when predicting an image's mean aesthetic score. Second, we train a linear regression on the AVA dataset using image features obtained by CLIP's image encoder. The resulting model outperforms a linear regression trained on features from an ImageNet classification model. It also shows competitive performance with fully fine-tuned networks based on ImageNet, while only training a single layer. Finally, by fine-tuning CLIP's image encoder on the AVA dataset, we show that CLIP needs only a fraction of the training epochs to converge, while also performing better than a fine-tuned ImageNet model. Overall, our experiments suggest that CLIP is better suited as a base model for IAA methods than ImageNet pretrained networks.
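The prompt-to-scalar conversion described in the first step can be illustrated with a short sketch. It assumes the OpenAI clip package (pip install git+https://github.com/openai/CLIP.git); the prompts, the score anchors on AVA's 1-10 rating scale, and the image path are illustrative stand-ins, not the exact choices from the paper.

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # One prompt per score anchor; both lists are hypothetical examples.
    prompts = ["a very ugly photo", "a bad photo", "an okay photo",
               "a good photo", "a very beautiful photo"]
    anchors = torch.tensor([1.0, 3.0, 5.0, 7.0, 9.0], device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Softmax over the prompt similarities yields a distribution over
        # the anchors; its expectation is a continuous scalar score.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        score = (probs * anchors).sum().item()

    print(f"predicted mean aesthetic score: {score:.2f}")

Taking the expectation rather than the argmax is what turns a classification-style prompt comparison into the continuous output that mean-score prediction requires.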

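The second step, a linear probe on frozen CLIP features, is similarly compact. The sketch below uses scikit-learn's LinearRegression as a stand-in for whichever regression implementation the authors used; the feature files are hypothetical and would be produced by running model.encode_image over the AVA images as above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical precomputed features: X holds one CLIP image embedding
    # (e.g., 512-dim for ViT-B/32) per row, y the mean AVA score per image.
    X_train = np.load("ava_train_clip_features.npy")
    y_train = np.load("ava_train_mean_scores.npy")
    X_test = np.load("ava_test_clip_features.npy")

    reg = LinearRegression().fit(X_train, y_train)  # trains a single layer
    predicted_scores = reg.predict(X_test)
    print(predicted_scores[:5])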
Metadata
Author(s): Simon Hentschel, Konstantin Kobs, Andreas Hotho
URN: urn:nbn:de:bvb:20-opus-297150
Document type: Journal article
University institute: Fakultät für Mathematik und Informatik / Institut für Informatik
Language of publication: English
Title of the parent work / journal (English): Frontiers in Artificial Intelligence
ISSN: 2624-8212
Year of publication: 2022
Volume: 5
Article number: 976235
Original publication / source: Frontiers in Artificial Intelligence (2022) 5:976235. doi: 10.3389/frai.2022.976235
DOI: https://doi.org/10.3389/frai.2022.976235
General subject classification (DDC): 0 Computer science, information & general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Free keywords: AVA; CLIP; Image Aesthetic Assessment; language-image pre-training; prompt engineering; text supervision
Release date: 19.04.2023
Date of first publication: 25.11.2022
Open Access Publication Fund / funding period 2022
License: CC BY: Creative Commons Attribution 4.0 International