pith. machine review for the scientific record.

arxiv: 2604.21786 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords climate change · social media · vision-language models · discourse analysis · automated annotation · population trends · zero-shot evaluation · image classification

The pith

Vision-language models can recover population-level trends in climate change images on social media even when individual image accuracy stays moderate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether promptable vision-language models can automate the labeling of millions of social media images to study how people visually discuss climate change. Researchers created an expert-annotated set of 1,038 images and a larger collection of over 1.2 million images, then measured performance across five dimensions including animal content, climate consequences, and types of action shown. They found that the best model recovers the overall distribution of labels across the full corpus quite well, even though it makes mistakes on individual pictures. This distributional reliability matters because it turns automated tools into a practical starting point for tracking which visual strategies appear most often and which might influence public concern.
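Why can aggregates be trustworthy when individual predictions are not? A minimal simulation (ours, not from the paper) makes the mechanism visible: when the errors flowing out of a class are offset by errors flowing in, a classifier with only about 64% per-image accuracy still recovers the corpus-level label distribution almost exactly. The two-class setup, class priors, and confusion matrix below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class corpus: 60% of images show label A, 40% label B.
p = np.array([0.60, 0.40])
n = 200_000
y = rng.choice(2, size=n, p=p)

# Row k of `conf` is P(predicted label | true label k). Per-image accuracy
# is only ~64%, but the error flow A->B (0.60 * 0.30) equals the error
# flow B->A (0.40 * 0.45), so mistakes cancel in the aggregate counts.
conf = np.array([[0.70, 0.30],
                 [0.45, 0.55]])

# Sample one prediction per image: predict class 0 with prob conf[y, 0].
yhat = (rng.random(n) >= conf[y, 0]).astype(int)

acc = (yhat == y).mean()
recovered = np.bincount(yhat, minlength=2) / n
print(f"per-image accuracy: {acc:.3f}")        # ~0.64
print(f"true distribution:      {p}")          # [0.6 0.4]
print(f"recovered distribution: {recovered}")  # ~[0.6 0.4]
```

The offsetting-error assumption is exactly what manual validation of the large corpus has to check: if errors flowed systematically toward one class, the aggregate distribution would drift along with the individual labels.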

Core claim

Promptable vision-language models, especially Gemini-3.1-flash-lite, achieve the highest scores across all annotation dimensions on both the expert-labeled 1,038-image set and the 1.2-million-image corpus. While per-image accuracy remains only moderate, the models' aggregate predictions closely match manually validated population trends, demonstrating that VLMs can support scalable discourse analysis. Chain-of-thought prompting lowers performance, whereas dimension-specific prompt design raises it. The authors therefore advocate shifting evaluation from strict instance accuracy to distributional agreement for this use case.
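To make "dimension-specific prompt design" concrete, here is a hedged sketch contrasting per-dimension templates with a generic chain-of-thought suffix. The template texts, label sets, and helper function are hypothetical; the paper's actual prompts ship with its released code.

```python
# Hypothetical per-dimension templates (not the paper's actual prompts).
DIMENSION_PROMPTS = {
    "animals": "Which animal, if any, is the main subject of this image? "
               "Answer with exactly one label from: {labels}.",
    "consequences": "Which climate change consequence, if any, does this "
                    "image depict? Answer with exactly one label from: {labels}.",
    "type": "What type of image is this? "
            "Answer with exactly one label from: {labels}.",
}

# A generic chain-of-thought variant, the style the paper reports hurts
# accuracy on this task relative to short, dimension-specific prompts.
COT_SUFFIX = " Think step by step before giving your final label."

def build_prompt(dimension: str, labels: list[str],
                 chain_of_thought: bool = False) -> str:
    prompt = DIMENSION_PROMPTS[dimension].format(labels=", ".join(labels))
    return prompt + COT_SUFFIX if chain_of_thought else prompt

print(build_prompt("animals", ["polar bear", "bird", "insect", "no animal"]))
```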

What carries the argument

Distributional evaluation of VLM outputs on five annotation dimensions (animal content, climate change consequences, climate action, image setting, and image type) across an expert-annotated 1,038-image dataset and a 1.2-million-image corpus with partial manual validation.

If this is right

  • VLMs become a practical first step for analyzing discourse patterns across millions of images instead of relying solely on manual coding.
  • Gemini-3.1-flash-lite currently leads the tested models, yet the performance gap to open-weight alternatives stays modest.
  • Dimension-specific prompt engineering improves results more than general chain-of-thought reasoning.
  • Population-level trend recovery holds even when per-image correctness is imperfect, enabling studies that were previously too labor-intensive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distributional approach could be applied to visual discourse on other contested topics such as elections or public health.
  • Future work could test whether model biases in one dimension (for example, over- or under-detecting certain animals) systematically distort downstream conclusions about mobilization.
  • Releasing tweet IDs and code allows independent teams to extend the benchmark to new models or additional annotation dimensions without starting from scratch.

Load-bearing premise

The 1,038 expert-annotated images and the five chosen annotation dimensions together represent the full variety of visual climate discourse appearing on social media.

What would settle it

Draw a fresh random sample of several thousand images from the 1.2-million corpus, obtain fresh expert labels on the same five dimensions, and check whether the VLM-derived label distributions differ from the new expert distributions by more than a few percentage points.
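A minimal sketch of that comparison, assuming per-class label counts for one annotation dimension; the counts and the 3-percentage-point threshold below are illustrative, not from the paper.

```python
import numpy as np

def distribution_gaps(expert_counts, vlm_counts):
    """Per-class gaps (percentage points) and total variation distance
    between the expert and VLM label distributions."""
    p = np.asarray(expert_counts) / np.sum(expert_counts)
    q = np.asarray(vlm_counts) / np.sum(vlm_counts)
    pp_gaps = 100 * np.abs(p - q)            # per-class gap in pct. points
    tv = 0.5 * float(np.sum(np.abs(p - q)))  # total variation in [0, 1]
    return pp_gaps, tv

# Hypothetical counts: fresh expert labels vs. VLM predictions on the
# same freshly sampled images, for one five-class dimension.
expert = np.array([412, 305, 150, 88, 45])
vlm    = np.array([398, 318, 142, 95, 47])

pp_gaps, tv = distribution_gaps(expert, vlm)
print("per-class gaps (pp):", np.round(pp_gaps, 1))   # [1.4 1.3 0.8 0.7 0.2]
print(f"total variation: {tv:.3f}")                   # 0.022
print("within 3 pp everywhere:", bool(np.all(pp_gaps < 3.0)))
```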

Figures

Figures reproduced from arXiv: 2604.21786 by Christian Bartelt, Isaac Bravo, Katharina Prasse, Margret Keuper, Patrick Knab, Stefanie Walter, Steffen Jung.

Figure 1
Figure 1: Confusion matrices for the super-category animals on both datasets show more confusions in the automatically annotated dataset. [Chart: macro accuracy per model (Qwen3-VL-8B-Instruct, Qwen3-VL-30B-A3B-Instruct, moondream3-preview, gemma-3-4b-it, gpt-5.4-mini, gemini-3.1-flash-lite-preview).]
Figure 2
Figure 2: VLM benchmarking for consequences on ClimateTV and ClimateCT. We benchmark six promptable VLMs and 15 zero-shot VLMs, such as CLIP, whereof Gemini-3.1-flash-lite is clearly the best model for both datasets and all super-categories.
Figure 3
Figure 3: We ablated 9 prompt types for all super-categories and found that short prompts …
Figure 4
Figure 4: ClimateCT image examples for the super-class animals' class polar bear. While all images are collected using the same set of keywords, the larger ClimateTV data set naturally contains a more diverse set of images compared to the smaller ClimateCT …
Figure 5
Figure 5: ClimateTV image examples for the super-class animals' class polar bear. Our most specific class, polar bear, visualizes this well.
Figure 6
Figure 6: ClimateCT image examples for the super-category type's class illustration. Very different sets of images are found in the super-category type; since the image type is kept constant, the content varies freely.
Figure 7
Figure 7: ClimateTV image examples for the super-class type's class illustration. Within-category diversity is compared using pairwise cosine similarity s_c ∈ [−1, 1] and total variation in the DINOv2 embedding space (Oquab et al., 2023).
Figure 8
Figure 8: Within-category diversity is the highest and lowest within type …
Figure 9
Figure 9: Category combinations across super-categories reveal highly diverse visual content.
Figure 10
Figure 10: VLM benchmarking across super-categories on …
Original abstract

Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks promptable VLMs and zero-shot CLIP models for classifying climate-related social media images across five dimensions (animal content, consequences, action, setting, type). Using an expert-annotated 1,038-image dataset and a 1.2M-image corpus with 50k human-validated labels, it reports that Gemini-3.1-flash-lite performs best, chain-of-thought prompting hurts accuracy, dimension-specific prompts help, and that VLMs can recover population-level label distributions despite moderate per-image accuracy, enabling scalable discourse analysis. Code and tweet IDs/labels are released.

Significance. If the distributional evaluation claim holds, the work offers a concrete, reproducible path to scale visual climate discourse analysis beyond small manual samples, which is valuable for social science and communication research. The release of data and code, plus the explicit comparison of instance-level vs. distributional metrics, strengthens its utility as a starting point for applied studies.

major comments (2)
  1. [Dataset construction and large-corpus validation] The central claim that VLM predictions recover true population-level trends on the full 1.2M corpus (despite moderate per-image accuracy) rests on the 50k validated labels being representative. No sampling procedure, stratification, or bias checks for these 50k images are described in the dataset construction or validation sections; without this, the extrapolation from the validated subset to the remaining ~1.15M images is unsupported and the distributional evaluation result cannot be interpreted as evidence for the full corpus.
  2. [Annotation dimensions and expert dataset] The five annotation dimensions and the 1,038-image expert set are presented as the basis for both benchmarking and taxonomy design, yet no justification or coverage analysis is given for why these dimensions capture the full range of visual climate discourse on X; this directly affects the generalizability of the model rankings and the advocated use for discourse analysis at scale.
minor comments (2)
  1. [Prompt engineering and annotation protocol] Exact prompt templates, chain-of-thought variants, and any inter-annotator agreement statistics for the expert annotations are not provided; these details are needed for reproducibility even if the main results are sound.
  2. [Limitations] The abstract and results mention data collection biases and model selection but do not quantify them (e.g., keyword filtering effects on the 1.2M corpus); adding a short limitations paragraph on this would improve clarity without altering the core findings.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The two major comments identify important gaps in documentation that affect the interpretability of our results. We address each point below and will revise the manuscript to incorporate the requested clarifications and justifications.

Point-by-point responses
  1. Referee: The central claim that VLM predictions recover true population-level trends on the full 1.2M corpus (despite moderate per-image accuracy) rests on the 50k validated labels being representative. No sampling procedure, stratification, or bias checks for these 50k images are described in the dataset construction or validation sections; without this, the extrapolation from the validated subset to the remaining ~1.15M images is unsupported and the distributional evaluation result cannot be interpreted as evidence for the full corpus.

    Authors: We agree that the sampling procedure and representativeness checks for the 50k validated labels require explicit description. The 50k images were drawn via random sampling (with a fixed seed for reproducibility) from the full 1.2M corpus after initial filtering for climate-related hashtags and keywords. In the revised manuscript we will add a dedicated paragraph in the Dataset Construction section detailing: (i) the exact random sampling protocol, (ii) any post-sampling stratification by temporal or engagement features, and (iii) bias diagnostics (e.g., Kolmogorov-Smirnov tests on CLIP embedding distributions and basic visual statistics between the validated subset and the full corpus). These additions will directly support the claim that the validated labels are representative and that distributional metrics can be extrapolated (a sketch of such a diagnostic appears after these responses). revision: yes

  2. Referee: The five annotation dimensions and the 1,038-image expert set are presented as the basis for both benchmarking and taxonomy design, yet no justification or coverage analysis is given for why these dimensions capture the full range of visual climate discourse on X; this directly affects the generalizability of the model rankings and the advocated use for discourse analysis at scale.

    Authors: We acknowledge the need for a clearer justification of the taxonomy. The five dimensions were selected after reviewing key studies in visual climate communication (e.g., work on imagery framing, emotional valence, and action-oriented visuals) and after iterative pilot coding of several hundred tweets to identify recurring visual motifs. In the revision we will insert a new subsection titled “Taxonomy Design and Coverage” that: (a) cites the relevant literature motivating each dimension, (b) describes the expert annotation process and any iterative refinement steps, and (c) explicitly discusses scope limitations, including potential under-representation of niche or emerging visual tropes. This will strengthen the rationale for both the benchmarking results and the broader applicability claims. revision: yes
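The bias diagnostic proposed in response 1 could look like the following sketch, assuming precomputed CLIP image embeddings for the full corpus and for the validated subset. Since the Kolmogorov-Smirnov test is one-dimensional, it is applied here along the top principal directions of the corpus embeddings; the array shapes and component count are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def embedding_ks_diagnostics(full_emb: np.ndarray,
                             subset_emb: np.ndarray,
                             n_components: int = 10):
    """KS tests comparing the validated subset to the full corpus along
    the top principal directions of the corpus's CLIP embeddings."""
    mu = full_emb.mean(axis=0)
    centred_full = full_emb - mu
    centred_subset = subset_emb - mu
    # Principal directions of the centred corpus embeddings.
    _, _, vt = np.linalg.svd(centred_full, full_matrices=False)
    results = []
    for k in range(n_components):
        corpus_proj = centred_full @ vt[k]
        subset_proj = centred_subset @ vt[k]
        stat, pval = ks_2samp(corpus_proj, subset_proj)
        results.append((k, float(stat), float(pval)))
    return results

# Usage with hypothetical arrays, e.g.
#   full_emb:   shape (1_200_000, 512), CLIP embeddings of the full corpus
#   subset_emb: shape (50_000, 512), embeddings of the validated sample
# Small KS statistics (and p-values surviving multiple-testing correction)
# are consistent with the validated subset being representative.
```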

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking against external human labels

Full rationale

The paper conducts direct model benchmarking on two fixed datasets (1,038 expert-annotated images and a 1.2M corpus with 50k human-validated labels) using standard accuracy and distributional metrics. No equations, fitted parameters, or predictions are derived; all claims rest on explicit comparisons to independently annotated ground truth. No self-citations are load-bearing for the core results, and the distributional evaluation is a straightforward statistical comparison rather than a self-referential construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard assumptions from computer vision and social media research; no free parameters or invented entities.

axioms (1)
  • domain assumption The five annotation dimensions capture the essential aspects of visual climate discourse.
    Taxonomy is presented as given without validation against alternative schemes in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1083 out tokens · 37208 ms · 2026-05-09T22:52:27.314578+00:00 · methodology

