pith. machine review for the scientific record.

arxiv: 2605.08965 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual persuasion · reasoning faithfulness · supervised fine-tuning · persuasiveness prediction · rationale evaluation · explanation consistency

The pith

Diverse teacher-generated rationales improve MLLM performance on visual persuasiveness prediction and support better evaluation of faithful reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models struggle when prompted to reason before deciding whether an image is persuasive, and such prompts can sometimes lower accuracy. The paper shows that supervised fine-tuning on diverse rationales produced by a teacher model raises prediction performance on this task. It introduces a three-dimensional faithfulness framework to check whether explanations are consistent with the model's decision, grounded in the image, and sensitive to changes in the decision. Results indicate that strong prediction scores do not ensure the explanations are faithful to the model's process. Among the three dimensions, rationale-to-decision sensitivity aligns most closely with human judgments of good reasoning, motivating new training methods that prioritize faithful explanations.

Core claim

The paper claims that naively prompting MLLMs to reason before prediction does not consistently help and can reduce performance on visual persuasiveness tasks, but fine-tuning on diverse teacher-generated rationales improves prediction accuracy. It further claims that a three-dimensional faithfulness evaluation framework, consisting of rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences.

What carries the argument

The three-dimensional faithfulness evaluation framework that measures rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, applied after supervised fine-tuning with diverse teacher rationales.
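Of the three dimensions, rationale-to-decision sensitivity is the most directly operational: counterfactually edit away the rationale's decisive visual evidence and check whether the decision flips. A minimal sketch with stand-in components (the toy `predict` and the set-based "images" are hypothetical; the paper's actual pipeline uses an image-editing model and an MLLM):

```python
def sensitivity_score(examples, predict):
    """Fraction of examples whose decision flips once the rationale's
    decisive visual evidence is edited away (higher = more sensitive)."""
    flips = sum(predict(orig) != predict(edited) for orig, edited in examples)
    return flips / len(examples)

# Toy stand-ins: an "image" is a set of visible elements, and the
# predictor calls an image persuasive iff it shows a smiling face.
predict = lambda image: "smiling_face" in image

examples = [
    ({"smiling_face", "slogan"}, {"slogan"}),              # evidence removed -> decision flips
    ({"smiling_face", "logo"}, {"smiling_face", "logo"}),  # edit missed the evidence -> no flip
]
print(sensitivity_score(examples, predict))  # 0.5
```

A model whose rationales merely decorate a fixed decision would score near zero here, which is exactly the failure mode the framework is designed to expose.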

If this is right

  • Naive chain-of-thought prompting is unreliable for predicting visual persuasion.
  • Supervised fine-tuning with diverse rationales raises both accuracy and the potential for faithful reasoning.
  • High prediction performance does not ensure that rationales faithfully support the model's decisions.
  • Rationale-to-decision sensitivity provides the strongest signal for aligning with human preferences on explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rationale-supervision approach could extend to other multimodal tasks requiring explanations, such as visual question answering.
  • Incorporating the sensitivity metric directly into training objectives might produce models whose explanations better match human expectations.
  • Automated generation of diverse rationales at scale could support larger training datasets without heavy reliance on manual annotation.

Load-bearing premise

That the teacher-generated rationales are diverse and high-quality enough to serve as reliable supervision targets and that the three faithfulness metrics accurately reflect human notions of good reasoning.

What would settle it

A test showing that models with high prediction accuracy but low rationale-to-decision sensitivity scores receive higher human preference ratings than models with high sensitivity scores would falsify the claim that sensitivity best captures faithful reasoning.
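Such a test presupposes a way to score how well a metric tracks human preferences. One simple scoring rule, sketched here under an assumed data layout (not necessarily the paper's protocol), is pairwise agreement: the fraction of human-annotated rationale pairs in which the metric assigns the human-preferred rationale the higher score.

```python
def pairwise_agreement(pairs, scores):
    """Fraction of human preference pairs (winner, loser) in which the
    metric gives the human-preferred rationale a strictly higher score."""
    agree = sum(scores[winner] > scores[loser] for winner, loser in pairs)
    return agree / len(pairs)

# Hypothetical rationale ids with sensitivity scores, plus human votes
# where the first id of each pair was the annotators' preference.
scores = {"r1": 0.9, "r2": 0.4, "r3": 0.7}
pairs = [("r1", "r2"), ("r3", "r2"), ("r2", "r1")]
print(round(pairwise_agreement(pairs, scores), 3))  # 0.667
```

Under this rule, the falsification above amounts to sensitivity scoring lower pairwise agreement with humans than a competing metric or than accuracy alone.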

Figures

Figures reproduced from arXiv: 2605.08965 by Hyunjong Kim, Injin Kong, Naeun Lee, Sunghwan Choi, Yohan Jo.

Figure 1. Example from the PVP dataset. … single ground-truth path: the persuasiveness of an image can vary based on a viewer’s personality traits and values [2], allowing for multiple valid rationales for the same image and message. Can current MLLMs reason effectively and faithfully about visual persuasion? Our analysis suggests they cannot (Sections 3 and 4). Instructing models to generate rationales regarding the …

Figure 2. Prompt design for rationale extraction. Given an image–message–persuasiveness triple, prompts vary along evidence polarity and visual granularity, yielding four rationale types: support-focused global, support-focused local, counter-aware global, and counter-aware local. Evidence Polarity. Evidence polarity specifies whether a rationale uses only evidence supporting the persuasiveness decision or also cons…

Figure 3. Overview of faithfulness evaluation pipelines. (Left) We validate judge-based pipelines for …

Figure 4. The annotation interface used for validating the image editing pipeline.

Figure 5. The annotation interface used for evaluating human preference of generated rationales.

Figure 6. Comparison between faithfulness metrics and human preference.
Original abstract

Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that naively prompting MLLMs to reason before predicting visual persuasiveness does not help and can reduce performance. It shows empirically and theoretically that supervised fine-tuning on diverse teacher-generated rationales improves prediction accuracy. The authors introduce a three-dimensional faithfulness evaluation framework (rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity) and demonstrate that high prediction performance does not guarantee faithful rationales, while rationale-to-decision sensitivity best aligns with human rationale preferences. The work motivates faithfulness-aware training objectives and releases code and dataset.

Significance. If the results hold, the paper makes a useful contribution by identifying limitations of standard reasoning prompts for MLLMs on visual persuasion and by providing a structured faithfulness framework that goes beyond accuracy metrics. The finding that sensitivity correlates most with human judgments is actionable for future work. Public release of code and dataset supports reproducibility and is a clear strength.

major comments (2)
  1. [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.
  2. [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.
minor comments (1)
  1. [Abstract] The abstract states that code and dataset will be made publicly available; the manuscript should include a specific link or repository reference in the camera-ready version.
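The diversity measures requested in major comment 1 are straightforward to compute once rationale embeddings are available. A sketch using NumPy only (the toy vectors stand in for sentence-transformer embeddings, and the cluster labels for a k-means assignment; both are assumptions, not the paper's setup):

```python
import numpy as np

def mean_pairwise_cosine_distance(emb):
    """Average (1 - cosine similarity) over all pairs of rationale
    embeddings; `emb` is an (n, d) array."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(emb), k=1)  # each pair counted once
    return float(np.mean(1.0 - sims[upper]))

def cluster_entropy(labels):
    """Shannon entropy (in nats) of a cluster-assignment distribution,
    e.g. labels produced by k-means over the same embeddings."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy "embeddings"
print(round(mean_pairwise_cosine_distance(emb), 3))   # 0.529
print(round(cluster_entropy([0, 1, 0, 1]), 3))        # ln 2 ≈ 0.693
```

Higher values on both measures would support the claim that the teacher rationales are genuinely diverse rather than paraphrases of one template.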

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the presentation of our results on MLLM reasoning for visual persuasion. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.

    Authors: We agree that explicit quantitative validation of rationale diversity and quality would make the central claim more robust. The current manuscript demonstrates empirical gains from SFT on these rationales but does not report the suggested metrics. In the revised version, we will add (1) average pairwise cosine similarity distances computed over sentence-transformer embeddings of the rationales and (2) entropy over rationale clusters obtained via k-means on the same embeddings. We will also include human quality ratings collected from three independent annotators on a 5-point Likert scale for relevance, coherence, and groundedness, along with inter-annotator agreement measured by Fleiss' kappa. These additions will appear in a new subsection of the experimental results. revision: yes

  2. Referee: [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.

    Authors: We acknowledge that the human correlation analysis would benefit from greater transparency. The manuscript currently reports that sensitivity best aligns with human preferences but provides limited procedural detail. In revision, we will expand the relevant paragraph to state the exact number of annotators, the agreement statistic (Fleiss' kappa), and the bias-mitigation steps employed (randomized item order, blinding to model outputs, and use of a standardized rating interface). These clarifications will confirm that the reported alignment is not an artifact of the evaluation protocol. revision: yes
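The agreement statistic both responses promise, Fleiss' kappa, has a short closed form over an items-by-categories matrix of rating counts. A minimal implementation of the standard formula (shown only to make the proposed statistic concrete; not code from the paper):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts;
    each row must sum to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                # raters per item
    p_j = counts.sum(axis=0) / counts.sum()  # overall category proportions
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return float((p_bar - p_e) / (1.0 - p_e))

# Two items, two categories, three raters in perfect agreement.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Kappa discounts the agreement expected by chance (p_e), which is why it is preferred over raw percent agreement for the annotation studies described above.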

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core claims rest on empirical experiments (prompting MLLMs, SFT with external teacher-generated rationales, and human preference correlations) rather than on any derivation that reduces to self-defined inputs or fitted parameters by construction. The three-dimensional faithfulness framework is introduced as a new evaluation tool and validated externally against human judgments; its definitions (consistency, groundedness, sensitivity) do not equate to the model's own predictions or decisions in a load-bearing circular manner. No self-citation chains, ansatz smuggling, or renaming of known results appear in the abstract or described methodology. The work is checked against external benchmarks such as teacher models and human studies.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract, the work assumes standard supervised fine-tuning validity and that human preferences provide an external ground truth for faithfulness; no explicit free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption Teacher-generated rationales provide higher-quality supervision than model self-generated ones for this task
    Invoked when claiming SFT gains from diverse teacher rationales
  • domain assumption Human rationale preferences serve as a reliable external benchmark for faithfulness
    Used to validate that sensitivity metric aligns best with humans
invented entities (1)
  • Three-dimensional faithfulness evaluation framework (no independent evidence)
    purpose: To measure rationale quality beyond prediction accuracy
    New evaluation construct covering consistency, groundedness, and sensitivity

pith-pipeline@v0.9.0 · 5499 in / 1493 out tokens · 48157 ms · 2026-05-12T02:08:27.034007+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

  1. [1] Daniel Chandler and Rod Munday. A dictionary of media and communication. OUP Oxford, 2011.

  2. [2] Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, and Yohan Jo. PVP: An image dataset for personalized visual persuasion with persuasion strategies, viewer characteristics, and persuasiveness ratings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19209–19237, 2025.

  3. [3] Aysan Aghazadeh and Adriana Kovashka. CAP: Evaluation of persuasive and creative image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16970–16980, 2025.

  4. [4] J Anthony Blair. The possibility and actuality of visual arguments. In Groundwork in the Theory of Argumentation: Selected Papers of J. Anthony Blair, pages 205–223. Springer, 2011.

  5. [5] Jens E Kjeldsen. The study of visual and multimodal argumentation. Argumentation, 29(2):115–132, 2015.

  6. [6] Edward F McQuarrie and David Glen Mick. Visual rhetoric in advertising: Text-interpretive, experimental, and reader-response analyses. Journal of Consumer Research, 26(1):37–54, 1999.

  7. [7] Barbara J Phillips and Edward F McQuarrie. Beyond visual metaphor: A new typology of visual rhetoric in advertising. Marketing Theory, 4(1-2):113–136, 2004.

  8. [8] Jungseock Joo, Weixin Li, Francis F Steen, and Song-Chun Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–223, 2014.

  9. [9] Zhexiong Liu, Meiqi Guo, Yue Dai, and Diane Litman. ImageArg: A multi-modal tweet dataset for image persuasiveness mining, 2022.

  10. [10] David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353–383, 1977.

  11. [11] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.

  12. [12] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, 2023.

  13. [13] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14852–14882, 2023.

  14. [14] Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023.

  15. [15] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.

  16. [16] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.

  17. [17] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020.

  18. [18] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020.

  19. [19] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.

  20. [20] Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, 2024.

  21. [21] Jonathan Baron. Rationality and Intelligence. Cambridge University Press, 2005.

  22. [22] Keith E Stanovich and Richard F West. Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2):342, 1997.

  23. [23] Raymond S Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220, 1998.

  24. [24] Keith E Stanovich, Richard F West, and Maggie E Toplak. Myside bias, rational thinking, and intelligence. Current Directions in Psychological Science, 22(4):259–264, 2013.

  25. [25] Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pages 613–624. International World Wide Web Conferences Steering Committee, April 2016.

  26. [26] Zhexiong Liu, Mohamed Elaraby, Yang Zhong, and Diane Litman. Overview of ImageArg-2023: The first shared task in multimodal argument mining, 2023.

  27. [27] Qwen Team. Qwen2.5-VL, January 2025.

  28. [28] Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas, and Rachel Ward. Phi-4-vision-reasoning technical report. arXiv preprint arXiv:2511.19663, 2026.

  29. [29] Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, and Chang Xu. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation, 2023.

  30. [30] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross-tokenizer distillation: The universal logit distillation loss for LLMs. arXiv preprint arXiv:2402.12030, 2024.

  31. [31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  32. [32] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, et al. OpenAI GPT-5 system card, 2025.

  33. [33] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018.

  34. [34] Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2591–2600, 2019.

  35. [35] Leonard Salewski, A Sophia Koepke, Hendrik PA Lensch, and Zeynep Akata. CLEVR-X: A visual reasoning dataset for natural language explanations. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 69–88. Springer, 2020.

  36. [36] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

  37. [37] Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, 2023.

  38. [38] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, et al. Qwen-Image technical report, 2025.

  39. [39] Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long …

  40. [40] Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, 2021.

  41. [41] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt…

  42. [42] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.

  43. [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

  44. [44] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.

  45. [45] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

  46. [46] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
