pith. machine review for the scientific record.

arxiv: 2605.08965 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual persuasion · reasoning faithfulness · supervised fine-tuning · persuasiveness prediction · rationale evaluation · explanation consistency

The pith

Diverse teacher-generated rationales improve MLLM performance on visual persuasiveness prediction and support better evaluation of faithful reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models struggle when prompted to reason before deciding whether an image is persuasive, and such prompts can sometimes lower accuracy. The paper shows that supervised fine-tuning on diverse rationales produced by a teacher model raises prediction performance on this task. It introduces a three-dimensional faithfulness framework to check whether explanations are consistent with the model's decision, grounded in the image, and sensitive to changes in the decision. Results indicate that strong prediction scores do not ensure the explanations are faithful to the model's process. Among the three dimensions, rationale-to-decision sensitivity aligns most closely with human judgments of good reasoning, motivating new training methods that prioritize faithful explanations.

Core claim

The paper claims that naively prompting MLLMs to reason before prediction does not consistently help and can reduce performance on visual persuasiveness tasks, but fine-tuning on diverse teacher-generated rationales improves prediction accuracy. It further claims that a three-dimensional faithfulness evaluation framework, consisting of rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences.

What carries the argument

The three-dimensional faithfulness evaluation framework that measures rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, applied after supervised fine-tuning with diverse teacher rationales.
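Of the three dimensions, rationale-to-decision sensitivity is the most directly operational: counterfactually edit away the rationale's decisive visual evidence and check whether the decision flips. A minimal sketch with stand-in components (the toy `predict` and the set-based "images" are hypothetical; the paper's actual pipeline uses an image-editing model and an MLLM):

```python
def sensitivity_score(examples, predict):
    """Fraction of examples whose decision flips once the rationale's
    decisive visual evidence is edited away (higher = more sensitive)."""
    flips = sum(predict(orig) != predict(edited) for orig, edited in examples)
    return flips / len(examples)

# Toy stand-ins: an "image" is a set of visible elements, and the
# predictor calls an image persuasive iff it shows a smiling face.
predict = lambda image: "smiling_face" in image

examples = [
    ({"smiling_face", "slogan"}, {"slogan"}),              # evidence removed -> decision flips
    ({"smiling_face", "logo"}, {"smiling_face", "logo"}),  # edit missed the evidence -> no flip
]
print(sensitivity_score(examples, predict))  # 0.5
```

A model whose rationales merely decorate a fixed decision would score near zero here, which is exactly the failure mode the framework is designed to expose.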

If this is right

  • Naive chain-of-thought prompting is unreliable for predicting visual persuasion.
  • Supervised fine-tuning with diverse rationales raises both accuracy and the potential for faithful reasoning.
  • High prediction performance does not ensure that rationales faithfully support the model's decisions.
  • Rationale-to-decision sensitivity provides the strongest signal for aligning with human preferences on explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rationale-supervision approach could extend to other multimodal tasks requiring explanations, such as visual question answering.
  • Incorporating the sensitivity metric directly into training objectives might produce models whose explanations better match human expectations.
  • Automated generation of diverse rationales at scale could support larger training datasets without heavy reliance on manual annotation.

Load-bearing premise

That the teacher-generated rationales are diverse and high-quality enough to serve as reliable supervision targets and that the three faithfulness metrics accurately reflect human notions of good reasoning.

What would settle it

A test showing that models with high prediction accuracy but low rationale-to-decision sensitivity scores receive higher human preference ratings than models with high sensitivity scores would falsify the claim that sensitivity best captures faithful reasoning.
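Such a test presupposes a way to score how well a metric tracks human preferences. One simple scoring rule, sketched here under an assumed data layout (not necessarily the paper's protocol), is pairwise agreement: the fraction of human-annotated rationale pairs in which the metric assigns the human-preferred rationale the higher score.

```python
def pairwise_agreement(pairs, scores):
    """Fraction of human preference pairs (winner, loser) in which the
    metric gives the human-preferred rationale a strictly higher score."""
    agree = sum(scores[winner] > scores[loser] for winner, loser in pairs)
    return agree / len(pairs)

# Hypothetical rationale ids with sensitivity scores, plus human votes
# where the first id of each pair was the annotators' preference.
scores = {"r1": 0.9, "r2": 0.4, "r3": 0.7}
pairs = [("r1", "r2"), ("r3", "r2"), ("r2", "r1")]
print(round(pairwise_agreement(pairs, scores), 3))  # 0.667
```

Under this rule, the falsification above amounts to sensitivity scoring lower pairwise agreement with humans than a competing metric or than accuracy alone.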

Figures

Figures reproduced from arXiv: 2605.08965 by Hyunjong Kim, Injin Kong, Naeun Lee, Sunghwan Choi, Yohan Jo.

Figure 1. Example from the PVP dataset. … single ground-truth path: the persuasiveness of an image can vary based on a viewer’s personality traits and values [2], allowing for multiple valid rationales for the same image and message. Can current MLLMs reason effectively and faithfully about visual persuasion? Our analysis suggests they cannot (Sections 3 and 4). Instructing models to generate rationales regarding the …

Figure 2. Prompt design for rationale extraction. Given an image–message–persuasiveness triple, prompts vary along evidence polarity and visual granularity, yielding four rationale types: support-focused global, support-focused local, counter-aware global, and counter-aware local. Evidence Polarity. Evidence polarity specifies whether a rationale uses only evidence supporting the persuasiveness decision or also cons…

Figure 3. Overview of faithfulness evaluation pipelines. (Left) We validate judge-based pipelines for …

Figure 4. The annotation interface used for validating the image editing pipeline.

Figure 5. The annotation interface used for evaluating human preference of generated rationales.

Figure 6. Comparison between faithfulness metrics and human preference.
Original abstract

Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that naively prompting MLLMs to reason before predicting visual persuasiveness does not help and can reduce performance. It shows empirically and theoretically that supervised fine-tuning on diverse teacher-generated rationales improves prediction accuracy. The authors introduce a three-dimensional faithfulness evaluation framework (rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity) and demonstrate that high prediction performance does not guarantee faithful rationales, while rationale-to-decision sensitivity best aligns with human rationale preferences. The work motivates faithfulness-aware training objectives and releases code and dataset.

Significance. If the results hold, the paper makes a useful contribution by identifying limitations of standard reasoning prompts for MLLMs on visual persuasion and by providing a structured faithfulness framework that goes beyond accuracy metrics. The finding that sensitivity correlates most with human judgments is actionable for future work. Public release of code and dataset supports reproducibility and is a clear strength.

major comments (2)
  1. [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.
  2. [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.
minor comments (1)
  1. [Abstract] The abstract states that code and dataset will be made publicly available; the manuscript should include a specific link or repository reference in the camera-ready version.
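The diversity measures requested in major comment 1 are straightforward to compute once rationale embeddings are available. A sketch using NumPy only (the toy vectors stand in for sentence-transformer embeddings, and the cluster labels for a k-means assignment; both are assumptions, not the paper's setup):

```python
import numpy as np

def mean_pairwise_cosine_distance(emb):
    """Average (1 - cosine similarity) over all pairs of rationale
    embeddings; `emb` is an (n, d) array."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(emb), k=1)  # each pair counted once
    return float(np.mean(1.0 - sims[upper]))

def cluster_entropy(labels):
    """Shannon entropy (in nats) of a cluster-assignment distribution,
    e.g. labels produced by k-means over the same embeddings."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy "embeddings"
print(round(mean_pairwise_cosine_distance(emb), 3))   # 0.529
print(round(cluster_entropy([0, 1, 0, 1]), 3))        # ln 2 ≈ 0.693
```

Higher values on both measures would support the claim that the teacher rationales are genuinely diverse rather than paraphrases of one template.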

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the presentation of our results on MLLM reasoning for visual persuasion. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.

    Authors: We agree that explicit quantitative validation of rationale diversity and quality would make the central claim more robust. The current manuscript demonstrates empirical gains from SFT on these rationales but does not report the suggested metrics. In the revised version, we will add (1) average pairwise cosine similarity distances computed over sentence-transformer embeddings of the rationales and (2) entropy over rationale clusters obtained via k-means on the same embeddings. We will also include human quality ratings collected from three independent annotators on a 5-point Likert scale for relevance, coherence, and groundedness, along with inter-annotator agreement measured by Fleiss' kappa. These additions will appear in a new subsection of the experimental results. revision: yes

  2. Referee: [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.

    Authors: We acknowledge that the human correlation analysis would benefit from greater transparency. The manuscript currently reports that sensitivity best aligns with human preferences but provides limited procedural detail. In revision, we will expand the relevant paragraph to state the exact number of annotators, the agreement statistic (Fleiss' kappa), and the bias-mitigation steps employed (randomized item order, blinding to model outputs, and use of a standardized rating interface). These clarifications will confirm that the reported alignment is not an artifact of the evaluation protocol. revision: yes
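The agreement statistic both responses promise, Fleiss' kappa, has a short closed form over an items-by-categories matrix of rating counts. A minimal implementation of the standard formula (shown only to make the proposed statistic concrete; not code from the paper):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts;
    each row must sum to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                # raters per item
    p_j = counts.sum(axis=0) / counts.sum()  # overall category proportions
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return float((p_bar - p_e) / (1.0 - p_e))

# Two items, two categories, three raters in perfect agreement.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Kappa discounts the agreement expected by chance (p_e), which is why it is preferred over raw percent agreement for the annotation studies described above.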

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core claims rest on empirical experiments (prompting MLLMs, SFT with external teacher-generated rationales, and human preference correlations) rather than on any derivation that reduces to self-defined inputs or fitted parameters by construction. The three-dimensional faithfulness framework is introduced as a new evaluation tool and validated externally against human judgments; its definitions (consistency, groundedness, sensitivity) do not equate to the model's own predictions or decisions in a load-bearing circular manner. No self-citation chains, ansatz smuggling, or renaming of known results appear in the abstract or described methodology. The work is checked against external benchmarks such as teacher models and human studies.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract, the work assumes standard supervised fine-tuning validity and that human preferences provide an external ground truth for faithfulness; no explicit free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption Teacher-generated rationales provide higher-quality supervision than model self-generated ones for this task
    Invoked when claiming SFT gains from diverse teacher rationales
  • domain assumption Human rationale preferences serve as a reliable external benchmark for faithfulness
    Used to validate that sensitivity metric aligns best with humans
invented entities (1)
  • Three-dimensional faithfulness evaluation framework (no independent evidence)
    purpose: To measure rationale quality beyond prediction accuracy
    New evaluation construct covering consistency, groundedness, and sensitivity

pith-pipeline@v0.9.0 · 5499 in / 1493 out tokens · 48157 ms · 2026-05-12T02:08:27.034007+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

  1. [1] Daniel Chandler and Rod Munday. A dictionary of media and communication. OUP Oxford, 2011.

  2. [2] Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, and Yohan Jo. PVP: An image dataset for personalized visual persuasion with persuasion strategies, viewer characteristics, and persuasiveness ratings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19209–19237, 2025.

  3. [3] Aysan Aghazadeh and Adriana Kovashka. CAP: Evaluation of persuasive and creative image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16970–16980, 2025.

  4. [4] J Anthony Blair. The possibility and actuality of visual arguments. In Groundwork in the Theory of Argumentation: Selected Papers of J. Anthony Blair, pages 205–223. Springer, 2011.

  5. [5] Jens E Kjeldsen. The study of visual and multimodal argumentation. Argumentation, 29(2):115–132, 2015.

  6. [6] Edward F McQuarrie and David Glen Mick. Visual rhetoric in advertising: Text-interpretive, experimental, and reader-response analyses. Journal of Consumer Research, 26(1):37–54, 1999.

  7. [7] Barbara J Phillips and Edward F McQuarrie. Beyond visual metaphor: A new typology of visual rhetoric in advertising. Marketing Theory, 4(1-2):113–136, 2004.

  8. [8] Jungseock Joo, Weixin Li, Francis F Steen, and Song-Chun Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–223, 2014.

  9. [9] Zhexiong Liu, Meiqi Guo, Yue Dai, and Diane Litman. ImageArg: A multi-modal tweet dataset for image persuasiveness mining, 2022.

  10. [10] David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353–383, 1977.

  11. [11] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.

  12. [12] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, 2023.

  13. [13] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14852–14882, 2023.

  14. [14] Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023.

  15. [15] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.

  16. [16] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.

  17. [17] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020.

  18. [18] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020.

  19. [19] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.

  20. [20] Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, 2024.

  21. [21] Jonathan Baron. Rationality and Intelligence. Cambridge University Press, 2005.

  22. [22] Keith E Stanovich and Richard F West. Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2):342, 1997.

  23. [23] Raymond S Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220, 1998.

  24. [24] Keith E Stanovich, Richard F West, and Maggie E Toplak. Myside bias, rational thinking, and intelligence. Current Directions in Psychological Science, 22(4):259–264, 2013.

  25. [25] Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pages 613–624. International World Wide Web Conferences Steering Committee, April 2016.

  26. [26] Zhexiong Liu, Mohamed Elaraby, Yang Zhong, and Diane Litman. Overview of ImageArg-2023: The first shared task in multimodal argument mining, 2023.

  27. [27] Qwen Team. Qwen2.5-VL, January 2025.

  28. [28] Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas, and Rachel Ward. Phi-4-vision-reasoning technical report. arXiv preprint arXiv:2511.19663, 2026.

  29. [29] Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, and Chang Xu. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation, 2023.

  30. [30] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross-tokenizer distillation: The universal logit distillation loss for LLMs. arXiv preprint arXiv:2402.12030, 2024.

  31. [31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  32. [32] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, et al. OpenAI GPT-5 system card, 2025.

  33. [33] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018.

  34. [34] Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2591–2600, 2019.

  35. [35] Leonard Salewski, A Sophia Koepke, Hendrik PA Lensch, and Zeynep Akata. CLEVR-X: A visual reasoning dataset for natural language explanations. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 69–88. Springer, 2020.

  36. [36] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

  37. [37] Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, 2023.

  38. [38] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, et al. Qwen-Image technical report, 2025.

  39. [39] Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long …

  40. [40] Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, 2021.

  41. [41] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt…

  42. [42] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.

  43. [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

  44. [44] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.

  45. [45] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

  46. [46] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
