Recognition: 2 theorem links
Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning
Pith reviewed 2026-05-12 02:08 UTC · model grok-4.3
The pith
Diverse teacher-generated rationales improve MLLM performance on visual persuasiveness prediction and support better evaluation of faithful reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that naively prompting MLLMs to reason before prediction does not consistently help and can reduce performance on visual persuasiveness tasks, but fine-tuning on diverse teacher-generated rationales improves prediction accuracy. It further claims that a three-dimensional faithfulness evaluation framework, consisting of rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences.
What carries the argument
The three-dimensional faithfulness evaluation framework that measures rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity, applied after supervised fine-tuning with diverse teacher rationales.
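Of the three dimensions, rationale-to-decision sensitivity is the one the review singles out as best aligned with human preferences. As a rough illustration of what such a probe can look like (not the paper's implementation), the Python sketch below negates a rationale's decisive claim and re-queries the model; `query_mllm` and `invert_claim` are hypothetical placeholders, and the prompt wording is an assumption.

```python
# Illustrative sketch of a rationale-to-decision sensitivity probe.
# Both helpers below are hypothetical stand-ins, not the paper's API.

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to the MLLM under evaluation; returns 'yes' or 'no'."""
    raise NotImplementedError

def invert_claim(rationale: str) -> str:
    """Placeholder: rewrite the rationale so its decisive claim is negated."""
    raise NotImplementedError

def is_sensitive(image_path: str, rationale: str, decision: str) -> bool:
    """True if negating the rationale's key claim flips the model's decision.

    A model whose decisions genuinely depend on its stated reasoning should
    change its persuasiveness prediction when that reasoning is reversed; an
    unchanged answer suggests the rationale is not load-bearing.
    """
    perturbed = invert_claim(rationale)
    prompt = (
        "Consider the following reasoning about the image, then decide "
        f"whether the image is persuasive. Reasoning: {perturbed} "
        "Answer yes or no."
    )
    answer = query_mllm(image_path, prompt).strip().lower()
    return answer != decision.strip().lower()
```

Averaging `is_sensitive` over a test set yields a model-level sensitivity score that can then be correlated with human rationale preferences.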
If this is right
- Naive chain-of-thought prompting is unreliable for predicting visual persuasion.
- Supervised fine-tuning with diverse rationales raises both accuracy and the potential for faithful reasoning.
- High prediction performance does not ensure that rationales faithfully support the model's decisions.
- Rationale-to-decision sensitivity provides the strongest signal for aligning with human preferences on explanations.
Where Pith is reading between the lines
- The same rationale-supervision approach could extend to other multimodal tasks requiring explanations, such as visual question answering.
- Incorporating the sensitivity metric directly into training objectives might produce models whose explanations better match human expectations.
- Automated generation of diverse rationales at scale could support larger training datasets without heavy reliance on manual annotation.
Load-bearing premise
That the teacher-generated rationales are diverse and of high enough quality to serve as reliable supervision targets, and that the three faithfulness metrics accurately reflect human notions of good reasoning.
What would settle it
A test showing that models with high prediction accuracy but low rationale-to-decision sensitivity scores receive higher human preference ratings than models with high sensitivity scores would falsify the claim that sensitivity best captures faithful reasoning.
original abstract
Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that naively prompting MLLMs to reason before predicting visual persuasiveness does not help and can reduce performance. It shows empirically and theoretically that supervised fine-tuning on diverse teacher-generated rationales improves prediction accuracy. The authors introduce a three-dimensional faithfulness evaluation framework (rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity) and demonstrate that high prediction performance does not guarantee faithful rationales, while rationale-to-decision sensitivity best aligns with human rationale preferences. The work motivates faithfulness-aware training objectives and promises public release of code and dataset.
Significance. If the results hold, the paper makes a useful contribution by identifying limitations of standard reasoning prompts for MLLMs on visual persuasion and by providing a structured faithfulness framework that goes beyond accuracy metrics. The finding that sensitivity correlates most with human judgments is actionable for future work. Public release of code and dataset supports reproducibility and is a clear strength.
major comments (2)
- [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.
- [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.
minor comments (1)
- [Abstract] The abstract states that code and dataset will be made publicly available; the manuscript should include a specific link or repository reference in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the presentation of our results on MLLM reasoning for visual persuasion. We address each major comment below and indicate the corresponding revisions.
point-by-point responses
- Referee: [Experimental results section] The central claim that diverse teacher-generated rationales improve SFT performance rests on the assumption that these rationales are sufficiently diverse and high-quality. The manuscript should include explicit quantitative measures of diversity (e.g., pairwise semantic distances or entropy over rationale embeddings) and human quality ratings with inter-annotator agreement to validate this assumption.
  Authors: We agree that explicit quantitative validation of rationale diversity and quality would make the central claim more robust. The current manuscript demonstrates empirical gains from SFT on these rationales but does not report the suggested metrics. In the revised version, we will add (1) average pairwise cosine distances computed over sentence-transformer embeddings of the rationales and (2) entropy over rationale clusters obtained via k-means on the same embeddings; a sketch of both measures appears after this list. We will also include human quality ratings collected from three independent annotators on a 5-point Likert scale for relevance, coherence, and groundedness, along with inter-annotator agreement measured by Fleiss' kappa. These additions will appear in a new subsection of the experimental results. revision: yes
- Referee: [Faithfulness evaluation framework section] The three-dimensional faithfulness framework, particularly the rationale-to-decision sensitivity metric, is shown to best match human preferences. However, the human correlation study requires more detail on the number of annotators, agreement statistics, and controls for bias to ensure the alignment claim is robust and not dependent on the specific evaluation setup.
  Authors: We acknowledge that the human correlation analysis would benefit from greater transparency. The manuscript currently reports that sensitivity best aligns with human preferences but provides limited procedural detail. In revision, we will expand the relevant paragraph to state the exact number of annotators, the agreement statistic (Fleiss' kappa, sketched after this list), and the bias-mitigation steps employed (randomized item order, blinding to model outputs, and use of a standardized rating interface). These clarifications will show that the reported alignment is not an artifact of the evaluation protocol. revision: yes
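The two diversity measures proposed in the first response can be made concrete in a few lines. The sketch below is illustrative rather than the manuscript's implementation: the sentence-transformer model name and the cluster count k are assumptions.

```python
# Illustrative sketch of the two rationale-diversity measures: mean pairwise
# cosine distance over sentence embeddings, and entropy of k-means clusters.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def rationale_diversity(rationales: list[str], k: int = 10) -> tuple[float, float]:
    """Return (mean pairwise cosine distance, cluster entropy) for a rationale set."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    emb = model.encode(rationales)  # (n, d) sentence embeddings

    # (1) Mean pairwise cosine distance: higher means semantically more diverse.
    sim = cosine_similarity(emb)
    n = len(rationales)
    mean_dist = float((1.0 - sim[np.triu_indices(n, k=1)]).mean())

    # (2) Entropy of k-means cluster assignments: higher means rationales are
    # spread evenly across semantic clusters rather than collapsing into one.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    p = np.bincount(labels, minlength=k) / n
    entropy = float(-(p[p > 0] * np.log(p[p > 0])).sum())
    return mean_dist, entropy
```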
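Both responses name Fleiss' kappa as the inter-annotator agreement statistic. For reference, a self-contained sketch of the standard computation follows; the toy counts are illustrative only.

```python
# Fleiss' kappa from an (items x categories) matrix of rating counts,
# where each row sums to the number of annotators.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for N items rated into C categories by n annotators each."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    n = counts.sum(axis=1)[0]                      # annotators per item
    p_j = counts.sum(axis=0) / (N * n)             # marginal category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), float(np.square(p_j).sum())
    return float((P_bar - P_e) / (1 - P_e))

# Toy example: 3 annotators rating 4 items as yes/no (kappa = 1/3 here).
votes = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
print(f"Fleiss' kappa = {fleiss_kappa(votes):.3f}")
```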
Circularity Check
No significant circularity identified
full rationale
The paper's core claims rest on empirical experiments (prompting MLLMs, SFT with external teacher-generated rationales, and human preference correlations) rather than any derivation that reduces to self-defined inputs or fitted parameters by construction. The three-dimensional faithfulness framework is introduced as a new evaluation tool and validated externally against human judgments; its definitions (consistency, groundedness, sensitivity) do not equate to the model's own predictions or decisions in a load-bearing circular manner. No self-citation chains, ansatz smuggling, or renaming of known results appear in the abstract or described methodology. The work is validated against external benchmarks such as teacher models and human studies.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Teacher-generated rationales provide higher-quality supervision than model self-generated ones for this task.
- domain assumption: Human rationale preferences serve as a reliable external benchmark for faithfulness.
invented entities (1)
- Three-dimensional faithfulness evaluation framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "We first show that prompting MLLMs to reason before prediction does not consistently help... introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Daniel Chandler and Rod Munday. A Dictionary of Media and Communication. OUP Oxford, 2011.
- [2] Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, and Yohan Jo. PVP: An image dataset for personalized visual persuasion with persuasion strategies, viewer characteristics, and persuasiveness ratings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19209–19237, 2025.
- [3] Aysan Aghazadeh and Adriana Kovashka. CAP: Evaluation of persuasive and creative image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16970–16980, 2025.
- [4] J. Anthony Blair. The possibility and actuality of visual arguments. In Groundwork in the Theory of Argumentation: Selected Papers of J. Anthony Blair, pages 205–223. Springer, 2011.
- [5] Jens E. Kjeldsen. The study of visual and multimodal argumentation. Argumentation, 29(2):115–132, 2015.
- [6] Edward F. McQuarrie and David Glen Mick. Visual rhetoric in advertising: Text-interpretive, experimental, and reader-response analyses. Journal of Consumer Research, 26(1):37–54, 1999.
- [7] Barbara J. Phillips and Edward F. McQuarrie. Beyond visual metaphor: A new typology of visual rhetoric in advertising. Marketing Theory, 4(1-2):113–136, 2004.
- [8] Jungseock Joo, Weixin Li, Francis F. Steen, and Song-Chun Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–223, 2014.
- [9] Zhexiong Liu, Meiqi Guo, Yue Dai, and Diane Litman. ImageArg: A multi-modal tweet dataset for image persuasiveness mining, 2022.
- [10] David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353–383, 1977.
- [11] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.
- [12] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, 2023.
- [13] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14852–14882, 2023.
- [14] Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023.
- [15] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- [16] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.
- [17] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020.
- [18] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, 2020.
- [19] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
- [20] Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, 2024.
- [21] Jonathan Baron. Rationality and Intelligence. Cambridge University Press, 2005.
- [22] Keith E. Stanovich and Richard F. West. Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2):342, 1997.
- [23] Raymond S. Nickerson. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220, 1998.
- [24] Keith E. Stanovich, Richard F. West, and Maggie E. Toplak. Myside bias, rational thinking, and intelligence. Current Directions in Psychological Science, 22(4):259–264, 2013.
- [25] Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web (WWW '16), pages 613–624. International World Wide Web Conferences Steering Committee, April 2016.
- [26] Zhexiong Liu, Mohamed Elaraby, Yang Zhong, and Diane Litman. Overview of ImageArg-2023: The first shared task in multimodal argument mining, 2023.
- [27]
- [28] Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas, and Rachel Ward. Phi-4-vision-reasoning technical report. arXiv preprint arXiv:2511.19663, 2026.
- [29] Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, and Chang Xu. One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation, 2023.
- [30] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross-tokenizer distillation: The universal logit distillation loss for LLMs. arXiv preprint arXiv:2402.12030, 2024.
- [31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [32] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, et al. OpenAI GPT-5 system card, 2025.
- [33] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018.
- [34] Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2591–2600, 2019.
- [35] Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, and Zeynep Akata. CLEVR-X: A visual reasoning dataset for natural language explanations. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pages 69–88. Springer, 2020.
- [36] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- [37] Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, 2023.
- [38] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, et al. Qwen-Image technical report, 2025.
- [39] Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that's 'human' is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.
- [40] Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, 2021.
- [41] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
- [42] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.
- [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
- [44] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
- [45] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- [46] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
- [47–65] Not references: these entries are fragments of the paper's appendices misparsed by the reference extractor. They cover the annotator score-normalization procedure (converting each annotator's raw scores into binary votes relative to that annotator's own score distribution), the groundedness evaluation baselines compared in Table 13 (CLIP similarity with the dataset's image description, CLIP similarity with the image itself, GPT-5 atomic-fact verification, a calibrated variant thresholding the Nyes/Ntotal ratio, and direct zero-shot GPT-5 judging), and the prompt templates for the groundedness evaluation and the remove/modify/add sensitivity edits.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.