
arxiv: 2505.20214 · v2 · submitted 2025-05-26 · 💻 cs.AI

When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords inverse scaling law · truthfulness · multimodal reasoning · slow-thinking · depth-first search · breadth-first search · ambiguous inputs · reasoning models

The pith

Slow-thinking multimodal models fabricate more false details than fast ones when visuals are incomplete or misleading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that slower reasoning does not always produce more truthful answers in tasks that combine images and text. When inputs are ambiguous or misleading, models that engage in extended thinking tend to construct plausible-sounding but incorrect justifications for their conclusions. The authors build a dataset of 5,000 hierarchical prompts of progressively increasing complexity, annotated by 50 human participants, and report the same pattern across models. A reader might care because the result questions whether spending more compute on reasoning improves accuracy in uncertain real-world settings such as visual question answering.

Core claim

Slow-thinking models are more prone to fabricating plausible yet false details to justify untruthful reasoning when confronted with incomplete or misleading visual inputs. Analysis of a 5,000-sample hierarchical prompt dataset reveals that slower reasoning models tend to follow depth-first search thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search inference and show greater caution under uncertainty.

What carries the argument

The inverse scaling law of truthfulness in multimodal reasoning, manifested through depth-first search in slow models versus breadth-first in fast models.
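
To make the DFS versus BFS contrast concrete, here is a toy sketch in Python. It is only an analogy for the behavior described above, not the paper's method: the hypothesis tree, the `GROUNDED` set, and both function names are hypothetical and chosen for illustration.

```python
from collections import deque

# Hypothetical hypothesis tree for a blurry image: claim -> follow-up claims.
CHILDREN = {
    "root": ["blurry shape is a dog", "cannot tell what the shape is"],
    "blurry shape is a dog": ["the dog wears a red collar"],
    "the dog wears a red collar": ["the owner is nearby"],
    "the owner is nearby": [],
    "cannot tell what the shape is": [],
}
# Claims that the (imagined) image actually supports.
GROUNDED = {"root", "cannot tell what the shape is"}


def dfs_trace(start):
    """Commit to the first branch and elaborate it to the end."""
    trace, node = [], start
    while CHILDREN[node]:
        node = CHILDREN[node][0]
        trace.append(node)
    return trace


def bfs_trace(start):
    """Survey hypotheses level by level and stop at the first grounded one."""
    queue = deque(CHILDREN[start])
    trace = []
    while queue:
        node = queue.popleft()
        trace.append(node)
        if node in GROUNDED:
            return trace
        queue.extend(CHILDREN[node])
    return trace


def ungrounded(trace):
    """Claims in the trace that the image does not support."""
    return [claim for claim in trace if claim not in GROUNDED]


dfs = dfs_trace("root")
bfs = bfs_trace("root")
print("DFS ends on:", dfs[-1], "| ungrounded steps:", len(ungrounded(dfs)))
print("BFS ends on:", bfs[-1], "| ungrounded steps:", len(ungrounded(bfs)))
```

In this toy tree the depth-first trace stacks three unsupported claims on one flawed premise, while the breadth-first trace considers that premise once and settles on the cautious answer.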

If this is right

  • Slow-thinking models are fragile when facing ambiguous multimodal inputs despite success in structured domains.
  • DFS reasoning causes persistent exploration of flawed premises.
  • Faster models demonstrate greater caution under uncertainty.
  • The vulnerability highlights a need for better handling of incomplete visual data in reasoning systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI developers may benefit from capping reasoning steps for uncertain visual inputs to avoid fabricated justifications.
  • The finding raises questions about the general reliability of increased computation in ambiguous settings.
  • Further experiments could test if this inverse law appears in other modalities or tasks.

Load-bearing premise

The human-annotated 5,000-sample hierarchical prompt dataset successfully isolates the effect of reasoning depth on truthfulness without confounding from prompt design or model differences.

What would settle it

If a new evaluation with standardized prompts and known ground truth shows no increase in fabricated details from slow-thinking models compared to fast ones, the inverse scaling claim would be challenged.

Figures

Figures reproduced from arXiv: 2505.20214 by Chi-Min Chan, Jiahao Li, Jiaming Ji, Juntao Dai, Sirui Han, Sitong Fang, Wenjing Cao, Xuyao Wang, Yaodong Yang, Yike Guo.

Figure 1. Evaluation landscape of MLLMs on TRUTHFULVQA. We normalized the inverse variance of accuracy across levels, capturing stability under misleading hierarchical prompts. The performance of chat models scales with model size. However, performance declines when models are fine-tuned for slower reasoning, revealing an inverse scaling trend in multimodal reasoning.
Figure 2. Overview and pipeline of the hierarchical TRUTHFULVQA framework. The dataset was constructed with contributions from 50 human annotators. Images gathered from online sources are paired with hierarchically structured, human-written question sets designed to probe multiple forms of untruthfulness.
Figure 3. Comprehensive evaluation of 50+ models on TRUTHFULVQA. a. Grouped bar chart comparing the normalized LAL across level transitions for three pairs of reasoning and chat models. b. Violin plots of model accuracy at the three levels, illustrating the distribution within each level. c. Density map of mean versus variance of model accuracy across the three levels. d. Heat map of mainstream MLLMs’ accuracy acros…
Figure 4. Quantitative experiments on the thinking paradigm. a. Expected Calibration Error (ECE) of nine models, sorted ascending. Each dot marks one model; light blue denotes chat models and dark blue denotes reasoning models. The shaded band groups the reasoning models, and the value of each model’s ECE is printed next to its dot (lower is better). b. Accuracy of five chat models under standard prompting (light bl…
Figure 5. Win rate. Comparison of GPT-4o, Llama4-Maverick and Qwen2.5-VL-72B, evaluated by Gemini-1.5, TruthfulJudge and human.
Figure 6. The Classification of TRUTHFULVQA.
Figure 7. Interface for pairwise preference annotation.
Figure 8. Interface for absolute scoring annotation.
Figure 9. Pairwise win rates of mainstream MLLMs.
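
The Figure 4 caption reports Expected Calibration Error (ECE). For readers unfamiliar with the metric, the following is a minimal Python sketch of the standard equal-width binned ECE. The bin count and the function name are assumptions; this page does not state which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct:     booleans, True where the prediction was right.
    n_bins:      number of equal-width bins (an assumed default).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical example: a model that says 0.9 but is right 3 times out of 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means stated confidence tracks observed accuracy more closely, which matches the caption's "lower is better".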
Original abstract

Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers? Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify untruthful reasoning. To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty. These findings reveal a critical vulnerability of reasoning models: while effective in structured domains such as math, their DFS-style reasoning becomes fragile when confronted with ambiguous, multimodal inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims an inverse scaling law of truthfulness in multimodal reasoning: slow-thinking (reasoning) models are more prone than fast-thinking (chat) models to fabricating plausible but false details when visual inputs are incomplete or misleading. The authors support this via a new 5,000-sample hierarchical prompt dataset annotated by 50 humans, observing that reasoning models follow depth-first search (DFS) patterns that persist on flawed premises while chat models use breadth-first search (BFS) and exhibit greater caution under uncertainty.

Significance. If the central empirical pattern holds after methodological verification, the result would be significant for multimodal AI and reasoning model design. It challenges the assumption that deeper reasoning improves reliability and identifies a concrete vulnerability (DFS-style persistence on ambiguous inputs) that could inform safer model development. The scale of the human-annotated hierarchical dataset is a clear strength for behavioral analysis.

major comments (3)
  1. [Methods] Methods section: The 5,000-sample dataset is central to the inverse scaling claim, yet the manuscript provides no details on inter-annotator agreement (e.g., Fleiss' kappa), resolution of disagreements among the 50 annotators, or statistical controls for prompt difficulty and linguistic confounds. This leaves the ground-truth truthfulness labels unverified.
  2. [Experimental Setup] Experimental Setup: The comparison between reasoning models and chat models does not control for post-training differences (e.g., CoT-specific fine-tuning versus standard alignment). Without matched base models or ablation of training regime, it is unclear whether the observed fabrication tendency arises from reasoning depth or from training disparities.
  3. [Results] Results: The reported consistent pattern across complexity levels lacks statistical significance tests, confidence intervals, or baseline comparisons against non-reasoning multimodal models. This weakens the load-bearing claim that slower thinking produces reliably lower truthfulness.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'inverse scaling law' is introduced without a formal definition or explicit contrast to existing scaling literature on reasoning or truthfulness.
  2. [Introduction] Introduction: The mapping of model behavior to DFS versus BFS search should be clarified with concrete examples of inference traces to avoid conflation with classical search algorithms.
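
The first major comment asks for inter-annotator agreement such as Fleiss' kappa. As a reference point, here is a minimal Python sketch of that statistic. It assumes every item receives the same number of ratings, which may not hold for a pool of 50 annotators, and the example counts are invented for illustration, not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 3 raters, categories (truthful, untruthful).
print(fleiss_kappa([[3, 0], [0, 3], [2, 1]]))
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance.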

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for strengthening our manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Methods] Methods section: The 5,000-sample dataset is central to the inverse scaling claim, yet the manuscript provides no details on inter-annotator agreement (e.g., Fleiss' kappa), resolution of disagreements among the 50 annotators, or statistical controls for prompt difficulty and linguistic confounds. This leaves the ground-truth truthfulness labels unverified.

    Authors: We agree that these methodological details are necessary to establish the reliability of the annotations. In the revised manuscript, we will add a dedicated subsection in the Methods section reporting Fleiss' kappa for inter-annotator agreement, describing the disagreement resolution process (initial independent annotation followed by group discussion for consensus), and including regression-based controls for prompt difficulty and linguistic confounds. These additions will directly address the verification of ground-truth labels. revision: yes

  2. Referee: [Experimental Setup] Experimental Setup: The comparison between reasoning models and chat models does not control for post-training differences (e.g., CoT-specific fine-tuning versus standard alignment). Without matched base models or ablation of training regime, it is unclear whether the observed fabrication tendency arises from reasoning depth or from training disparities.

    Authors: This is a substantive concern regarding potential confounds. Our study compares representative deployed models embodying the slow-thinking versus fast-thinking paradigms, but we recognize that differences in post-training may play a role. We will revise the Experimental Setup and Limitations sections to explicitly discuss this issue and clarify that isolating the effect would require controlled ablations on open-source models, which we note as an important direction for future work. We do not claim the results isolate reasoning depth from all training factors. revision: partial

  3. Referee: [Results] Results: The reported consistent pattern across complexity levels lacks statistical significance tests, confidence intervals, or baseline comparisons against non-reasoning multimodal models. This weakens the load-bearing claim that slower thinking produces reliably lower truthfulness.

    Authors: We concur that additional statistical rigor and baselines would strengthen the presentation of results. In the revised Results section, we will report statistical significance tests (including p-values from appropriate tests such as chi-squared or ANOVA across complexity levels), 95% confidence intervals for key metrics, and comparisons against additional non-reasoning multimodal baselines (e.g., standard vision-language models without explicit reasoning). These changes will better support the central claim. revision: yes
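
    The kind of analysis promised in this response can be sketched with the Python standard library alone. The sketch below computes a Wilson 95% confidence interval for one accuracy and a Pearson chi-squared test on a 2x2 table of correct versus incorrect answers for two model groups. All counts are made-up placeholders, and the choice of these two particular procedures is an assumption, not something the authors specify.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) on [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = numerator / denominator
    return statistic, math.erfc(math.sqrt(statistic / 2))


# Placeholder counts: (correct, incorrect) for a chat and a reasoning model.
chat_correct, chat_wrong = 70, 30
reasoning_correct, reasoning_wrong = 55, 45

print("chat accuracy 95% CI:", wilson_interval(chat_correct, chat_correct + chat_wrong))
print("chi-squared, p:", chi2_2x2(chat_correct, chat_wrong, reasoning_correct, reasoning_wrong))
```

    A comparison across three complexity levels would need a larger contingency table or a different test; the 2x2 case is shown only because it is the simplest instance.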

Circularity Check

0 steps flagged

Empirical observational study with no derivation chain or self-referential reductions

Full rationale

The paper conducts an empirical behavioral study by constructing a new 5,000-sample hierarchical prompt dataset, obtaining human annotations from 50 participants, and observing model responses to incomplete or misleading visual inputs. It reports patterns such as slow-thinking models favoring DFS-style reasoning and fabricating details, contrasted with faster models using BFS inference. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented; the central claims rest on direct experimental observations rather than any step that reduces by construction to prior inputs, self-citations, or renamed known results. The study is self-contained against external benchmarks via the newly collected annotations and model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the constructed prompt hierarchy and the reliability of human truthfulness judgments; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Human participants can reliably identify fabricated details versus truthful reasoning in model outputs.
    The study uses annotations from 50 humans to label truthfulness across the dataset.

pith-pipeline@v0.9.0 · 5751 in / 1169 out tokens · 61722 ms · 2026-05-19T12:41:55.503444+00:00 · methodology


