arxiv: 2605.27906 · v1 · pith:OZDOQZBNnew · submitted 2026-05-27 · 💻 cs.AI

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Pith reviewed 2026-06-29 13:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal reasoning · hallucination mitigation · chain of thought · direct preference optimization · vision language models · reasoning models · preference optimization

The pith

Modeling the reasoning chain as a condition for the answer lets preference optimization align supportive reasoning and reduce hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large reasoning models produce step-by-step chains but still hallucinate on vision-language tasks. Standard response-level direct preference optimization treats the full output as one block and ends up optimizing mainly at the answer level, leaving the reasoning steps under-supervised. The paper derives Reasoning-Conditioned DPO that explicitly treats the Chain-of-Thought as a conditioning variable and contrasts the same preferred answer under different reasoning conditions to enforce alignment. It also supplies a data-generation procedure that uses Monte Carlo Tree Search to find grounded positive chains and attention-based pruning to create negative ones. Experiments across models and benchmarks show the method reduces hallucinations and improves reasoning reliability.
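For orientation, the response-level objective described above can be written out. This is the standard DPO loss with each response y split into a chain c and an answer a. The symbols x (image and prompt), c, a, β, and π_ref are notation assumed here for illustration and are not taken from the paper.

```latex
% Standard DPO over full responses y = (c, a): chain c followed by answer a.
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% Chain rule: the response log-likelihood is a sum of a CoT part and an answer part.
\log \pi_\theta(y \mid x)
  = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c)
```

The two parts enter the loss only through their sum, so nothing in the objective itself forces the chain to receive its own supervision signal. The paper reports that, in practice, this objective behaves much like answer-only optimization.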

Core claim

The central claim is that standard response-level DPO performs similarly to answer-only optimization and therefore fails to exploit CoT-level supervision. RC-DPO corrects this by modeling the CoT as a condition for answer generation and contrasting the preference for the identical preferred answer under different CoT conditions, which promotes alignment of answer-supportive reasoning chains. The effect is strengthened by positive samples found with Monte Carlo Tree Search and negative samples built with attention-guided pruning.

What carries the argument

Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which treats the Chain-of-Thought as a conditioning variable on answer generation and contrasts preferences across different CoTs for the same answer.
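The exact RC-DPO loss is not reproduced on this page, so the following is only a minimal sketch of one plausible reading of the idea. It places a standard per-pair DPO loss next to a hypothetical conditioned term that scores the same preferred answer under a positive and a negative chain. The function names, the argument layout, and the default `beta` are assumptions made for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard response-level DPO loss for one preference pair.

    Each argument is a summed log-probability of a full response
    (CoT plus answer) under the policy or the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def reasoning_conditioned_term(ans_logp_pos, ans_logp_neg,
                               ref_ans_logp_pos, ref_ans_logp_neg, beta=0.1):
    """Hypothetical CoT-conditioned term (an assumption, not the paper's formula).

    ans_logp_pos: log pi(a_w | x, c_pos), the preferred answer under the positive CoT.
    ans_logp_neg: log pi(a_w | x, c_neg), the same answer under the negative CoT.
    The term is small when the positive chain supports the answer more than
    the negative chain does, relative to the reference model.
    """
    margin = beta * ((ans_logp_pos - ref_ans_logp_pos)
                     - (ans_logp_neg - ref_ans_logp_neg))
    return -log_sigmoid(margin)
```

With a zero margin both functions return log 2, the value of an uninformative pair. The conditioned term drops below that only when the answer becomes more likely under the positive chain than under the negative one, which is the contrast the summary describes.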

If this is right

  • RC-DPO produces measurable reductions in hallucinations compared with response-level DPO on the same models and benchmarks.
  • The method improves the reliability of the multimodal reasoning process by aligning reasoning chains with correct answers.
  • The Monte Carlo Tree Search and attention-pruning procedure generates training pairs that further strengthen the optimization.
  • Gains hold across multiple base models and multiple vision-language benchmarks.
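The page does not say how attention-guided pruning selects tokens. As a rough sketch under that uncertainty, the code below assumes each CoT token already has a scalar score for how much attention it places on image tokens, and it drops the highest-scoring tokens to produce a less grounded chain. The function name, the `prune_ratio` parameter, and the selection rule are all hypothetical.

```python
from typing import List, Sequence


def prune_grounded_tokens(tokens: Sequence[str],
                          visual_attention: Sequence[float],
                          prune_ratio: float = 0.3) -> List[str]:
    """Drop the CoT tokens with the highest attention mass on image tokens.

    Assumption for illustration: removing the most visually grounded tokens
    yields a weaker chain that can serve as a negative sample. The paper's
    actual selection rule may differ.
    """
    if len(tokens) != len(visual_attention):
        raise ValueError("tokens and visual_attention must have equal length")
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must lie in [0, 1]")
    n_drop = int(len(tokens) * prune_ratio)
    # Indices sorted by attention, highest first; ties keep original order.
    ranked = sorted(range(len(tokens)), key=lambda i: -visual_attention[i])
    dropped = set(ranked[:n_drop])
    return [tok for i, tok in enumerate(tokens) if i not in dropped]


if __name__ == "__main__":
    cot = ["The", "red", "car", "is", "parked"]
    attn = [0.05, 0.40, 0.35, 0.05, 0.15]  # made-up scores for the demo
    print(prune_grounded_tokens(cot, attn, prune_ratio=0.4))
```

In the demo, the two tokens with the largest scores ("red" and "car") are removed, leaving a chain that no longer names what is in the image.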

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning idea could be inserted into other preference-based alignment algorithms beyond DPO.
  • The separation of reasoning and answer optimization stages suggests a route for diagnosing which parts of a chain cause hallucinations.
  • The search-and-pruning data pipeline could be reused to build larger, higher-quality reasoning datasets for any multimodal model.

Load-bearing premise

Standard response-level DPO performs similarly to answer-only optimization and therefore leaves CoT-level supervision insufficiently exploited.

What would settle it

A controlled ablation in which standard response-level DPO is shown to produce measurably different hallucination rates than pure answer-only optimization on the same benchmarks would falsify the motivation for introducing the conditioning term.
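One simple way to decide whether two hallucination rates are "measurably different" is a two-proportion z-test. The sketch below is a generic statistical helper, not a procedure from the paper; the function name and argument names are assumptions.

```python
import math


def two_proportion_z_test(halluc_a: int, n_a: int, halluc_b: int, n_b: int):
    """Two-sided z-test for a difference between two hallucination rates.

    halluc_a / n_a: hallucinated responses and total responses for method A.
    halluc_b / n_b: the same counts for method B.
    Returns (z statistic, two-sided p-value) using the pooled-variance form.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = halluc_a / n_a, halluc_b / n_b
    pooled = (halluc_a + halluc_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0, 1.0  # both rates are 0 or both are 1: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A small p-value when A is response-level DPO and B is answer-only DPO would be the outcome described above as falsifying the motivation. Since both methods are scored on the same benchmark items, a paired test such as McNemar's may be the better choice; the unpaired form is shown only because it is the simplest.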

Figures

Figures reproduced from arXiv: 2605.27906 by Bin Chen, Hao Fang, Hao Wu, Jiawei Kong, Jinyu Li, Min Zhang, Shunxiang Liao, Shu-Tao Xia.

Figure 1: A visual question answering example gener…
Figure 2: Motivating observations for RC-DPO. (a) Loss-ratio curves of CoT and answer segments during OPA…
Figure 3: Overview of RC-DPO. We first construct SFT data to align the CoT format. Then, we construct positive…
Figure 4: Radar chart of the GPT-4 assisted evaluation results. We report the Sentence-level Hallucination Ratio…
Figure 5: Ablation studies of RC-DPO. (a) Effects of different components on CHAIR…
Figure 6: Segment-level CHAIRS results on Object HalBench. We compare Vanilla, OPA-DPO, and RC-DPO across three MLRMs, and report CHAIRS scores for the full response, CoT, and final answer separately.
Figure 7: Overall MMBench scores of RC-DPO and baseline methods on MM-Eureka-7B.
  … to the high hallucination rate of the full response. OPA-DPO, based on response-level preference optimization, alleviates hallucinations to some extent on several models. However, due to the answer-shortcut phenomenon discussed in Section 2, CoT hallucination remains significantly higher than answer hallucination, indicating that reas…
Figure 9: Few-shot prompt used to generate well-structured reasoning trajectories for MCTS-based positive…
Figure 10: Prompt used to verify the quality of CoT-answer trajectories during MCTS sampling.
Figure 11: Visualization results comparing our RC-DPO and other methods on MM-Eureka-7B. Hallucinations are…
Original abstract

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard response-level DPO on (CoT + answer) pairs behaves similarly to answer-only optimization, leaving CoT-level supervision unexploited and failing to mitigate hallucinations effectively in multimodal large reasoning models. It introduces Reasoning-Conditioned DPO (RC-DPO), which treats the CoT as a conditioning variable and contrasts preferences for the same answer under different CoT conditions to promote answer-supportive reasoning alignment. A data-generation pipeline using MCTS for positive (visually grounded) CoTs and attention-guided pruning for negatives is proposed, with experiments across models and benchmarks reported to show reduced hallucinations and improved reasoning reliability.

Significance. If the central empirical claim holds and the motivation is substantiated, RC-DPO would represent a targeted improvement over monolithic DPO for reasoning models by explicitly aligning CoT with answers. The MCTS-based positive sampling and pruning strategy could be reusable for other preference-based reasoning alignment tasks. The work directly addresses a practical failure mode (hallucinations) in an active area of multimodal LLM research.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method motivation): The claim that 'this formulation performs similarly to answer-only optimization' is presented as an empirical observation but is not accompanied by any quantitative comparison, ablation table, loss decomposition, or statistical test in the provided text. This equivalence is load-bearing for the motivation of the RC-DPO conditioning term; without evidence that standard DPO leaves CoT supervision unexploited, the new contrastive term may be an incremental reparameterization rather than a fix for a genuine gap.
  2. [§4] §4 (experiments): While 'extensive experiments' are asserted to demonstrate mitigation of hallucinations, the abstract supplies no numbers, baselines, or ablation results (e.g., RC-DPO vs. standard DPO on the same data, or vs. answer-only DPO). The central claim that RC-DPO 'effectively mitigates hallucinations' therefore rests on unshown results; a table or figure quantifying the improvement attributable to the conditioning term is required to support the contribution.
minor comments (1)
  1. [§3] Notation: The distinction between the standard DPO loss and the proposed RC-DPO loss should be written explicitly with equations early in §3 to make the conditioning term immediately comparable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support of our motivation and clearer presentation of results. We will revise the manuscript to address these points directly.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method motivation): The claim that 'this formulation performs similarly to answer-only optimization' is presented as an empirical observation but is not accompanied by any quantitative comparison, ablation table, loss decomposition, or statistical test in the provided text. This equivalence is load-bearing for the motivation of the RC-DPO conditioning term; without evidence that standard DPO leaves CoT supervision unexploited, the new contrastive term may be an incremental reparameterization rather than a fix for a genuine gap.

    Authors: We agree the motivation requires explicit quantitative backing, which is absent from the current text. Preliminary internal experiments showed the similarity, but these were not reported. In revision, we will add a new ablation subsection (likely in §3) with tables comparing standard DPO on (CoT+answer) pairs vs. answer-only DPO across hallucination rate, accuracy, and reasoning metrics on multimodal benchmarks. This will demonstrate under-exploitation of CoT supervision and justify the conditioning term. revision: yes

  2. Referee: [§4] §4 (experiments): While 'extensive experiments' are asserted to demonstrate mitigation of hallucinations, the abstract supplies no numbers, baselines, or ablation results (e.g., RC-DPO vs. standard DPO on the same data, or vs. answer-only DPO). The central claim that RC-DPO 'effectively mitigates hallucinations' therefore rests on unshown results; a table or figure quantifying the improvement attributable to the conditioning term is required to support the contribution.

    Authors: We concur that key quantitative results and ablations must be more prominently featured. We will update the abstract with specific improvement numbers (e.g., hallucination reduction percentages) and add a dedicated results table in §4 showing RC-DPO vs. standard DPO and answer-only baselines on identical data. This will isolate the contribution of the CoT-conditioning term and the MCTS/pruning pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: RC-DPO is an explicit non-tautological extension of DPO

full rationale

The paper defines RC-DPO by explicitly adding a CoT-conditioning term to the standard DPO loss and contrasting preferences for the same answer under different CoT conditions. No equations are shown that reduce this new term to a fitted input, a self-citation, or a renaming of the original DPO objective. The motivating observation that response-level DPO behaves like answer-only optimization is stated as an empirical claim, without any demonstrated algebraic equivalence or self-referential construction. Data generation via MCTS and attention pruning relies on standard external techniques that are not internal to the loss derivation. The method therefore remains self-contained against the DPO baseline.
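For reference, MCTS implementations commonly choose which child to expand with the UCT rule below. Whether the paper uses this exact rule is not stated on this page, so the formula and the exploration constant c_uct are assumptions. In this setting a state s would be a partial reasoning chain and an action a a candidate next step.

```latex
% UCT selection rule commonly used in MCTS (assumed here, not confirmed by the source).
% Q(s,a): mean value of taking step a from state s
% N(s):   visit count of the parent state
% N(s,a): visit count of the child reached by a
a^{*} = \arg\max_{a}
  \left[ Q(s,a) + c_{\mathrm{uct}} \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]
```

The first term favors steps that have scored well so far and the second favors steps that have been tried rarely, which is how the search balances exploiting promising chains against exploring new ones.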

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach inherits the standard DPO Bradley-Terry assumption that pairwise preferences reflect an underlying reward; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Pairwise human or model preferences can be modeled by a latent reward function via the Bradley-Terry model
    Standard assumption underlying all DPO variants; invoked when the RC-DPO loss is defined.
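Written out, the Bradley-Terry assumption and the implicit reward that DPO derives from it are as follows. The symbols r, β, π_θ, π_ref, and Z(x) follow the usual DPO notation and are not quoted from this paper.

```latex
% Bradley-Terry preference model with latent reward r.
P(y_w \succ y_l \mid x)
  = \sigma\big( r(x, y_w) - r(x, y_l) \big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% DPO's implicit reward; the partition term Z(x) cancels in the difference above.
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

Substituting the second line into the first removes Z(x) and yields the log-ratio margin that appears inside the DPO loss.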

pith-pipeline@v0.9.1-grok · 5763 in / 1238 out tokens · 32734 ms · 2026-06-29T13:01:01.390669+00:00 · methodology
