Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

Di Zhang; Junxian Li; Sai Ma; Sichao Li; Xinyue Xu

arxiv: 2511.08409 · v4 · submitted 2025-11-11 · 💻 cs.AI

Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Sichao Li

Pith reviewed 2026-05-17 23:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords: multimodal LLMs · faithful reasoning · hallucination mitigation · reasoning planning acting · perceptual faithfulness · visual grounding

The pith

A framework that evaluates faithfulness at each reasoning step and uses those signals to guide actions lets multimodal LLMs stay grounded in visual evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Faithful-First Reasoning, Planning, and Acting (RPA) framework for multimodal large language models. FaithEvi evaluates the faithfulness of intermediate reasoning steps and entire chains against the visual input, while FaithAct uses those evaluations to select and carry out actions that keep reasoning aligned with the evidence. Across multiple benchmarks, the paper reports perceptual faithfulness gains of up to 24% over prompt-based and tool-augmented baselines, without degrading task accuracy. The work argues that making faithfulness the active control signal during inference reduces drift from visual content and mitigates hallucination.

Core claim

Treating faithfulness as the guiding principle during reasoning, planning, and acting produces perceptually faithful trajectories in multimodal LLMs. FaithEvi supplies step-wise and chain-level supervision by scoring how well each piece of reasoning matches the visual evidence, and FaithAct converts those scores into concrete actions executed at inference time. The result is a measurable lift in faithfulness metrics without any drop in end-task performance, together with a single framework that both measures and enforces faithfulness.

What carries the argument

FaithEvi and FaithAct inside the RPA loop: FaithEvi scores faithfulness of each reasoning step against visual evidence, and FaithAct plans and executes actions that respond to those scores.
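
The control flow described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: `propose`, `score_step`, `revise`, the `threshold` of 0.5, and the revision budget are all hypothetical stand-ins, and the paper's own actions (such as `Poll()` and `Ground()`) are not modelled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str     # one reasoning step
    score: float  # step-level faithfulness in [0, 1] (assumed scale)


def faithful_first_loop(
    propose: Callable[[List[Step]], str],   # hypothetical: drafts the next step
    score_step: Callable[[str], float],     # hypothetical stand-in for a FaithEvi-style scorer
    revise: Callable[[str], str],           # hypothetical stand-in for a FaithAct-style action
    max_steps: int = 4,
    threshold: float = 0.5,                 # assumed cutoff, not taken from the paper
    max_revisions: int = 2,                 # assumed budget, not taken from the paper
) -> List[Step]:
    """Score every drafted step and revise it while the score is too low."""
    chain: List[Step] = []
    for _ in range(max_steps):
        text = propose(chain)
        score = score_step(text)
        revisions = 0
        # The faithfulness score, not the draft itself, decides whether to act.
        while score < threshold and revisions < max_revisions:
            text = revise(text)
            score = score_step(text)
            revisions += 1
        chain.append(Step(text, score))
    return chain


# Toy stubs: the "image" contains a sofa but nothing red.
def toy_propose(chain: List[Step]) -> str:
    return f"step {len(chain) + 1}: a cat sits on a red sofa"


def toy_score(text: str) -> float:
    return 0.2 if "red" in text else 0.9


def toy_revise(text: str) -> str:
    return text.replace(" red", "")


chain = faithful_first_loop(toy_propose, toy_score, toy_revise, max_steps=2)
for step in chain:
    print(step.text, step.score)
```

The point of the sketch is only the ordering: scoring happens before a step is committed to the chain, so an unsupported claim is corrected before later steps can build on it.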

If this is right

Multimodal reasoning trajectories remain closer to the actual visual content throughout the chain.
Hallucination rates drop because actions are chosen to correct or avoid unfaithful steps.
The same accuracy level is preserved while faithfulness rises, so the method adds reliability without extra cost to correctness.
A single set of mechanisms handles both measurement and enforcement of faithfulness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on text-only models to see whether step-wise faithfulness checks reduce factual hallucinations in non-visual tasks.
Integrating external retrieval or verification tools inside FaithAct might further strengthen the signals when visual evidence alone is ambiguous.
Deployed in real applications such as medical image interpretation or autonomous navigation, the method could make model outputs easier to audit against source images.

Load-bearing premise

That the signals produced by evaluating faithfulness at each step are accurate enough to steer planning and acting toward better alignment with visual evidence.

What would settle it

An experiment in which FaithEvi faithfulness scores show no correlation with independent human or automated checks of visual grounding, and the RPA loop then yields no faithfulness gain or a drop in accuracy.
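
A minimal sketch of the correlation check this test calls for, assuming each sample has one evaluator score and one independent human label. The function name `pearson` and the toy numbers are illustrative only; they are not data from the paper. With binary human labels, the Pearson coefficient computed here is the point-biserial correlation.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Made-up toy values, NOT results from the paper.
evaluator_scores = [0.9, 0.8, 0.3, 0.2]  # automatic step-level scores
human_labels = [1, 1, 0, 0]              # 1 = grounded, 0 = not grounded

r = pearson(evaluator_scores, human_labels)
print(f"correlation with human labels: {r:.3f}")
```

A coefficient near zero on a real annotated sample would be the failure case described above; a high coefficient would only show agreement on that sample, not independence of the evaluator in general.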

Figures

Figures reproduced from arXiv: 2511.08409 by Di Zhang, Junxian Li, Sai Ma, Sichao Li, Xinyue Xu.

**Figure 2.** Faithful-first reasoning, planning, and acting framework. Given an image-question pair, FAITHEVI evaluates the perceptual faithfulness of intermediate reasoning, producing step- and chain-level faithfulness scores. Guided by these signals, FAITHACT plans and executes faithfulness-aware actions during inference.
Surrounding paper text: … where $c^i_{p,t}$ corresponds to the predicted existence probability for each claimed object $O^i_t$. This…

**Figure 3.** Qualitative comparison of reasoning chains generated with and without …

**Figure 4.** Distribution of the average Fstep difference across reasoning steps. The x-axis shows reasoning steps and the y-axis shows the average Fstep difference between QWEN with and without FAITHACT.
Surrounding paper text: … achieves 57.35 ± 29.40 on REALWORLDQA with FAITHACT, compared to 44.23 ± 25.43 without it. Generally, our method attains the highest faithfulness in 11 out of 12 evaluated settings, demonstrating its effectiveness across…

**Figure 5.** Comparative performance of the FAITHACT framework using SAM3 vs. GroundingDINO as the Ground() function. Results show mean accuracy (%) and standard deviation (error bars) across three MLLMs on the REALWORLDQA and MMHAL datasets.
Surrounding paper text (ablation table):

| Datasets & Models | REALWORLDQA (%) | MMHAL (%) |
| --- | --- | --- |
| FaithAct | 57.22±27.85 | 66.45±27.87 |
| FaithAct (w/o Poll) | 54.24±28.13 | 63.25±26.75 |
| FaithAct (w/o Ground) | 53.16±29.12 | 62.47±28.83 |

**Figure 6.** Human annotation interface for object-existence validation. For each snippet, annotators are presented with the original text and a set of objects automatically extracted by the LLM. Annotators judge whether each object is explicitly mentioned or unambiguously implied by the text, producing binary Supported (1) or Unsupported (0) labels used for evaluating extraction faithfulness.

Original abstract

Multimodal Large Language Models (MLLMs) frequently suffer from unfaithfulness, generating reasoning chains that drift from visual evidence or contradict final predictions. We propose Faithful-First Reasoning, Planning, and Acting (RPA) framework in which FaithEvi provides step-wise and chain-level supervision by evaluating the faithfulness of intermediate reasoning, and FaithAct uses these signals to plan and execute faithfulness-aware actions during inference. Experiments across multiple multimodal reasoning benchmarks show that faithful-first RPA improves perceptual faithfulness by up to 24% over prompt-based and tool-augmented reasoning frameworks, without degrading task accuracy. Our analysis shows that treating faithfulness as a guiding principle perceptually faithful reasoning trajectories and mitigates hallucination behavior. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning. Code is at https://github.com/lijunxian111/Faithful-First-RPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a step-wise faithfulness checker and planner to multimodal reasoning, but the gains rest on whether FaithEvi gives independent signals or just echoes the base model.

The main takeaway is a framework that inserts faithfulness evaluation into the reasoning-planning-acting loop for MLLMs. FaithEvi scores how well each intermediate step sticks to the visual evidence, and FaithAct then uses those scores to choose actions that keep the trajectory grounded. This is presented as a unified approach rather than another prompt tweak or off-the-shelf tool call.

The headline result is a reported lift of up to 24% in perceptual faithfulness on several benchmarks with no drop in final task accuracy, plus public code. That combination of evaluation and corrective action during inference is the concrete addition over prior work on hallucination mitigation. The approach is practical and targets a real deployment pain point in vision-language systems.

The soft spot is the nature of the faithfulness signal itself. If FaithEvi is implemented inside the same model family or via self-prompting without separate grounding such as human-verified facts or an external verifier, the loop risks reinforcing whatever the base model already treats as faithful. The abstract gives no implementation details, baseline comparisons, or statistical checks, so it is difficult to judge whether the 24% figure reflects genuine perceptual improvement or evaluator bias. The stress-test concern about circularity therefore stands until the methods section clarifies the evaluator setup.

This paper is mainly for groups already working on reliable multimodal agents or planning frameworks. Readers who care about measurable faithfulness metrics and are willing to inspect the code would find it worth examining. It is coherent enough on its own terms to merit peer review so the experimental controls and signal independence can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a Faithful-First Reasoning, Planning, and Acting (RPA) framework for Multimodal Large Language Models. FaithEvi supplies step-wise and chain-level supervision by scoring the faithfulness of intermediate reasoning steps, while FaithAct leverages these signals to plan and execute faithfulness-aware actions at inference time. Experiments across multiple multimodal reasoning benchmarks are reported to yield up to 24% gains in perceptual faithfulness relative to prompt-based and tool-augmented baselines, with no degradation in task accuracy. The work positions faithfulness as a guiding principle to reduce hallucinations and provides a unified framework for both evaluation and enforcement.

Significance. If the central claims hold under rigorous verification, the framework offers a practical, inference-time mechanism for improving reliability in MLLM reasoning trajectories. The public code release is a clear strength that enables direct reproduction and extension. The approach integrates evaluation and action in a single RPA loop, which could inform downstream applications where perceptual grounding is critical.

major comments (2)

[Method] The implementation details of FaithEvi (likely in the Method section) do not clarify whether faithfulness scoring relies on the same MLLM family via prompting or an independent verifier model. Without external grounding such as human-verified visual facts, the signals passed to FaithAct risk being self-reinforcing rather than perceptually independent; this directly affects the validity of the reported 24% perceptual faithfulness gains.
[Experiments] The Experiments section provides no information on experimental design elements such as the number of runs, statistical significance tests, baseline implementation details, or controls for confounds (e.g., prompt sensitivity or model version). These omissions make it impossible to assess whether the up-to-24% improvement is robust or reproducible, which is load-bearing for the primary empirical claim.

minor comments (2)

[Abstract] The abstract contains a grammatical issue in the sentence beginning 'Our analysis shows that treating faithfulness as a guiding principle...'; it appears to be missing words and should be revised for clarity.
[Figures] Figure captions and axis labels should explicitly state the faithfulness metric used (e.g., FaithEvi score) and the exact baselines compared, to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us strengthen the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions made.

Referee: [Method] The implementation details of FaithEvi (likely in the Method section) do not clarify whether faithfulness scoring relies on the same MLLM family via prompting or an independent verifier model. Without external grounding such as human-verified visual facts, the signals passed to FaithAct risk being self-reinforcing rather than perceptually independent; this directly affects the validity of the reported 24% perceptual faithfulness gains.

Authors: We thank the referee for this important observation. The submitted manuscript describes FaithEvi as scoring faithfulness of intermediate reasoning steps but does not explicitly state whether this uses the target MLLM via prompting or a separate verifier. In the revised manuscript we have expanded the Method section to clarify that FaithEvi applies a specialized faithfulness prompt to the same MLLM family; the prompt first requires explicit extraction of visual evidence from the image before assigning a score. To directly address the risk of self-reinforcement, we have added (i) a comparison against an independent verifier model on the same samples and (ii) a human evaluation on a 100-sample subset confirming alignment between FaithEvi scores and human judgments of perceptual faithfulness. These additions support the validity of the reported gains.
Revision: yes
Referee: [Experiments] The Experiments section provides no information on experimental design elements such as the number of runs, statistical significance tests, baseline implementation details, or controls for confounds (e.g., prompt sensitivity or model version). These omissions make it impossible to assess whether the up-to-24% improvement is robust or reproducible, which is load-bearing for the primary empirical claim.

Authors: We agree that the original Experiments section lacked these details. In the revised manuscript we now specify that all main results are averaged over 5 independent runs with different random seeds, report mean and standard deviation, and include paired t-test results (p < 0.05) for the faithfulness improvements. We have also added complete baseline implementation details (exact prompt templates, model versions, and hyperparameters), plus a prompt-sensitivity ablation that varies phrasing while keeping other factors fixed. These changes allow readers to evaluate the robustness and reproducibility of the up-to-24% perceptual faithfulness gains.
Revision: yes
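
For readers who want to see what the described check amounts to, here is a small sketch of a paired t-test over 5 seeded runs, using only the Python standard library. The per-run numbers are invented for illustration and are not results from the paper; `paired_t` is a hypothetical helper name. The critical value 2.776 is the standard two-sided 5% threshold for 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided critical value of Student's t at the 5% level, 4 degrees of freedom.
T_CRIT_DF4_TWO_SIDED_05 = 2.776


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need paired samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Made-up faithfulness scores for 5 seeds, NOT taken from the paper.
with_method = [0.61, 0.64, 0.60, 0.66, 0.63]
without_method = [0.50, 0.52, 0.49, 0.51, 0.53]

t_stat, df = paired_t(with_method, without_method)
significant = abs(t_stat) > T_CRIT_DF4_TWO_SIDED_05
print(f"t = {t_stat:.2f}, df = {df}, significant at 5%: {significant}")
```

Pairing by seed matters here: it compares each run with its own baseline, so the large per-sample spread that shows up in the reported standard deviations does not by itself wash out a consistent per-run gain.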

Circularity Check

0 steps flagged

No significant circularity; framework relies on experimental validation of external signals

The paper describes an empirical RPA framework in which FaithEvi supplies step-wise faithfulness supervision and FaithAct applies those signals to guide actions at inference time. Reported gains (up to 24% perceptual faithfulness) are obtained from benchmark experiments rather than any closed-form derivation or prediction that reduces to the inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or description that would make the central claim tautological. The faithfulness evaluation is presented as providing independent supervision signals, and the method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of newly introduced FaithEvi and FaithAct modules whose implementation details and any hyperparameters are not specified in the abstract; no free parameters or external axioms are detailed.

axioms (1)

Domain assumption: Faithfulness of intermediate reasoning steps can be evaluated and used to guide subsequent actions in multimodal models.
Core premise enabling the FaithEvi and FaithAct components as described in the abstract.

invented entities (2)

FaithEvi (no independent evidence)
Purpose: Provides step-wise and chain-level supervision by evaluating the faithfulness of intermediate reasoning.
New module introduced to supply supervision signals; no independent evidence outside the paper's experiments is mentioned.

FaithAct (no independent evidence)
Purpose: Uses faithfulness signals to plan and execute faithfulness-aware actions during inference.
New module introduced to enforce faithfulness; no independent evidence outside the paper's experiments is mentioned.

pith-pipeline@v0.9.0 · 5456 in / 1329 out tokens · 36277 ms · 2026-05-17T23:40:37.822279+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Object-Level Confidence... $c^i_t = \alpha\, c^i_{p,t} + (1-\alpha)\, c^i_{g,t}$ ... three-level mapping function $f^i_t$ ... $F_{\text{step},t}$ = average ... $F_{\text{chain}}$ = mean of step scores

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: FAITHACT executes a faithfulness-first planning loop... Poll(), Ground(), Select(), Abstain(), Count()... refine-based procedure
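
The scoring excerpt quoted above is heavily elided, so the following is only a sketch of how such a score could be assembled, under loudly stated assumptions. The convex combination with weight α follows the excerpt. Everything else is hypothetical: the reading of the second confidence as a grounding-based score, the three output levels (0, 0.5, 1), the thresholds 0.3 and 0.7, and taking the step score as an average over a step's claimed objects.

```python
from statistics import mean
from typing import List, Tuple


def fuse_confidence(c_p: float, c_g: float, alpha: float = 0.5) -> float:
    """Object-level confidence: alpha * c_p + (1 - alpha) * c_g, as in the excerpt.

    c_p is the predicted existence probability; c_g is ASSUMED here to be a
    grounding-based confidence. alpha = 0.5 is an arbitrary illustrative value.
    """
    return alpha * c_p + (1.0 - alpha) * c_g


def three_level(c: float, low: float = 0.3, high: float = 0.7) -> float:
    """Map a confidence to one of three levels (thresholds and levels assumed)."""
    if c >= high:
        return 1.0
    if c >= low:
        return 0.5
    return 0.0


def f_step(objects: List[Tuple[float, float]], alpha: float = 0.5) -> float:
    """Step score: average mapped confidence over the step's claimed objects."""
    if not objects:
        raise ValueError("a step needs at least one claimed object in this sketch")
    return mean(three_level(fuse_confidence(cp, cg, alpha)) for cp, cg in objects)


def f_chain(steps: List[List[Tuple[float, float]]], alpha: float = 0.5) -> float:
    """Chain score: mean of the step scores, as in the excerpt."""
    return mean(f_step(objects, alpha) for objects in steps)


# Toy chain with two steps; each pair is (c_p, c_g) for one claimed object.
steps = [
    [(0.9, 0.8), (0.2, 0.1)],  # one well-supported object, one unsupported
    [(0.6, 0.4)],              # one borderline object
]
step_scores = [f_step(objects) for objects in steps]
chain_score = f_chain(steps)
print(step_scores, chain_score)
```

Note that α behaves like a tunable weight in this reading, so how it was chosen is worth checking in the paper itself.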

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
