
arxiv: 2602.12506 · v3 · pith:443FAUPAnew · submitted 2026-02-13 · 💻 cs.LG

On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Pith reviewed 2026-05-22 11:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning finetuning · vision-language models · chain-of-thought consistency · robustness · faithfulness · adversarial augmentation · visual reasoning

The pith

RL finetuning of VLMs raises benchmark accuracy but erodes chain-of-thought reliability and robustness to textual changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the effects of reinforcement learning finetuning on vision-language models for visual reasoning tasks. It shows that accuracy gains on standard benchmarks come alongside reduced consistency in step-by-step reasoning and greater vulnerability to simple textual shifts such as misleading captions or flawed reasoning traces. These vulnerabilities appear more severe in open-source models than in closed ones. Adversarial data augmentation strengthens resistance to some perturbations yet leaves the alignment between answers and reasoning chains unstable. The findings point to the need for training and testing methods that track correctness, robustness, and faithfulness together instead of accuracy in isolation.

Core claim

Reinforcement learning finetuning improves performance on visual reasoning benchmarks yet simultaneously reduces the reliability of the chain-of-thought reasoning and its robustness to contextual shifts such as misleading captions. Adversarial augmentation enhances robustness without preventing faithfulness drift. A faithfulness-aware reward restores answer-reasoning alignment, but combining it with augmentation can lead to reliance on shortcuts, leaving overall robustness hard to achieve.
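
To make the idea of a faithfulness-aware reward concrete, here is a minimal sketch in Python. It is not the paper's reward. The `<think>`/`<answer>` tag format, the substring-based faithfulness proxy, the weight `lam`, and all function names are assumptions chosen for illustration.

```python
import re

# Assumed output format: reasoning in <think>...</think>, answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract(pattern: re.Pattern, text: str) -> str:
    """Return the first tagged span, stripped, or an empty string if absent."""
    match = pattern.search(text)
    return match.group(1).strip() if match else ""


def faithfulness_score(reasoning: str, answer: str) -> float:
    """Hypothetical proxy: 1.0 if the final answer is stated in the reasoning."""
    if not answer:
        return 0.0
    return 1.0 if answer.lower() in reasoning.lower() else 0.0


def combined_reward(output: str, gold: str, lam: float = 0.5) -> float:
    """Correctness term plus a weighted faithfulness term (lam is an assumed weight)."""
    reasoning = extract(THINK_RE, output)
    answer = extract(ANSWER_RE, output)
    correct = 1.0 if answer.lower() == gold.lower() else 0.0
    return correct + lam * faithfulness_score(reasoning, answer)
```

A proxy this crude can be satisfied by simply restating the answer inside the reasoning, which is one way to picture how a faithfulness term might invite shortcuts. Whether that is the mechanism behind the collapse the paper reports is not established here.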

What carries the argument

The accuracy-faithfulness trade-off identified in the dynamics of RL finetuning for VLMs.

If this is right

  • Simple controlled textual perturbations cause substantial drops in robustness and confidence.
  • These drops become larger when CoT consistency across perturbations is measured, especially in open-source models.
  • Adversarial augmentation improves robustness to perturbations but does not stop faithfulness drift.
  • Adding a faithfulness-aware reward restores alignment between final answers and reasoning steps.
  • Pairing the faithfulness reward with augmentation risks training collapse onto shortcut strategies.
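
The first two points can be illustrated with a small scoring sketch. It assumes each evaluation item has already been run on a clean prompt and on a perturbed prompt (for example, with a misleading caption prepended), and that the answer implied by the chain of thought has been extracted separately. The `Trial` fields and metric names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    gold: str             # ground-truth answer
    clean_answer: str     # model answer on the unperturbed prompt
    pert_answer: str      # model answer under the textual perturbation
    pert_cot_answer: str  # answer implied by the CoT under perturbation (assumed extracted upstream)


def accuracy(pairs) -> float:
    """Fraction of (prediction, gold) pairs that match exactly."""
    pairs = list(pairs)
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def robustness_report(trials: list[Trial]) -> dict[str, float]:
    clean = accuracy((t.clean_answer, t.gold) for t in trials)
    pert = accuracy((t.pert_answer, t.gold) for t in trials)
    # Stricter score: the answer must be correct AND agree with the CoT.
    consistent = sum(
        t.pert_answer == t.gold and t.pert_cot_answer == t.pert_answer
        for t in trials
    ) / len(trials)
    return {
        "clean_acc": clean,
        "perturbed_acc": pert,
        "drop": clean - pert,
        "cot_consistent_acc": consistent,
    }
```

Because the CoT-consistent score counts only answers that are both correct and supported by the reasoning, it can never exceed perturbed accuracy, which matches the observation that drops look larger once consistency is measured.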

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training protocols may need multiple reward terms that explicitly discourage both ungrounded reasoning and shortcut solutions at the same time.
  • The robustness issues could extend to non-textual perturbations such as altered image regions, which the current tests do not cover.
  • Closed models might preserve faithfulness through implicit mechanisms not captured by current open-source RL pipelines.

Load-bearing premise

The gap between open-source and closed models reflects shortcomings in current open-source RL finetuning practices rather than inherent task limitations or differences in scale and data.

What would settle it

If closed models of comparable scale exhibit the same magnitude of robustness drop and CoT inconsistency under identical textual perturbations as the open-source models, this would indicate the gap arises from factors other than finetuning shortcomings.
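
One way such a comparison might be run is a permutation test on per-model robustness drops. The sketch below is an assumption about how the check could be set up, not a procedure from the paper. The function names are hypothetical, and any drop values passed in would have to come from real evaluations under identical perturbations.

```python
import random
from statistics import mean


def drop_gap(open_drops, closed_drops) -> float:
    """Mean robustness drop of open-source models minus that of closed models."""
    return mean(open_drops) - mean(closed_drops)


def permutation_p_value(open_drops, closed_drops, n_perm=10_000, seed=0) -> float:
    """Two-sided permutation test on the difference in mean drops."""
    rng = random.Random(seed)
    observed = abs(drop_gap(open_drops, closed_drops))
    pooled = list(open_drops) + list(closed_drops)
    k = len(open_drops)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties.
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A gap near zero with a large p-value, at matched scale, would be the outcome described above. With only a handful of models per group the test has little power, so a non-significant result alone would not settle the question.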

original abstract

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, including misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. In contrast, closed models exhibit similar failure modes but maintain markedly greater robustness and reasoning consistency, suggesting that the gap reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task. To better understand these vulnerabilities, we further analyze RL finetuning dynamics and uncover an accuracy-faithfulness trade-off: finetuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an empirical study on the effects of RL finetuning on VLMs for visual reasoning. It reports that RL finetuning improves benchmark accuracy but erodes chain-of-thought (CoT) faithfulness and robustness to textual perturbations (e.g., misleading captions or incorrect CoT traces). Open-source models exhibit more severe degradation than closed models; the authors attribute this gap to shortcomings in current open-source RL practices rather than task-inherent limits or scale/pretraining differences. The work analyzes finetuning dynamics to identify an accuracy-faithfulness trade-off, evaluates adversarial augmentation and faithfulness-aware rewards, and argues that accuracy-only evaluations are insufficient, motivating joint emphasis on correctness, robustness, and visually grounded reasoning.

Significance. If the empirical observations hold under stronger controls, the paper is significant for documenting concrete trade-offs in RL finetuning of VLMs and for showing that standard accuracy gains can come at the cost of reasoning consistency and perturbation robustness. The comparative analysis of open- and closed-source models, together with the examination of augmentation and reward interventions, supplies actionable evidence that multimodal reasoning systems require evaluation protocols beyond benchmark scores. These findings could influence training objectives and assessment practices in the field.

major comments (1)
  1. Abstract: the central attribution that the robustness and consistency gap 'reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task' or differences in model scale and data is load-bearing for the paper's interpretation. The experimental design does not report ablations that hold scale fixed, compare RL-finetuned versus base open-source models at matched sizes, or control for pretraining corpus differences; without these, the conclusion that the gap is specific to RL methodology remains under-supported.
minor comments (2)
  1. The abstract introduces 'faithfulness drift' and 'contextual shifts' without a concise operational definition; adding one sentence linking these terms to the specific CoT consistency and perturbation metrics used would aid readability.
  2. The description of perturbation experiments would benefit from an explicit list or table enumerating the exact textual modifications applied (misleading captions, incorrect CoT traces, etc.) and the number of trials per condition.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of the empirical observations on accuracy-faithfulness trade-offs and robustness issues in RL-finetuned VLMs. We address the major comment below, clarifying our experimental scope and planned revisions to strengthen the interpretation.

point-by-point responses
  1. Referee: Abstract: the central attribution that the robustness and consistency gap 'reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task' or differences in model scale and data is load-bearing for the paper's interpretation. The experimental design does not report ablations that hold scale fixed, compare RL-finetuned versus base open-source models at matched sizes, or control for pretraining corpus differences; without these, the conclusion that the gap is specific to RL methodology remains under-supported.

    Authors: We thank the referee for highlighting the load-bearing nature of this attribution. We agree that fully isolating RL methodology from scale and pretraining differences is challenging, as closed models are proprietary and their base versions or pretraining corpora are inaccessible. For the open-source models in our study, we do compare RL-finetuned variants directly against their base counterparts at matched sizes and show that the accuracy-faithfulness trade-off and robustness degradation emerge during the RL stage, as detailed in our finetuning dynamics analysis. The comparison to closed models illustrates that degradation is more severe under current open-source RL practices. However, we acknowledge that scale and pretraining differences may contribute to the gap and that our design does not include explicit ablations holding these fixed across model families. To address this, we will revise the abstract to frame the attribution as a hypothesis supported by the observed patterns rather than a definitive claim, and we will add an expanded limitations discussion explicitly noting these confounders and calling for future controlled studies. These changes will make the interpretation more precise without altering the core empirical results.

    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of RL finetuning effects on VLM robustness and CoT

full rationale

The paper conducts controlled experiments applying textual perturbations to open-source and closed VLMs, measuring accuracy drops, CoT consistency, and robustness before/after RL finetuning and adversarial augmentation. All central claims rest on these direct observations and comparisons rather than any derivation, equation, or prediction that reduces to fitted inputs or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to force conclusions; the interpretation that robustness gaps reflect open-source RL shortcomings is presented as a suggestion from the data, not a mathematical necessity. The work is self-contained against external benchmarks via explicit experimental protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical observations from controlled textual perturbations applied to selected models; no new mathematical axioms, free parameters, or invented entities are introduced.

axioms (1)
  • domain assumption: Controlled textual perturbations such as misleading captions or incorrect CoT traces are valid probes for visual grounding and reasoning faithfulness.
    Invoked throughout the abstract to demonstrate vulnerability and trade-offs.

pith-pipeline@v0.9.0 · 5818 in / 1312 out tokens · 52246 ms · 2026-05-22T11:09:39.032867+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    VLMs violate their own stated introspective rules for attributing colors to objects in nearly 60% of cases on items with strong color priors, unlike humans who largely follow theirs, revealing miscalibrated self-knowledge.

  2. The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
