On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs
Pith reviewed 2026-05-22 11:09 UTC · model grok-4.3
The pith
RL finetuning of VLMs raises benchmark accuracy but erodes chain-of-thought reliability and robustness to textual changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reinforcement learning finetuning improves performance on visual reasoning benchmarks yet simultaneously reduces the reliability of the chain-of-thought reasoning and its robustness to contextual shifts such as misleading captions. Adversarial augmentation enhances robustness without preventing faithfulness drift. A faithfulness-aware reward restores answer-reasoning alignment, but combining it with augmentation can lead to reliance on shortcuts, leaving overall robustness hard to achieve.
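To make the idea of a faithfulness-aware reward concrete, here is a minimal sketch, not the paper's actual reward. It assumes model outputs wrap reasoning and answer in `<think>` and `<answer>` tags; the function names, the weight `lam`, and the substring-based judge `cot_supports` are all hypothetical stand-ins chosen for illustration.

```python
import re

# Assumed output format: <think>reasoning</think> <answer>final answer</answer>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse(output):
    """Split a model output into (reasoning, answer); missing parts become ''."""
    think = THINK_RE.search(output)
    answer = ANSWER_RE.search(output)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )


def cot_supports(cot, answer):
    """Hypothetical faithfulness judge: does the reasoning mention the answer?

    A real judge would be far stronger (for example, a separate model);
    this substring check only illustrates where such a judge plugs in.
    """
    return bool(answer) and answer.lower() in cot.lower()


def faithfulness_aware_reward(output, gold, lam=0.5):
    """Accuracy term plus a weighted answer-reasoning alignment term."""
    cot, answer = parse(output)
    correct = float(answer == gold)
    faithful = float(cot_supports(cot, answer))
    return correct + lam * faithful
```

A judge this crude is satisfied by reasoning that merely restates the answer, which shows one way an alignment term could be gamed by a shortcut; whether the paper's collapse arises this way is not something this sketch establishes.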
What carries the argument
The accuracy-faithfulness trade-off identified in the dynamics of RL finetuning for VLMs.
If this is right
- Simple controlled textual perturbations cause substantial drops in robustness and confidence.
- These drops become larger when CoT consistency across perturbations is measured, especially in open-source models.
- Adversarial augmentation improves robustness to perturbations but does not stop faithfulness drift.
- Adding a faithfulness-aware reward restores alignment between final answers and reasoning steps.
- Pairing the faithfulness reward with augmentation risks training collapse onto shortcut strategies.
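The first two points above can be sketched as a small scoring routine. This is an illustrative assumption about how such metrics might be computed, not the paper's evaluation code: the `Trial` schema, the field names, and the notion that a judge has already extracted the answer each chain-of-thought argues for are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question evaluated with and without a textual perturbation."""
    gold: str             # ground-truth answer
    clean_answer: str     # final answer on the unperturbed prompt
    pert_answer: str      # final answer on the perturbed prompt
    pert_cot_answer: str  # answer the perturbed reasoning argues for (assumed pre-extracted)


def accuracy(pairs):
    """Fraction of (prediction, gold) pairs that match exactly."""
    pairs = list(pairs)
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def robustness_report(trials):
    clean = accuracy((t.clean_answer, t.gold) for t in trials)
    pert = accuracy((t.pert_answer, t.gold) for t in trials)
    # Stricter score: the answer is correct AND the reasoning supports it.
    consistent = sum(
        t.pert_answer == t.gold and t.pert_cot_answer == t.pert_answer
        for t in trials
    ) / len(trials)
    return {
        "clean_acc": clean,
        "perturbed_acc": pert,
        "drop": clean - pert,
        "perturbed_acc_cot_consistent": consistent,
        "drop_cot_consistent": clean - consistent,
    }
```

Because the consistency-aware score counts only answers that are both correct and backed by their own reasoning, it can never exceed plain perturbed accuracy, so the measured drop can only stay the same or grow.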
Where Pith is reading between the lines
- Training protocols may need multiple reward terms that explicitly discourage both ungrounded reasoning and shortcut solutions at the same time.
- The robustness issues could extend to non-textual perturbations such as altered image regions, which the current tests do not cover.
- Closed models might preserve faithfulness through implicit mechanisms not captured by current open-source RL pipelines.
Load-bearing premise
The gap between open-source and closed models reflects shortcomings in current open-source RL finetuning practices rather than inherent task limitations or differences in scale and data.
What would settle it
If closed models of comparable scale exhibit the same magnitude of robustness drop and CoT inconsistency under identical textual perturbations as the open-source models, this would indicate the gap arises from factors other than finetuning shortcomings.
Original abstract
Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, including misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. In contrast, closed models exhibit similar failure modes but maintain markedly greater robustness and reasoning consistency, suggesting that the gap reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task. To better understand these vulnerabilities, we further analyze RL finetuning dynamics and uncover an accuracy-faithfulness trade-off: finetuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study on the effects of RL finetuning on VLMs for visual reasoning. It reports that RL finetuning improves benchmark accuracy but erodes chain-of-thought (CoT) faithfulness and robustness to textual perturbations (e.g., misleading captions or incorrect CoT traces). Open-source models exhibit more severe degradation than closed models; the authors attribute this gap to shortcomings in current open-source RL practices rather than task-inherent limits or scale/pretraining differences. The work analyzes finetuning dynamics to identify an accuracy-faithfulness trade-off, evaluates adversarial augmentation and faithfulness-aware rewards, and argues that accuracy-only evaluations are insufficient, motivating joint emphasis on correctness, robustness, and visually grounded reasoning.
Significance. If the empirical observations hold under stronger controls, the paper is significant for documenting concrete trade-offs in RL finetuning of VLMs and for showing that standard accuracy gains can come at the cost of reasoning consistency and perturbation robustness. The comparative analysis of open- and closed-source models, together with the examination of augmentation and reward interventions, supplies actionable evidence that multimodal reasoning systems require evaluation protocols beyond benchmark scores. These findings could influence training objectives and assessment practices in the field.
major comments (1)
- Abstract: the central attribution, that the robustness and consistency gap 'reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task' and, by implication, rather than differences in model scale and data, is load-bearing for the paper's interpretation. The experimental design does not report ablations that hold scale fixed, compare RL-finetuned versus base open-source models at matched sizes, or control for pretraining corpus differences; without these, the conclusion that the gap is specific to RL methodology remains under-supported.
minor comments (2)
- The abstract introduces 'faithfulness drift' and 'contextual shifts' without a concise operational definition; adding one sentence linking these terms to the specific CoT consistency and perturbation metrics used would aid readability.
- The description of perturbation experiments would benefit from an explicit list or table enumerating the exact textual modifications applied (misleading captions, incorrect CoT traces, etc.) and the number of trials per condition.
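An enumeration of the kind requested in the second minor comment could look like the sketch below. It is only an assumption about how the two perturbation types named in the abstract (misleading captions and incorrect CoT traces) might be built; the function names, the prompt templates, and the example strings are hypothetical and not taken from the paper's protocol.

```python
def with_misleading_caption(question, caption):
    """Prepend a (possibly false) statement about the image to the question."""
    return f"{caption} {question}"


def with_incorrect_cot(question, wrong_trace):
    """Append a flawed reasoning trace for the model to consider."""
    return f"{question}\nHere is a reasoning trace to consider:\n{wrong_trace}"


def build_conditions(question, caption, wrong_trace):
    """Return every prompt variant for one question, keyed by condition name."""
    return {
        "clean": question,
        "wrong_caption": with_misleading_caption(question, caption),
        "wrong_cot": with_incorrect_cot(question, wrong_trace),
    }


conditions = build_conditions(
    question="How many bananas are in the image?",
    caption="This is an image with zero bananas.",
    wrong_trace="The image shows no fruit, so the count is 0.",
)
for name, prompt in conditions.items():
    print(f"[{name}] {prompt}")
```

Keeping all variants in one dictionary makes it straightforward to report the exact modifications and the number of trials per condition in a single table.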
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of the empirical observations on accuracy-faithfulness trade-offs and robustness issues in RL-finetuned VLMs. We address the major comment below, clarifying our experimental scope and planned revisions to strengthen the interpretation.
- Referee: Abstract: the central attribution, that the robustness and consistency gap 'reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task' and, by implication, rather than differences in model scale and data, is load-bearing for the paper's interpretation. The experimental design does not report ablations that hold scale fixed, compare RL-finetuned versus base open-source models at matched sizes, or control for pretraining corpus differences; without these, the conclusion that the gap is specific to RL methodology remains under-supported.
  Authors: We thank the referee for highlighting the load-bearing nature of this attribution. We agree that fully isolating RL methodology from scale and pretraining differences is challenging, as closed models are proprietary and their base versions or pretraining corpora are inaccessible. For the open-source models in our study, we do compare RL-finetuned variants directly against their base counterparts at matched sizes and show that the accuracy-faithfulness trade-off and robustness degradation emerge during the RL stage, as detailed in our finetuning dynamics analysis. The comparison to closed models illustrates that degradation is more severe under current open-source RL practices. However, we acknowledge that scale and pretraining differences may contribute to the gap and that our design does not include explicit ablations holding these fixed across model families. To address this, we will revise the abstract to frame the attribution as a hypothesis supported by the observed patterns rather than a definitive claim, and we will add an expanded limitations discussion explicitly noting these confounders and calling for future controlled studies. These changes will make the interpretation more precise without altering the core empirical results.
  Revision: partial
Circularity Check
No circularity: purely empirical measurements of RL finetuning effects on VLM robustness and CoT
The paper conducts controlled experiments applying textual perturbations to open-source and closed VLMs, measuring accuracy drops, CoT consistency, and robustness before and after RL finetuning and adversarial augmentation. All central claims rest on these direct observations and comparisons rather than on any derivation, equation, or prediction that reduces to fitted inputs or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force conclusions; the interpretation that robustness gaps reflect open-source RL shortcomings is presented as a suggestion from the data, not a mathematical necessity. The work is checked against external benchmarks through explicit experimental protocols, so its claims do not depend on its own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Controlled textual perturbations such as misleading captions or incorrect CoT traces are valid probes for visual grounding and reasoning faithfulness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We show that simple, controlled textual perturbations... cause substantial drops in robustness and confidence... uncover an accuracy-faithfulness trade-off: finetuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Reinforcement learning (RL) fine-tuning has emerged as a key post-training method... GRPO... verifiable rewards"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't
  VLMs violate their own stated introspective rules for attributing colors to objects in nearly 60% of cases on items with strong color priors, unlike humans who largely follow theirs, revealing miscalibrated self-knowledge.
- The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
  System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.