Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Byung-Kwan Lee; Dongjun Nam; Jeany Son; Seonghoon Yu

arxiv: 2605.11651 · v4 · pith:Z4QTKRP5new · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.CL

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

Pith reviewed 2026-05-19 16:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords VLM distillation · multimodal reasoning · visual grounding · reasoning prefix masking · think-answer models · visual forgetting · knowledge distillation

The pith

Masking the student's salient reasoning prefixes during distillation makes VLMs anchor their thinking more directly on visual evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large VLMs gain reasoning power by producing intermediate thinking steps before the final answer, but the cost is high and smaller models often lose track of the image during long traces. The paper proposes a distillation method that masks high-influence reasoning prefixes in the student's generation. Without those textual cues, the student must draw more from the visual input to continue predicting. Experiments show the resulting compact models exceed prior distillation baselines on multimodal reasoning benchmarks and exhibit stronger visual attention during their thinking process.

Core claim

By replacing the standard causal mask with a salient reasoning-prefix mask during distillation, the student is trained to rely on visual evidence for its next-token predictions. The mask is applied selectively to prefixes that most influence the student's output, with the amount of masking increased gradually according to the gap between teacher and student distributions. This setup directly targets visual forgetting by blocking both future tokens and the student's own reasoning cues.
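
The mask replacement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `influence` matrix, the `n_protected` boundary for image and question tokens, and the fixed per-token `budget` are all assumptions made for illustration, since this page does not specify how influence is measured.

```python
def causal_mask(n):
    """Standard causal mask: query i may attend to keys 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def salient_prefix_mask(influence, n_protected, budget):
    """Causal mask that additionally hides the `budget` most influential
    earlier reasoning tokens for each query position.

    influence[i][j] -- hypothetical score of how strongly key j drives the
                       prediction at query i (assumption: any nonnegative score)
    n_protected     -- leading positions (image and question tokens) never hidden
    """
    n = len(influence)
    mask = causal_mask(n)
    for i in range(n):
        # Only earlier reasoning tokens are candidates; a token always sees itself.
        candidates = sorted(range(n_protected, i),
                            key=lambda j: influence[i][j], reverse=True)
        for j in candidates[:budget]:
            mask[i][j] = False
    return mask


# Toy example: positions 0-1 stand for image/question tokens, 2-4 for reasoning tokens.
influence = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.3, 0.2, 0.5, 0.0, 0.0],
    [0.1, 0.1, 0.7, 0.1, 0.0],
    [0.1, 0.1, 0.2, 0.5, 0.1],
]
mask = salient_prefix_mask(influence, n_protected=2, budget=1)
```

In this toy case the last query position loses access to reasoning token 3, its most influential prefix, while the image and question positions stay visible, which is the situation in which the student is expected to fall back on visual evidence.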

What carries the argument

Token-wise salient reasoning-prefix masking paired with self-paced masking budget scheduling, which identifies and hides high-influence prefixes for each next-token prediction while scaling the masking difficulty to the current teacher-student discrepancy.
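
One way to picture a self-paced budget is sketched below. The page only states that the masking scale follows the teacher-student discrepancy; the use of reverse KL as the discrepancy, the exponential mapping, the `scale` parameter, and the direction (more masking as the student gets closer to the teacher) are assumptions for illustration, not the paper's rule.

```python
import math


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) between two next-token distributions."""
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student, teacher) if s > 0)


def masking_budget(discrepancy, max_budget, scale=1.0):
    """Hypothetical schedule: hide more prefixes as the discrepancy shrinks.

    `ease` is 1.0 when the distributions match and decays toward 0
    as the student drifts away from the teacher.
    """
    ease = math.exp(-discrepancy / scale)
    return int(round(max_budget * ease))


# Toy distributions over a three-token vocabulary (illustrative values only).
teacher = [0.7, 0.2, 0.1]
far_student = [0.1, 0.2, 0.7]
near_student = [0.65, 0.25, 0.10]

far_budget = masking_budget(reverse_kl(far_student, teacher), max_budget=8)
near_budget = masking_budget(reverse_kl(near_student, teacher), max_budget=8)
```

Under these assumptions the student that is far from the teacher receives a smaller budget than the one that is close, so masking difficulty grows only once the easier version of the task is under control.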

If this is right

The distilled student outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks.
Analyses confirm increased visual utilization throughout the student's thinking process.
The framework adapts masking scale to distillation difficulty via self-paced scheduling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same controlled hiding of internal cues could be tested in text-only distillation to reduce reliance on language priors.
Extending the masking to video or multi-image inputs might improve temporal or cross-image grounding.
The scheduling rule could be reused as an adaptive curriculum in other teacher-student alignment settings.

Load-bearing premise

That removing the student's own salient reasoning prefixes will cause it to substitute visual evidence for the missing textual information.

What would settle it

If attention weights on image tokens fail to rise during the student's reasoning steps under the prefix mask compared with standard distillation, the visual-anchoring mechanism would be falsified.
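
The check described above amounts to comparing the share of attention that lands on image tokens under the two training regimes. A minimal sketch follows; the attention rows are toy values chosen for illustration, not measurements from the paper, and `visual_attention_share` is a hypothetical helper name.

```python
def visual_attention_share(attn_rows, image_positions):
    """Fraction of attention mass on image tokens for each reasoning step.

    attn_rows[t] holds the attention weights of step t over all visible
    positions; image_positions lists the indices of image tokens.
    """
    image_positions = set(image_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        on_image = sum(w for j, w in enumerate(row) if j in image_positions)
        shares.append(on_image / total if total > 0 else 0.0)
    return shares


def mean(values):
    return sum(values) / len(values)


# Toy attention rows (positions 0-1 are image tokens); illustrative only.
standard_kd = [[0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]
prefix_masked = [[0.3, 0.3, 0.4], [0.25, 0.25, 0.5]]

share_standard = mean(visual_attention_share(standard_kd, [0, 1]))
share_masked = mean(visual_attention_share(prefix_masked, [0, 1]))

# The visual-anchoring claim predicts share_masked > share_standard
# on real models; equality or a drop would count against it.
```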

Figures

Figures reproduced from arXiv: 2605.11651 by Byung-Kwan Lee, Dongjun Nam, Jeany Son, Seonghoon Yu.

**Figure 2.** Reliance on salient cues. In particular, when distilling such long traces of think-answer VLMs, the student relies heavily on a small set of exposed textual cues, which receive disproportionately high attention values (…

**Figure 3.** The illustration of Masking-KD. During distillation, the student is guided by our salient …

**Figure 4.** Evidence on visual-anchored thinking. (a) Changes in visual attention as generation proceeds, and (b) an example of visual attention maps at the peak attention point in the gray box.

**Figure 5.** Comparison on visual attention map. We average the visual attention scores over the entire thinking trace. More visualizations are present in …

**Figure 6.** Prediction behavior of the student during distillation without and with our salient reasoning-prefix mask. Without a salient mask, the student uses a standard causal mask to predict the current token. With a salient mask, the student exploits more visual information to compensate for the masked salient reasoning prefix. More visualizations are provided in …

**Figure 7.** Evidence on textual shortcut learning in the student. (a) The reverse KL divergence gradually decreases as reasoning prefixes accumulate, suggesting that the student relies on exposed reasoning cues to imitate the teacher. (b) When response prefixes are masked, the distillation loss is substantially amplified compared with masking other regions.

**Figure 8.** Masked prefix distance. In this section, we analyze the relative position of masked prefixes with respect to the current token (i.e., the distance from the current token to the masked prefix) over 19k teacher responses, as shown in …

**Figure 9.** The think-answer response of our Masking-KD compared with the undistilled student (i.e., Qwen3-VL-2B-Thinking). The undistilled student produces perception errors, highlighted in the red box, whereas ours shows enhanced visual perception, highlighted in the green box.

**Figure 10.** More comparison on visual attention map. We average the visual attention scores over the entire thinking trace.

**Figure 11.** The instruction used to prompt the teacher model to generate think-answer trajectories.

**Figure 12.** More prediction behavior of the student during distillation without and with our salient reasoning-prefix mask across four types of reasoning problems: (a) math, (b) STEM, (c) table, and (d) chart.

read the original abstract

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher-student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyzes confirm enhanced visual utilization along the student thinking process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper's masking of salient reasoning prefixes is a reasonable practical idea for pushing visual reliance in VLM distillation, but the mechanism still needs better isolation from general training effects.

The key takeaway is that this work tries to fix visual forgetting in distilled think-answer VLMs by masking the student's salient reasoning prefixes during training, forcing more reliance on the visual input. What is new here is the specific masking approach: token-wise selection of high-influence prefixes for each prediction step, plus a self-paced schedule that scales the masking budget with the student-teacher gap. The paper replaces the standard causal mask with one that blocks both future tokens and these salient cues.

It does a good job laying out the problem of long reasoning traces losing their visual connection and offers a distillation method that reportedly improves results over open-source VLMs and other distillation baselines on reasoning benchmarks. The further analyses of visual utilization along the thinking process are a positive step if they hold up.

The soft spots are around isolating the mechanism. Masking could simply make the optimization harder or act as a regularizer without specifically boosting visual evidence use, as the stress-test suggests. The abstract lacks details on baselines, error bars, or exact controls, so the central claim about enhanced visual anchoring is only partially supported right now. If the full paper has solid ablations showing the visual effect separate from other factors, that would strengthen it.

This paper is for people working on efficient multimodal models and knowledge distillation for reasoning tasks. A reader focused on practical improvements in VLM efficiency would get value from the techniques described. I would send this to peer review. The idea addresses a real scaling issue and the methods are concrete enough to merit referee feedback.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces a think-answer distillation framework for compact VLMs that applies two masking strategies—token-wise salient reasoning-prefix masking and self-paced masking budget scheduling—to block high-influence textual reasoning prefixes during training. This is intended to force the student to anchor its intermediate thinking steps on visual evidence rather than imitating the teacher's textual cues, thereby mitigating visual forgetting in long reasoning traces. The authors report that the resulting student models outperform recent open-source VLMs, VLM distillation baselines, and self-distillation methods on multimodal reasoning benchmarks, with additional analyses indicating improved visual utilization along the student's reasoning process.

Significance. If the central claims hold after rigorous controls, the work would provide a lightweight, architecture-agnostic technique for distilling complex multimodal reasoning into smaller models while explicitly promoting visual grounding. The self-paced scheduling and influence-based prefix selection are technically interesting contributions that could generalize beyond the reported setting. The paper would benefit from stronger isolation of the proposed mechanism from generic regularization effects.

major comments (3)

§3.1–3.2: The central mechanistic claim, that masking salient reasoning prefixes causes the student to substitute visual evidence for blocked textual cues, is not isolated from confounding training effects. No ablation is presented that compares the proposed masking against random prefix masking, uniform difficulty scaling, or equivalent loss-landscape alterations to demonstrate that gains arise specifically from visual anchoring rather than increased training difficulty.
§4 (Experiments): The reported outperformance on multimodal reasoning benchmarks lacks error bars, number of random seeds, statistical significance tests, and precise descriptions of baseline re-implementations and hyper-parameter matching. These omissions make it impossible to assess whether the claimed gains are robust or reproducible, which directly bears on the primary empirical claim.
§4.3 (Analyses): The 'further analyzes' confirming enhanced visual utilization are described only at a high level; the manuscript should include quantitative metrics (e.g., attention scores on visual tokens, grounding accuracy on intermediate steps) with controls that distinguish a causal effect from post-hoc correlation.
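
The control requested in the first major comment, salient versus random prefix masking at a matched budget, can be sketched as follows. The helper names and the influence scores are hypothetical; the point is only that both arms hide the same number of prefixes and differ in which ones they hide.

```python
import random


def select_salient(influence_row, candidates, budget):
    """Pick the `budget` candidate positions with the highest influence."""
    ranked = sorted(candidates, key=lambda j: influence_row[j], reverse=True)
    return set(ranked[:budget])


def select_random(candidates, budget, rng):
    """Pick `budget` candidate positions uniformly at random (control arm)."""
    k = min(budget, len(candidates))
    return set(rng.sample(candidates, k))


# Toy influence scores for one query token; positions 0-1 are protected.
influence_row = [0.05, 0.05, 0.4, 0.1, 0.3, 0.1]
candidates = [2, 3, 4, 5]
budget = 2

salient = select_salient(influence_row, candidates, budget)
control = select_random(candidates, budget, random.Random(0))

# Both arms mask the same number of prefixes, so any gap in downstream
# results cannot be attributed to the masking ratio alone.
```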

minor comments (3)

Abstract: The phrase 'further analyzes confirm' should be replaced with a brief enumeration of the specific analyses performed.
Notation: Define 'influence' for salient prefix selection explicitly (e.g., gradient-based, attention-based, or loss-based) and state whether it is computed on the teacher or student at each step.
Figure captions: Ensure all figures reporting benchmark scores include the exact number of evaluation samples and any filtering criteria applied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the isolation of our proposed mechanism and improving empirical robustness. We have revised the manuscript to address these points and provide detailed responses below.

Referee: §3.1–3.2: The central mechanistic claim, that masking salient reasoning prefixes causes the student to substitute visual evidence for blocked textual cues, is not isolated from confounding training effects. No ablation is presented that compares the proposed masking against random prefix masking, uniform difficulty scaling, or equivalent loss-landscape alterations to demonstrate that gains arise specifically from visual anchoring rather than increased training difficulty.

Authors: We agree that stronger isolation of the salient prefix masking effect is valuable. In the revised manuscript we have added an ablation comparing token-wise salient reasoning-prefix masking against random prefix masking performed at identical masking ratios and budgets. The results show that random masking produces noticeably smaller gains on the target benchmarks, indicating that the performance improvement is not explained by generic increases in training difficulty alone. We have also clarified in Section 3.2 that our self-paced schedule is driven by the evolving teacher-student distribution discrepancy rather than a fixed difficulty ramp, distinguishing it from uniform scaling.

revision: yes

Referee: §4 (Experiments): The reported outperformance on multimodal reasoning benchmarks lacks error bars, number of random seeds, statistical significance tests, and precise descriptions of baseline re-implementations and hyper-parameter matching. These omissions make it impossible to assess whether the claimed gains are robust or reproducible, which directly bears on the primary empirical claim.

Authors: We acknowledge the need for greater statistical transparency. The revised experimental section now reports mean performance with standard-deviation error bars computed across three independent random seeds for all main results. We have added paired t-test p-values comparing our method against each baseline and included these in the result tables. We have also expanded the description of baseline re-implementations, confirming that all methods were trained with identical data splits, optimizer settings, and total training steps to ensure fair hyper-parameter matching.

revision: yes

Referee: §4.3 (Analyses): The 'further analyzes' confirming enhanced visual utilization are described only at a high level; the manuscript should include quantitative metrics (e.g., attention scores on visual tokens, grounding accuracy on intermediate steps) with controls that rule out post-hoc correlation rather than causation.

Authors: We have substantially expanded Section 4.3. The revised version now reports quantitative attention scores averaged over visual tokens at each intermediate reasoning step, together with a visual grounding accuracy metric computed on a held-out set of examples. To address potential post-hoc correlation, we include an additional control ablation that applies masking without the self-paced schedule; the full method yields statistically higher visual attention and grounding scores than this control, supporting a causal contribution of the combined masking strategy.

revision: yes
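
The reporting described in the second response (mean and standard deviation over three seeds, plus a paired t-test against each baseline) can be sketched with the standard library alone. The scores below are toy values for illustration, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Toy benchmark accuracies for three seeds (illustrative only).
ours = [61.0, 62.0, 63.0]
baseline = [60.0, 60.5, 61.0]

mean_ours, std_ours = summarize(ours)
t_stat, dof = paired_t_statistic(ours, baseline)
```

With only three seeds the test has two degrees of freedom, so the t statistic must be large before the difference counts as significant; the p-value would be read from a t distribution with `dof` degrees of freedom.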

Circularity Check

0 steps flagged

No circularity; masking strategies are independent heuristic additions

The paper introduces token-wise salient reasoning-prefix masking and self-paced masking budget scheduling as new components in a think-answer distillation framework. These are described as mechanisms to encourage visual reliance by blocking textual cues, with no equations or derivations shown that reduce the claimed visual-anchoring effect to fitted parameters, self-citations, or prior ansatzes by construction. Performance gains are tied to benchmark experiments and qualitative analyses rather than any self-definitional loop or renamed known result. The derivation chain remains self-contained against external benchmarks, with the central premise being an empirical training intervention rather than a mathematical equivalence to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that visual evidence can substitute for masked textual reasoning cues; no free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)

masking budget schedule parameters
Self-paced scheduling is driven by teacher-student discrepancy, implying tunable thresholds or rates not specified in the abstract.

axioms (1)

domain assumption: Masking salient reasoning prefixes encourages the student to rely on visual evidence during distillation
This premise is invoked to justify the masking strategy as an alternative information source.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher–student distributions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
cs.CL · 2026-06 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...