pith. machine review for the scientific record.

arxiv: 2604.16587 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Recognition: unknown

Real-Time Visual Attribution Streaming in Thinking Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual attribution · multimodal reasoning · amortized estimation · real-time streaming · attention features · causal faithfulness · visual grounding · thinking models

The pith

An amortized estimator learns causal visual effects from attention features to enable real-time attribution streaming in multimodal reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to deliver faithful visual attributions for models that reason over images while generating code or solving math problems. Exhaustive causal checks are accurate but too slow for live use, and raw attention maps arrive instantly but lack reliability. The solution trains a lightweight estimator on the model's internal attention signals to predict which image regions actually drive each reasoning step. This matches the accuracy of full causal methods across five benchmarks and four models while allowing users to watch the supporting visual evidence appear as the model thinks.
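The streaming claim above hinges on one observation: every decoding step already produces attention features, so a trained estimator can emit a per-step attribution alongside each token with no extra model passes. A minimal sketch of that loop, with a toy stand-in for the model and a trivial linear estimator (all names and shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
R = 6  # number of semantic image regions (assumed)

def fake_decode_step(step):
    """Stand-in for one forward pass: returns (token, attention features)."""
    return f"tok{step}", rng.random(R)

def stream_with_attribution(estimator_w, n_steps):
    """Yield (token, top evidence region) pairs as generation proceeds."""
    for step in range(n_steps):
        token, feats = fake_decode_step(step)
        effects = feats * estimator_w          # trivial linear estimator
        yield token, int(np.argmax(effects))   # evidence streamed per step

w = rng.random(R)
trace = list(stream_with_attribution(w, 4))
assert len(trace) == 4
```

The point of the sketch is the interleaving: attribution falls out of features the forward pass computes anyway, which is why the per-step cost stays near zero.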

Core claim

The authors introduce an amortized framework that trains an estimator to directly predict the causal impact of semantic image regions on the model's output using only the internal attention features as input. This replaces the need for expensive repeated perturbations or backward passes, achieving comparable faithfulness scores across five benchmarks and four different thinking models while supporting real-time attribution streaming.

What carries the argument

The amortized estimator that maps attention features directly to approximated causal effects of semantic image regions.
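Figure 14 suggests the estimator is, at heart, a weight matrix over (layer, head) pairs. A hedged sketch of that mapping, scoring each region's attention mass with learned weights w ∈ ℝ^(L×H); the shapes and random inputs are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

L, H, R = 24, 16, 12          # layers, heads, semantic regions (assumed)
# attn[l, h, r]: attention mass from the current token to region r
attn = rng.random((L, H, R))
w = rng.normal(size=(L, H))   # learned per-(layer, head) weights

def predict_effects(attn, w):
    """Predicted causal effect per region: a w-weighted sum of attention mass."""
    return np.einsum("lh,lhr->r", w, attn)

effects = predict_effects(attn, w)
assert effects.shape == (R,)
top_region = int(np.argmax(effects))  # region surfaced as evidence this step
```

One forward pass yields `attn`; a single einsum then replaces the repeated perturbations or backward passes that exhaustive causal methods require.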

If this is right

  • Users receive grounding evidence while the model is still generating its reasoning trace rather than after completion.
  • Verification of visual reliance becomes feasible for long, multi-step reasoning without repeated expensive computations.
  • The same lightweight estimator works across code-generation-from-screenshot and image-based math tasks.
  • Performance holds for four distinct thinking models after a single training pass on attention data.
  • Faithful attribution no longer requires brute-force causal computation once the estimator is learned.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interactive interfaces could surface the streaming attributions to let users intervene mid-reasoning when evidence looks weak.
  • The same amortization idea might reduce cost for other expensive attribution techniques in language or audio models.
  • If attention features prove broadly sufficient, similar estimators could be trained once and reused across many downstream tasks.
  • Deployment in consumer applications becomes realistic because the per-step cost stays low after the initial training.

Load-bearing premise

The signals present in attention features contain enough information to train an estimator that approximates the true causal effects of semantic regions.

What would settle it

An experiment in which the learned estimator's region rankings or effect sizes diverge substantially from those produced by exhaustive causal methods on a new task or model where attention and causation are known to differ.
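A concrete form of that test: rank regions by the estimator's predicted effects and by exhaustive causal effect sizes, then check rank agreement. A minimal sketch with a hand-rolled Spearman correlation; the data and the divergence threshold are illustrative assumptions:

```python
import numpy as np

def rankdata(x):
    """Ordinal ranks (no tie handling; sufficient for a sketch)."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks

def spearman(a, b):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

predicted = np.array([0.9, 0.1, 0.4, 0.7, 0.2])   # estimator's effects
exhaustive = np.array([0.8, 0.2, 0.5, 0.6, 0.1])  # ground-truth logit drops

rho = spearman(predicted, exhaustive)
diverges = rho < 0.5  # illustrative threshold for "substantial divergence"
```

A task or model where `diverges` comes out true, while attention and causation are known to differ, would be the falsifying result the section describes.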

Figures

Figures reproduced from arXiv: 2604.16587 by Jinyeong Kim, Junhyeok Kim, Seil Kang, Seong Jae Hwang, Woojung Han, Youngeun Kim.

Figure 1. Faithfulness–Efficiency Trade-off. (a) Baseline methods compromise either efficiency or faithfulness; our approach achieves both (R² of predicted vs. actual logit drops). (b) Latency scaling with context length: unlike the baseline, which exhibits linear cost growth and OOM errors on long traces, our method operates with constant overhead below the real-time threshold.
Figure 2. Comparison of attribution processes between our method and baselines.
Figure 3. Overview of our amortized attribution pipeline. (a) Training: semantic region unitization identifies semantic regions of an input image using DINO features, and a lightweight estimator is optimized to predict causal importance from attention patterns. (b) Inference: once trained, the estimator generalizes to other samples and computes visual attribution in parallel with the model's token generation.
Figure 4. Predicted vs. actual ablation effects. Each point represents a region's predicted effect versus its ground-truth log-probability drop.
Figure 5. Qualitative comparison on a real-world sample (Qwen3-VL). VSTREAM emits per-step visual attributions alongside the model's reasoning at near-zero latency, while prior methods only run post hoc once generation has finished.
Figure 6. Comparison of three region unitization strategies on Qwen3-VL: random rectangular blocks, regular Voronoi tessellation, and our DINOv3-based semantic clustering. The clustering approach significantly outperforms both geometric alternatives on the LDS metric; random block and Voronoi partitions use a fixed grid, placing a ceiling on performance, whereas DINOv3-based clustering adaptively adjusts regions…
Figure 7. Reasoning trajectories in visual attribution space. Each curve traces the evolution of a region-effect vector across reasoning steps, projected into 3D via PCA; the figure shows views from five different angles (θ). The thinking process begins at the circular point and terminates at the square point. Successful reasoning chains (orange) follow compact, directed paths that converge toward stable visual grounding.
Figure 9. Distribution of trajectory metrics. Unsuccessful reasoning chains exhibit higher path length and tortuosity than correct chains (p < 10⁻⁴, n = 1500 each). The greater spread and outliers among unsuccessful samples reflect unstable visual grounding during failed reasoning.
Figure 10. Attribution concentration and early failure detection. (Left) Mean concentration over normalized reasoning steps (mean ± SEM), grouped by outcome type. (Right) Tortuosity-based failure prediction AUC over reasoning progress.
Figure 11. Trajectory metrics across three outcome categories (POPE, n = 3,000). Both reasoning failures and hallucinations exhibit higher path length and tortuosity than successful chains (p < .001, Bonferroni-corrected). The two error types are distinguished by concentration: hallucinations maintain sustained high concentration (fixation), while reasoning failures show unstable attention (wandering).
Figure 13.
Figure 14. Estimator weight heatmaps w ∈ ℝ^(L×H) across four architectures. All models concentrate weight in early-to-mid layers, suggesting a consistent architectural prior: early layers encode coarse visual-semantic alignment that is most predictive of ablation effects.
Figure 15. Generation length distribution across training examples. The distribution is bimodal: short VQA responses (∼400 tokens) and long reasoning traces (∼3,200 tokens).
Figure 16. Faithfulness (R²) by generation-length quintile. No systematic degradation as context length increases, indicating that the estimator generalizes across short VQA and long reasoning traces.
Figure 17. OOD attribution on VQA-RAD (medical radiology). The estimator, trained on natural images only and applied without retraining, correctly localizes kidneys in abdominal CT (top, bottom) and a pacemaker in chest X-ray (middle).
Figure 18. Reasoning trajectory dynamics for Qwen3-VL-8B-Thinking. Visual attribution trajectories projected into PCA space for successful (left) and unsuccessful (right) reasoning chains.
Figure 19. Reasoning trajectory dynamics for GLM-4.1V-9B-Thinking. Visual attribution trajectories projected into PCA space for successful (left) and unsuccessful (right) reasoning chains.
Figure 20. Reasoning trajectory dynamics for Cosmos-R1. Visual attribution trajectories projected into PCA space for successful (left) and unsuccessful (right) reasoning chains.
Figure 21. Reasoning trajectory dynamics for MiMo-VL-7B. Visual attribution trajectories projected into PCA space for successful (left) and unsuccessful (right) reasoning chains.
Figure 22. Additional qualitative results: Qwen3-VL on general/document/code reasoning.
Figure 23. Additional qualitative results: Qwen3-VL on math/science reasoning.
Figure 24. Additional qualitative results: Qwen3-VL on general/document/code reasoning.
Figure 25. Additional qualitative results: Qwen3-VL on math/science reasoning.
Figure 26. Additional qualitative results: Qwen3-VL on general/document/code reasoning.
Figure 27. Additional qualitative results: Qwen3-VL on math/science reasoning.
Figure 28. Additional qualitative results: GLM-4.1V on general/document/code reasoning.
Figure 29. Additional qualitative results: GLM-4.1V on math/science reasoning.
Figure 30. Additional qualitative results: GLM-4.1V on general/document/code reasoning.
Figure 31. Additional qualitative results: GLM-4.1V on math/science reasoning.
Figure 32. Additional qualitative results: Cosmos-R1 on general/document/code reasoning.
Figure 33. Additional qualitative results: Cosmos-R1 on math/science reasoning.
Figure 34. Additional qualitative results: Cosmos-R1 on general/document/code reasoning.
Figure 35. Additional qualitative results: Cosmos-R1 on math/science reasoning.
Figure 36. Additional qualitative results: MiMo-VL on general/document/code reasoning.
Figure 37. Additional qualitative results: MiMo-VL on math/science reasoning.
Figure 38. Additional qualitative results: MiMo-VL on general/document/code reasoning.
Figure 39. Additional qualitative results: MiMo-VL on math/science reasoning.
read the original abstract

We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps, though instantly accessible, lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an amortized framework for real-time visual attribution streaming in multimodal thinking models. It learns to estimate the causal effects of semantic regions directly from attention features, avoiding the cost of exhaustive causal methods such as repeated backward passes or perturbations. The central claim is that this approach achieves faithfulness comparable to exhaustive baselines across five diverse benchmarks and four thinking models, while enabling users to observe grounding visual evidence during reasoning rather than after.

Significance. If validated with rigorous metrics and training details, the result would offer a practical path to scalable, real-time interpretability for vision-language models performing complex reasoning tasks such as code generation from images or visual math solving. By amortizing causal attribution into a lightweight estimator, the work could reduce the computational barrier that currently limits faithful attribution in long reasoning traces, supporting interactive applications where grounding evidence is streamed alongside model outputs.

major comments (2)
  1. [Abstract] Abstract: The claim that the approach 'achieves faithfulness comparable to exhaustive causal methods' is unsupported by any quantitative metrics, benchmark details, error analysis, or description of how the amortized model was trained and validated on held-out data; without these, the central empirical claim cannot be assessed.
  2. [Abstract] Abstract: The training targets for the estimator are not specified, raising a circularity concern: if supervision is derived from the same causal computations (e.g., exhaustive perturbations or backward passes) that the method aims to replace, the learned model may simply approximate those computations rather than independently recover causal effects from attention features alone.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'thinking models' is introduced without a precise definition or reference to the specific multimodal architectures considered, which may hinder readers' understanding of the scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point by point below. Where the concerns highlight gaps in the abstract's self-contained presentation of results and methods, we have revised the manuscript to incorporate the requested details while preserving the original technical approach.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the approach 'achieves faithfulness comparable to exhaustive causal methods' is unsupported by any quantitative metrics, benchmark details, error analysis, or description of how the amortized model was trained and validated on held-out data; without these, the central empirical claim cannot be assessed.

    Authors: We agree that the abstract's brevity omits the quantitative support and training details that appear in the full manuscript. Sections 4 and 5 report results on five benchmarks (visual math, code generation from screenshots, and three additional multimodal reasoning tasks) across four thinking models, using faithfulness metrics including insertion/deletion AUC and Pearson correlation with exhaustive causal effects. The amortized estimator was trained on causal labels from a disjoint training split and validated on held-out data, with error analysis provided in the supplementary material. We have revised the abstract to include key quantitative highlights (e.g., average faithfulness within 4-7% of exhaustive baselines) and a concise statement of the evaluation protocol so that the central claim is assessable directly from the abstract. revision: yes

  2. Referee: [Abstract] Abstract: The training targets for the estimator are not specified, raising a circularity concern: if supervision is derived from the same causal computations (e.g., exhaustive perturbations or backward passes) that the method aims to replace, the learned model may simply approximate those computations rather than independently recover causal effects from attention features alone.

    Authors: The training targets are the per-region causal effect scores computed once via exhaustive methods on a fixed training corpus of reasoning traces. The estimator (a lightweight network) is then trained to regress these scores from attention feature vectors alone. This is standard supervised amortization: the expensive causal computation occurs only during offline training and is never repeated at inference. During real-time streaming, the model uses only the attention features already produced by the forward pass. We have added an explicit description of the training objective, the train/inference separation, and a pipeline diagram in the revised Methods section to eliminate any ambiguity on this point. revision: yes
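The train/inference separation the rebuttal describes is standard supervised amortization, and it can be sketched compactly: expensive causal labels are computed once offline, then a lightweight regressor maps attention features to those labels, and inference uses features alone. The synthetic data, the closed-form ridge solver, and all shapes below are assumptions for illustration, not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

n_train, d = 500, 64                    # labeled regions, feature dim (~L*H)
X = rng.normal(size=(n_train, d))       # attention features per region
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n_train)  # offline causal labels
# (in the paper's setting, y would come from exhaustive perturbations,
#  computed once on a fixed training corpus)

# One-time training: ridge regression in closed form.
lam = 1e-3
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Inference: attribution from attention features alone, no perturbations.
x_new = rng.normal(size=d)
pred = float(x_new @ w_hat)
```

The circularity question then reduces to an empirical one: whether `w_hat`, fit on one split, still tracks exhaustive causal effects on held-out tasks where attention and causation may come apart.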

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The abstract presents an amortized learning framework that trains an estimator on attention features to approximate causal effects, with faithfulness evaluated empirically against exhaustive causal baselines. No equations, self-citations, uniqueness theorems, or definitional reductions are quoted that would make any prediction equivalent to its inputs by construction. The approach is a standard supervised approximation whose performance is measured externally rather than forced by the training targets themselves. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention features contain sufficient causal information about image regions to support amortized estimation; no free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption Attention features encode information about the causal effects of semantic regions on model outputs
    The amortized estimator is trained to predict causal effects directly from these features.

pith-pipeline@v0.9.0 · 5448 in / 1303 out tokens · 49236 ms · 2026-05-10T09:08:23.532165+00:00 · methodology

discussion (0)

