Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

Dwip Dalal; Gautam Vashishtha; Heng Ji; Hyeonjeong Ha; Jeonghwan Kim; Madhav Kanda; Svetlana Lazebnik; Unnat Jain; Utkarsh Mishra

arXiv: 2510.09741 · v3 · submitted 2025-10-10 · cs.CV · cs.LG

Pith reviewed 2026-05-18 07:35 UTC · model grok-4.3

keywords: attention-guided warping; multimodal large language models; test-time image adaptation; visual question answering; hallucination reduction; resolution reallocation; compositional reasoning

The pith

AttWarp warps images using an MLLM's own attention maps to give more resolution to query-relevant areas while keeping global context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple test-time warping of the input image, directed by where the model attends across modalities, lets the same MLLM read small details and spatial relations more reliably. The warping expands important regions and compresses others without discarding any original pixels, so the global layout stays visible but fine-grained content becomes easier for the model to parse. Experiments across four different MLLMs and five benchmarks demonstrate higher accuracy, better compositional reasoning, and fewer hallucinations compared with both the original images and several other image-alteration baselines. A sympathetic reader would see this as evidence that existing models already contain the information they need but are limited by uniform resolution; reallocating it according to their own attention unlocks better performance without retraining or architectural changes.

Core claim

AttWarp reallocates spatial resolution in an input image by performing rectilinear warping guided by the MLLM's cross-modal attention maps, expanding regions the model deems relevant to the query while compressing less informative areas, thereby preserving all original information and global context yet making small objects and subtle relationships easier for the unchanged model to perceive.

What carries the argument

Attention-guided rectilinear warping that redistributes pixel resolution according to the MLLM's cross-modal attention maps without altering model weights or architecture.
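
The summary does not say how the warp is parameterized, so the following is only a minimal sketch of one plausible reading of "rectilinear warping": each axis is resampled independently through the inverse CDF of the attention marginal along that axis. The function names, the `floor` smoothing knob, and the linear interpolation are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def axis_map(profile, out_size, floor=0.2):
    """For each output pixel center, return a source coordinate in [0, n-1].

    profile: 1-D non-negative attention marginal of length n (positive total).
    floor:   share of uniform mass mixed in so no region collapses to zero
             width (hypothetical knob, not taken from the paper).
    """
    p = np.asarray(profile, dtype=np.float64)
    p = p / p.sum()
    p = (1.0 - floor) * p + floor / p.size           # density with a uniform floor
    cdf = np.concatenate([[0.0], np.cumsum(p)])      # CDF at the n+1 cell edges
    edges = np.arange(p.size + 1, dtype=np.float64)  # source cell edges 0..n
    u = (np.arange(out_size) + 0.5) / out_size       # output pixel centers in (0, 1)
    src = np.interp(u, cdf, edges) - 0.5             # inverse CDF, center coordinates
    return np.clip(src, 0.0, p.size - 1.0)

def resample_axis(img, src, axis):
    """Linearly interpolate img at fractional positions src along one axis."""
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, img.shape[axis] - 1)
    shape = [1] * img.ndim
    shape[axis] = -1
    t = (src - lo).reshape(shape)                    # interpolation weights
    return (1.0 - t) * np.take(img, lo, axis=axis) + t * np.take(img, hi, axis=axis)

def rectilinear_warp(image, attention, floor=0.2):
    """Warp an H x W (x C) image with separable, axis-aligned coordinate maps."""
    img = np.asarray(image, dtype=np.float64)
    att = np.asarray(attention, dtype=np.float64)
    h, w = img.shape[:2]
    # Marginal attention profiles, stretched from the coarse grid to pixel resolution.
    cols = np.interp(np.linspace(0, att.shape[1] - 1, w),
                     np.arange(att.shape[1]), att.sum(axis=0))
    rows = np.interp(np.linspace(0, att.shape[0] - 1, h),
                     np.arange(att.shape[0]), att.sum(axis=1))
    out = resample_axis(img, axis_map(rows, h, floor), axis=0)
    return resample_axis(out, axis_map(cols, w, floor), axis=1)
```

In this sketch a high attention marginal makes the CDF steep, so more output pixels sample from that band of the source image and the band is expanded. Because the horizontal map depends only on the column and the vertical map only on the row, straight horizontal and vertical lines stay straight. With uniform attention the warp reduces to the identity.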

If this is right

Accuracy rises on TextVQA, GQA, DocVQA, POPE, and MMMU for LLaVA, Qwen-VL, InternVL, and InstructBLIP.
Compositional reasoning improves because small spatial relations become higher-resolution.
Hallucinations decrease as the model receives clearer visual evidence for its answers.
The method outperforms four competitive test-time image-manipulation baselines while leaving model parameters untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention maps appear to function as a built-in importance detector that can be turned into a resolution-allocation signal.
The same warping principle could be tested on video frames or on tasks such as visual grounding where spatial precision matters.
Because the change is purely input-side, it can be stacked with other test-time techniques such as prompt tuning or self-consistency checks.

Load-bearing premise

The MLLM's attention maps reliably mark the regions whose expansion through warping will improve perception of the query rather than introduce distorting artifacts or hide critical context.

What would settle it

A set of queries where the model answers correctly on the original image but incorrectly on the attention-warped version because the warping creates visual artifacts or obscures necessary global layout.
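
One way to look for such a set is to compare per-example correctness before and after warping and list the examples that regress. The helper below is a hypothetical sketch that assumes exact-match scoring and three aligned lists of answers. Real benchmarks such as TextVQA use their own scoring rules, which would replace the equality test.

```python
from collections import Counter

def flip_table(gold, pred_original, pred_warped):
    """Count how each example's correctness changes when the input is warped.

    Returns (counts, regressions), where regressions lists the indices of
    examples answered correctly on the original image but not on the warped one.
    """
    counts = Counter()
    regressions = []
    for i, (g, a, b) in enumerate(zip(gold, pred_original, pred_warped)):
        ok_before, ok_after = (a == g), (b == g)
        if ok_before and not ok_after:
            counts["regressed"] += 1
            regressions.append(i)
        elif not ok_before and ok_after:
            counts["fixed"] += 1
        elif ok_before:
            counts["both_correct"] += 1
        else:
            counts["both_wrong"] += 1
    return counts, regressions
```

The regressed indices are the cases to inspect by hand for warping artifacts or lost layout cues. A net accuracy gain can hide a sizeable regressed group if the fixed group is larger.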

Original abstract

Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AttWarp is a straightforward test-time warping method that uses cross-modal attention to reallocate resolution and shows consistent gains on VQA benchmarks, but the bootstrap risk from low-res attention needs direct checks.

The core contribution is a test-time procedure that takes an MLLM's cross-modal attention on the original image, then applies rectilinear warping to stretch query-relevant regions while squeezing others, all without retraining or architecture changes. The experiments report accuracy lifts on TextVQA, GQA, DocVQA, POPE, and MMMU for LLaVA, Qwen-VL, InternVL, and InstructBLIP, beating four image-manipulation baselines and noting fewer hallucinations plus better compositional answers. That combination of attention-guided warping at inference time looks new relative to the cited priors, and the method is genuinely lightweight with the claim that global context stays intact because no pixels are dropped. The results are presented across multiple models and tasks, which gives the empirical side some weight.

The main soft spot is the bootstrap issue: attention maps come from the uniform low-resolution pass, so if the model already fails to surface small or cluttered details, the warp will reinforce the wrong areas. The paper would be stronger with explicit checks, such as attention visualizations before and after warping or ablations on cases where initial attention is diffuse. Implementation details on how attention is extracted and how the warp is parameterized also matter for reproducibility, though the abstract suggests the approach is procedural rather than learned.

This work fits readers who care about practical inference tweaks for visual grounding or document tasks. It is not field-changing but offers a clean, falsifiable idea with broad enough experiments to merit referee time. I would send it for review with a request for more attention-quality diagnostics.
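
The letter's "diffuse initial attention" cases could be flagged with a simple concentration score. The sketch below uses normalized Shannon entropy of the attention map as that score. This is an illustrative choice, not a diagnostic taken from the paper, and any threshold for calling a map diffuse would have to be chosen empirically.

```python
import numpy as np

def normalized_entropy(attention):
    """Shannon entropy of an attention map, scaled to [0, 1].

    Values near 1 mean attention is spread almost uniformly (diffuse);
    values near 0 mean it is concentrated on a few cells.
    Assumes a non-negative map with positive total and more than one cell.
    """
    a = np.asarray(attention, dtype=np.float64).ravel()
    p = a / a.sum()
    nz = p[p > 0]                       # treat 0 * log(0) as 0
    h = -(nz * np.log(nz)).sum()
    return h / np.log(p.size)
```

Bucketing benchmark examples by this score and reporting the accuracy change per bucket would show whether the gains hold when the initial attention is diffuse.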

Referee Report

3 major / 2 minor

Summary. The paper introduces AttWarp, a test-time method that uses an MLLM's cross-modal attention maps to perform rectilinear warping on the input image. This reallocates higher spatial resolution to query-relevant regions while compressing less informative areas, preserving all original information and global context without modifying model weights or architecture. Empirical evaluation across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, InstructBLIP) reports consistent accuracy gains, improved compositional reasoning, and reduced hallucinations relative to four image-manipulation baselines.

Significance. If the results hold, AttWarp provides a lightweight, training-free way to improve fine-grained perceptual grounding in MLLMs by redistributing resolution according to the model's own attention signals. The approach is notable for being parameter-free in the sense of not introducing new learned components and for preserving the full original image content while addressing the paper's stated motivation of missed small details in cluttered scenes.

major comments (3)

[Method] Method section: The central claim depends on cross-modal attention computed on the original low-resolution image reliably identifying regions that would benefit from increased resolution. However, this creates a bootstrap risk precisely where the motivation is strongest (small objects or relations missed due to insufficient resolution); the manuscript does not provide analysis or failure-case examples showing when attention succeeds or fails to surface these under-resolved elements before warping occurs.
[Experiments] Experimental evaluation: While aggregate improvements are reported across models and benchmarks, the manuscript provides insufficient detail on attention extraction (which layers/heads, aggregation method), the precise rectilinear warping implementation (grid sampling, interpolation, boundary handling), and any statistical significance testing or variance across runs. These omissions make it difficult to verify that the reported gains are robust and attributable to the proposed mechanism rather than implementation specifics.
[Experiments] Baselines and ablation: The comparison to four competitive test-time image-manipulation baselines is central to the claim of superiority, yet the manuscript does not fully specify how those baselines were reimplemented or controlled for global context preservation, leaving the relative advantage of attention-guided warping incompletely substantiated.
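
The significance testing requested in the second comment could take several forms. As one hedged example, a paired bootstrap over per-example correctness gives a confidence interval for the accuracy difference between warped and original inputs. The function name and defaults below are illustrative and not from the paper.

```python
import numpy as np

def paired_bootstrap(correct_base, correct_new, n_boot=2000, seed=0):
    """Bootstrap a 95% interval for the accuracy gain on the same examples.

    correct_base, correct_new: aligned 0/1 (or bool) sequences marking whether
    each example was answered correctly without and with the intervention.
    Returns (mean_gain, lower, upper).
    """
    base = np.asarray(correct_base, dtype=np.float64)
    new = np.asarray(correct_new, dtype=np.float64)
    diff = new - base                                   # per-example paired difference
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot = diff[idx].mean(axis=1)                       # resampled mean gains
    lower, upper = np.percentile(boot, [2.5, 97.5])
    return diff.mean(), lower, upper
```

Resampling the paired differences, rather than the two accuracy columns separately, keeps the comparison on identical examples. An interval that excludes zero suggests the gain is not an artifact of which examples happened to be sampled.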

minor comments (2)

[Abstract / Method] The abstract and method description use 'rectilinear warping' without a concise definition or reference to the underlying transformation; adding a short formal description or equation would aid clarity.
[Figures] Figure captions and legends should explicitly label attention overlays, warped vs. original grids, and any quantitative metrics shown to make the visual evidence self-contained.
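
As an illustration of the kind of short formal description the first minor comment asks for, one possible formalization follows. It is an assumption about what "rectilinear" means here, written in normalized image coordinates, and is not an equation quoted from the paper. The symbols $a$, $m_x$, $m_y$, $F_x$, and $F_y$ are introduced only for this sketch.

```latex
% a(x, y) >= 0 : attention density over the image, with (x, y) in [0,1]^2
% I(x, y)      : original image,  I'(u, v) : warped image
\begin{align}
  m_x(x) &= \int_0^1 a(x, y)\,dy, &
  m_y(y) &= \int_0^1 a(x, y)\,dx, \\
  F_x(x) &= \frac{\int_0^x m_x(s)\,ds}{\int_0^1 m_x(s)\,ds}, &
  F_y(y) &= \frac{\int_0^y m_y(s)\,ds}{\int_0^1 m_y(s)\,ds}, \\
  I'(u, v) &= I\bigl(F_x^{-1}(u),\, F_y^{-1}(v)\bigr).
\end{align}
```

Under this reading the source column $x$ lands at $u = F_x(x)$, so the local horizontal magnification is $F_x'(x)$, which is proportional to $m_x(x)$. Columns with a high attention marginal are stretched and the rest are compressed. The maps are invertible whenever $m_x$ and $m_y$ are strictly positive, and each map depends on a single coordinate, so axis-aligned lines remain straight.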

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each of the major comments point by point below and commit to revising the manuscript to incorporate clarifications and additional analyses where appropriate.

Referee: [Method] Method section: The central claim depends on cross-modal attention computed on the original low-resolution image reliably identifying regions that would benefit from increased resolution. However, this creates a bootstrap risk precisely where the motivation is strongest (small objects or relations missed due to insufficient resolution); the manuscript does not provide analysis or failure-case examples showing when attention succeeds or fails to surface these under-resolved elements before warping occurs.

Authors: We agree that the bootstrap risk is an important consideration, particularly for the motivation involving small objects missed due to low resolution. Our results show consistent gains, indicating that the attention from the original image is sufficiently informative in practice for the evaluated tasks. However, to address this concern directly, we will add a new subsection in the revised manuscript with qualitative examples of attention maps, failure cases where attention misses small details, and discussion of when the method is most effective.
Revision: yes
Referee: [Experiments] Experimental evaluation: While aggregate improvements are reported across models and benchmarks, the manuscript provides insufficient detail on attention extraction (which layers/heads, aggregation method), the precise rectilinear warping implementation (grid sampling, interpolation, boundary handling), and any statistical significance testing or variance across runs. These omissions make it difficult to verify that the reported gains are robust and attributable to the proposed mechanism rather than implementation specifics.

Authors: We appreciate this feedback on reproducibility. In the revised manuscript, we will provide detailed specifications for attention extraction, including the specific layers and heads from which cross-modal attention is extracted and the aggregation strategy employed. We will also describe the rectilinear warping implementation in full, covering grid sampling, interpolation method, and boundary handling. Additionally, we will include statistical significance tests and report variance or standard deviations across runs to demonstrate robustness.
Revision: yes
Referee: [Experiments] Baselines and ablation: The comparison to four competitive test-time image-manipulation baselines is central to the claim of superiority, yet the manuscript does not fully specify how those baselines were reimplemented or controlled for global context preservation, leaving the relative advantage of attention-guided warping incompletely substantiated.

Authors: We acknowledge that more details on the baseline implementations are necessary for a complete comparison. In the revision, we will expand the experimental section to fully describe the reimplementation of each baseline, including any parameters chosen, and explicitly state how global context was preserved in each case (e.g., through full-image transformations without information loss). This will better substantiate the advantages of our attention-guided approach.
Revision: yes
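
The attention-extraction details promised in the second response are not given in this summary. Purely to illustrate what such a specification has to pin down (which layers, how heads and query tokens are pooled, how patches map to a grid), here is a hypothetical aggregation sketch. The array layout and the plain averaging are assumptions, not the authors' procedure.

```python
import numpy as np

def aggregate_attention(attn, grid_hw, layers=None):
    """Pool text-to-image attention weights into one 2-D map.

    attn:    array of shape (num_layers, num_heads, num_text_tokens, num_patches)
             (assumed layout; real models expose attention differently).
    grid_hw: (rows, cols) of the image patch grid, rows * cols == num_patches.
    layers:  optional iterable of layer indices to keep; None keeps all layers.
    """
    a = np.asarray(attn, dtype=np.float64)
    if layers is not None:
        a = a[list(layers)]
    pooled = a.mean(axis=(0, 1, 2))          # average over layers, heads, text tokens
    rows, cols = grid_hw
    if pooled.size != rows * cols:
        raise ValueError("patch count does not match grid_hw")
    grid = pooled.reshape(rows, cols)
    return grid / grid.sum()                 # normalize to a distribution over patches
```

Each choice in this sketch (mean versus max pooling, early versus late layers, all text tokens versus only answer-relevant ones) can change the resulting map. That sensitivity is the reason the referee asks for these settings to be reported.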

Circularity Check

0 steps flagged

No circularity: empirical test-time procedure with independent benchmark validation

The paper introduces AttWarp as a lightweight test-time procedure that applies an MLLM's existing cross-modal attention maps to perform rectilinear image warping. No equations, parameters, or predictions are fitted within the paper and then re-used as outputs. The central claims rest on empirical accuracy gains across five benchmarks and four models, which are externally falsifiable and not derived from self-referential definitions or self-citations that bear the load of the result. The method is self-contained as a procedural intervention whose effectiveness is measured directly against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new procedural method but rests on standard domain assumptions about attention mechanisms and image information preservation under warping; no new entities or heavily fitted parameters are evident from the abstract.

axioms (1)

Domain assumption: Cross-modal attention maps from MLLMs can be used to identify query-relevant image regions for resolution reallocation.
Invoked as the basis for guiding the warping process in the method description.
