Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Chanyong Yoon; Seong Jae Hwang; Sujung Hong

arxiv: 2605.14530 · v2 · pith:57CUSM3Pnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-20 21:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion vision-language models · mask prior drift · positional attention · RoPE scaling · long-form generation · visual grounding · training-free methods · iterative unmasking

The pith

Mask prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in diffusion vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large diffusion vision-language models decode text in parallel by iteratively revealing masked tokens, yet this process produces repetitive strings because the hidden states of mask tokens gradually shift toward one shared direction. The same iterative schedule also misaligns with the fixed positional attention bias, so the model under-attends to informative image tokens and loses visual grounding. The authors trace both problems to these concrete mechanisms and introduce two adjustments that operate only at inference time: one suppresses the accumulating mask prior and the other monotonically scales rotary embeddings to match the unmasking order. When these changes are applied, models generate longer, less repetitive descriptions and show clearer image grounding on standard multimodal and grounding benchmarks.

Core claim

The paper shows that repetitive outputs arise because mask-token hidden representations drift toward a shared prior direction across denoising steps, while degraded visual grounding follows from a mismatch between static positional attention biases and the changing set of unmasked tokens. It introduces Mask Prior Suppression to reduce that drift and Monotonic RoPE Scaling to realign attention with the iterative unmasking schedule, both without retraining or extra parameters.
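As a rough illustration of the suppression idea summarized above, the following minimal sketch removes part of a hidden state's component along a given prior direction, with the amount removed gated by cosine similarity. The gating rule, the parameter `lam`, and the function names are assumptions made for this example; the paper's exact formulation is not reproduced on this page.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))


def suppress_prior(h, prior_dir, lam=1.0):
    """Subtract a gated share of h's component along prior_dir.

    The gate min(1, lam * max(cos, 0)) is an illustrative choice,
    not the formula used by the authors.
    """
    u_norm = norm(prior_dir)
    h_norm = norm(h)
    if u_norm == 0.0 or h_norm == 0.0:
        return list(h)
    u = [x / u_norm for x in prior_dir]   # unit prior direction
    coeff = dot(h, u)                     # signed length of h along u
    cos = coeff / h_norm                  # alignment of h with the prior
    gate = min(1.0, lam * max(cos, 0.0))  # stronger alignment, stronger suppression
    return [x - gate * coeff * ux for x, ux in zip(h, u)]
```

With this gate, a hidden state orthogonal or opposed to the prior direction passes through unchanged, while a strongly aligned one loses most of its prior component.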

What carries the argument

Mask Prior Suppression, which counters the progressive drift of mask-token representations, together with Monotonic RoPE Scaling, which adjusts positional biases to follow the order of unmasking.
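To make the positional half of the argument concrete, here is a minimal sketch of frequency-dependent RoPE scaling in which lower-frequency components are slowed more than higher-frequency ones. The linear ramp over the frequency index and the parameter `max_scale` are assumptions for illustration only; the standard inverse-frequency formula `base ** (-2i / d)` is the usual RoPE definition, but the authors' actual scaling schedule is not given on this page.

```python
def rope_inv_freqs(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, from highest (i = 0) to lowest."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def monotonic_scaled_freqs(head_dim, max_scale=8.0, base=10000.0):
    """Divide each frequency by a factor that grows monotonically with index.

    The highest frequency is left untouched (factor 1.0) and the lowest is
    divided by max_scale. The linear ramp is a hypothetical schedule.
    """
    freqs = rope_inv_freqs(head_dim, base)
    n = len(freqs)
    scaled = []
    for i, f in enumerate(freqs):
        t = i / (n - 1) if n > 1 else 0.0  # 0.0 at highest freq, 1.0 at lowest
        factor = 1.0 + (max_scale - 1.0) * t
        scaled.append(f / factor)
    return scaled
```

Dividing a frequency by a factor is equivalent to dividing positions by that factor for that component, so distant tokens look closer in the low-frequency bands while local ordering in the high-frequency bands is mostly preserved.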

Load-bearing premise

The problems of repetition and poor grounding are caused mainly by mask-token prior drift and positional attention misalignment, and the two proposed decoding adjustments fix those causes without creating new side effects.

What would settle it

Apply the two adjustments to a baseline LDVLM on a long-form description benchmark and measure whether the rate of repeated phrases falls and whether grounding accuracy on visual referring tasks rises; if both metrics remain unchanged, the claimed mechanisms are not the primary drivers.
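The repetition side of this test can be sketched with simple n-gram statistics. The snippet below computes distinct-n (unique n-grams over total n-grams) and a repetition ratio defined here as its complement; that definition of the repetition ratio is an assumption for illustration and may differ from the one used in the paper.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n):
    """Fraction of n-grams that are unique; higher means less repetition."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def repetition_ratio(tokens, n):
    """Fraction of n-grams that duplicate another n-gram (assumed definition)."""
    grams = ngrams(tokens, n)
    return 1.0 - len(set(grams)) / len(grams) if grams else 0.0
```

Running both functions on baseline and adjusted outputs for the same prompts would give the before/after comparison the paragraph above asks for.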

Figures

Figures reproduced from arXiv: 2605.14530 by Chanyong Yoon, Seong Jae Hwang, Sujung Hong.

**Figure 1.** Failure case of LDVLMs. Under parallel decoding with 64 generation tokens and 16 generation steps, LLaDA-V produces highly repetitive phrases as highlighted in red, and exhibits degraded visual grounding as highlighted in gray.

**Figure 2.** Visualization of token repetition and mask prior drift. (a) Distinct-n (left) and repetition ratio (right) across different numbers of generation steps. Fewer generation steps lead to lower distinct-n and higher repetition. (b) 3D PCA trajectories of hidden states for the vocabulary mean embedding and the uncontextualized mask token, which converge to a similar region at the final layer (L31). (c) Cosine s…

**Figure 3.** Visualization of positional attention collapse. (a) Mean attention weight (log scale) across relative distance, showing stronger attention to mask tokens than visual tokens at similar distances and an overall decreasing trend in attention to visual tokens as relative distance increases. (b) Sum of attention to visual and mask tokens per generation token across generation steps, revealing a persistent alloc…

**Figure 4.** Overview of the proposed model. (a) Mask prior suppression. The final hidden state $h^{e_j}_L$ is decomposed along the prior direction $u^{\hat{e}}_L$, and prior components are adaptively suppressed based on cosine similarity. (b) Monotonic RoPE scaling. Low-frequency RoPE components, which govern long-range positional interactions, are scaled more strongly than high-frequency components to preserve attention to distant…

**Figure 5.** Qualitative comparison on visual grounding and long-form generation. (a) RefCOCOg results. A red box indicates the target region. The baseline model LLaDA-V produces descriptions referring to an incorrect object shown in gray, whereas our method correctly grounds the description to the target location and achieves more accurate visual grounding shown in blue. (b) MIA results. The baseline model exhibits re…

**Figure 6.** Results of LaViDa on DetailCaps with varying generation steps. Dashed lines: LaViDa, solid lines: Ours. (a) Distinct-n scores of our method exhibit an increasing trend across the evaluated settings and remain consistently higher than those of the baseline. (b) The repetition ratio under our method shows a decreasing trend across the evaluated settings and remains consistently lower than that of the base…

**Figure 7.** Visualization of result analysis. (a) Box plot of cosine similarity between contextualized mask tokens and the vocabulary mean, showing consistent reduction across generation steps. (b) Relative change in attention with respect to relative distance, where attention to distant visual tokens increases compared to the baseline, while attention to mask tokens is preserved or reduced.

**Figure 8.** Visualization of mask prior drift on LaViDa. (a) 3D PCA trajectories of hidden states for the vocabulary mean embedding and the uncontextualized mask token, which converge to a similar region at the final layer (L31). (b) Cosine similarity between contextualized mask token embeddings and the vocabulary mean, showing consistently stronger alignment than random embeddings, especially with fewer generation st…

**Figure 9.** Visualization of positional attention collapse on LaViDa. (a) Mean attention weight across relative distance (log scale), showing stronger attention to mask tokens than visual tokens at similar distances and a monotonic decay for visual tokens. (b) Sum of attention to visual and mask tokens per generation token across generation steps, revealing a persistent allocation of comparable attention weights to mas…

**Figure 10.** Visualization of result analysis on LaViDa. (a) Box plot of cosine similarity between contextualized mask tokens and the vocabulary mean, showing consistent reduction across generation steps. (b) Relative change in attention with respect to relative distance, where attention to distant visual tokens increases compared to the baseline, while attention to mask tokens is preserved or reduced.

**Figure 11.** Relative performance changes on DetailCaps across generation steps using LLaDA-V. Dashed lines: LLaDA-V, solid lines: Ours. (a) ∆Distinct-n (Ours – Base) shows consistent gains, with larger improvements at moderate to larger generation steps. (b) ∆Repetition ratio (Base – Ours) remains positive across most steps, indicating reduced repetition, with the strongest reductions observed at intermediate steps. …

**Figure 12.** Generation step analysis and DetailCaps performance on LaViDa. (a) Top-5 logits of the uncontextualized mask token M, where the |eot| token consistently receives the highest logit. (b) CAPTURE scores on the DetailCaps benchmark as a function of generation steps. Contrary to a standard speed–quality trade-off, performance peaks at 16 steps and degrades with additional steps. (c) Qualitative examples showin…

**Figure 13.** Qualitative results on RefCOCOg using LLaDA-V. The red bounding boxes indicate the target regions in the image.

**Figure 14.** Qualitative results on Ferret using LLaDA-V. The red bounding boxes indicate the target regions in the image.

**Figure 15.** Qualitative results on LLaVA-Bench using LLaDA-V.

**Figure 16.** Qualitative results on MIA using LLaDA-V.

**Figure 17.** Qualitative results on RefCOCOg using LaViDa. The red bounding boxes indicate the target regions in the image.

**Figure 18.** Qualitative results on Ferret using LaViDa. The red bounding boxes indicate the target regions in the image.

**Figure 19.** Qualitative results on LLaVA-Bench using LaViDa.

**Figure 20.** Qualitative results on MIA using LaViDa.

Original abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper spots mask prior drift and positional misalignment in diffusion VLMs and offers two simple training-free fixes, but the causal links rest on correlational observations rather than isolating tests.

The paper identifies two concrete problems during long-form generation in large diffusion vision-language models: mask tokens drifting toward a shared prior in their hidden states, and positional attention bias clashing with the iterative unmasking schedule. From there it introduces Mask Prior Suppression and Monotonic RoPE Scaling as lightweight, training-free corrections that can be dropped in at decode time.

Referee Report

3 major / 2 minor

Summary. The paper identifies repetitive generation and degraded visual grounding as key failure modes in large diffusion vision-language models (LDVLMs) under long-form generation. It attributes the former to progressive drift of mask-token hidden states toward a shared prior and the latter to misalignment between positional attention bias and the iterative unmasking schedule. The authors introduce two training-free fixes—Mask Prior Suppression and Monotonic RoPE Scaling—and report that these yield improvements over baseline LDVLMs on multimodal benchmarks and visual grounding tasks, with stronger gains on long-form description benchmarks.

Significance. If the causal mechanisms are substantiated and the interventions shown to act specifically on them, the work would supply a practical, plug-and-play method for stabilizing long-form output in diffusion-based VLMs. This is potentially significant because LDVLMs offer parallel decoding efficiency advantages over autoregressive models, and the paper's emphasis on decoding-time mechanistic diagnosis rather than retraining aligns with current interest in understanding and controlling these architectures.

major comments (3)

[§4] §4 (Experiments): the manuscript states that experiments demonstrate improvements and 'robust gains on long-form description benchmarks' yet supplies no quantitative tables, per-metric scores, ablation breakdowns separating the two interventions, or error bars. Without these, the link between the proposed fixes and the claimed reductions in repetition or gains in visual grounding cannot be verified.
[§3.1–3.2] §3.1–3.2 (Method): the central claim that mask-token prior drift is the primary driver of repetitive generation is presented as an observation, but the results combine both interventions without an isolating ablation that applies Mask Prior Suppression alone while holding the unmasking schedule fixed. This leaves open the possibility that reported gains arise from incidental regularization or entropy changes rather than targeted prior suppression.
[§3.3] §3.3 (Monotonic RoPE Scaling): the paper asserts that the scaling corrects positional attention collapse, yet no attention-map visualizations or quantitative comparisons isolate the effect of the monotonic scaling from other decoding modifications. Such evidence is required to confirm the misalignment mechanism is the operative factor.
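One quantitative comparison of the kind requested in the comments above could be the share of attention each generation token gives to visual tokens versus mask tokens, measured with and without the scaling. The helper below is a minimal sketch of that bookkeeping; the function name and the string labels for token types are hypothetical, and extracting attention rows from a particular model is left out.

```python
def attention_mass_by_type(attn_row, token_types):
    """Sum one query's attention weights per key-token type.

    attn_row: attention weights over all keys for a single query token.
    token_types: one label per key, e.g. "visual", "text", or "mask".
    """
    if len(attn_row) != len(token_types):
        raise ValueError("attn_row and token_types must have equal length")
    totals = {}
    for weight, kind in zip(attn_row, token_types):
        totals[kind] = totals.get(kind, 0.0) + weight
    return totals
```

Comparing the "visual" and "mask" totals across generation steps, once for the baseline and once with only the RoPE change applied, would isolate the positional effect from the other intervention.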

minor comments (2)

[Abstract] Abstract: the phrase 'general multimodal benchmarks' is used without naming the specific datasets or tasks; listing them would improve context and reproducibility.
Notation for hidden-state drift and RoPE scaling parameters should be introduced with explicit definitions and consistent symbols in the method section to aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the empirical support in our manuscript. We address each major comment below, outlining specific revisions that will provide the requested quantitative evidence, ablations, and visualizations while preserving the core contributions on mask prior drift and positional attention collapse.

Referee: [§4] §4 (Experiments): the manuscript states that experiments demonstrate improvements and 'robust gains on long-form description benchmarks' yet supplies no quantitative tables, per-metric scores, ablation breakdowns separating the two interventions, or error bars. Without these, the link between the proposed fixes and the claimed reductions in repetition or gains in visual grounding cannot be verified.

Authors: We agree that the current experiments section lacks sufficient quantitative detail to fully substantiate the claims. In the revised manuscript, we will add comprehensive tables reporting per-metric scores (e.g., repetition rate, visual grounding accuracy, and long-form description metrics) across multimodal benchmarks. Ablation breakdowns will separate the effects of each intervention, and error bars from multiple runs with different seeds will be included to demonstrate robustness. These additions will directly connect the interventions to reductions in repetition and improvements in visual grounding.

revision: yes

Referee: [§3.1–3.2] §3.1–3.2 (Method): the central claim that mask-token prior drift is the primary driver of repetitive generation is presented as an observation, but the results combine both interventions without an isolating ablation that applies Mask Prior Suppression alone while holding the unmasking schedule fixed. This leaves open the possibility that reported gains arise from incidental regularization or entropy changes rather than targeted prior suppression.

Authors: We concur that an isolating ablation is required to establish causality for mask prior drift. The revised paper will include a new ablation experiment applying only Mask Prior Suppression while keeping the original unmasking schedule fixed. This will report metrics on repetitive generation to isolate the effect and address potential confounds such as incidental regularization or entropy shifts, thereby strengthening the mechanistic link.

revision: yes

Referee: [§3.3] §3.3 (Monotonic RoPE Scaling): the paper asserts that the scaling corrects positional attention collapse, yet no attention-map visualizations or quantitative comparisons isolate the effect of the monotonic scaling from other decoding modifications. Such evidence is required to confirm the misalignment mechanism is the operative factor.

Authors: We acknowledge the need for targeted evidence on the positional mechanism. The revision will add attention-map visualizations showing positional attention patterns before and after Monotonic RoPE Scaling. We will also include quantitative comparisons of attention metrics and downstream generation quality when applying the scaling in isolation from other changes, confirming its role in correcting the misalignment with the iterative unmasking schedule.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation and external benchmarks

The paper identifies mask-token drift and positional misalignment through direct inspection of generation dynamics in baseline LDVLMs, then introduces two training-free interventions (Mask Prior Suppression and Monotonic RoPE Scaling) as plug-and-play corrections. No equations or claims reduce a prediction to a fitted input by construction, nor does any load-bearing premise rest on self-citation chains or imported uniqueness theorems. Experiments on multimodal and grounding benchmarks supply independent verification, keeping the central argument self-contained against external data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on the domain assumption that mask-token initialization and fixed positional biases are the dominant sources of the reported failures; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption: Generation tokens initialized as mask tokens cause their hidden representations to drift progressively toward a shared prior direction.
Stated directly as the origin of repetitive generation.

domain assumption: Positional attention bias is misaligned with the iterative unmasking process, suppressing attention to informative visual tokens.
Presented as the second root cause of degraded visual grounding.
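The first assumption can be probed with the diagnostic the figure captions describe: cosine similarity between a mask token's hidden state and the vocabulary mean embedding. The sketch below shows that measurement on plain Python lists; the function names are hypothetical, and it assumes hidden states and embeddings live in the same space, which is a simplification.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def prior_alignment(mask_hidden, vocab_embeddings):
    """Alignment of one mask-token hidden state with the vocabulary mean."""
    return cosine(mask_hidden, mean_vector(vocab_embeddings))
```

Tracking this value across generation steps, and against random embeddings as a control, is one way to check whether the drift the axiom asserts is actually present.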

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 11 internal anchors

[1]

Arif, H. et al. PAINT: Paying attention to INformed tokens to mitigate hallucination in large vision-language models. arXiv preprint arXiv:2501.12835.

[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.

[3]

Language models are few-shot learners

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[4]

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Chen, G. H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., and Wang, B. Allava: Harnessing gpt4v-synthesized data for lite vision-language models. arXiv preprint arXiv:2402.11684.

[5]

Dpad: Efficient diffusion language models with suffix dropout

Chen, X., Huang, S., Guo, C., Wei, C., He, Y., Zhang, J., Li, H., Chen, Y., et al. Dpad: Efficient diffusion language models with suffix dropout. arXiv preprint arXiv:2508.14148.

[6]

Benchmarking and improving detail image caption

Dong, H., Li, J., Wu, B., Wang, J., Zhang, Y., and Guo, H. Benchmarking and improving detail image caption. arXiv preprint arXiv:2405.19092, 2024.

[7]

Visualwebinstruct: Scaling up multimodal instruction data through web search

Jia, Y., Li, J., Yue, X., Li, B., Nie, P., Zou, K., and Chen, W. Visualwebinstruct: Scaling up multimodal instruction data through web search. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[8]

Referitgame: Referring to objects in photographs of natural scenes

Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798.

[9]

A comprehensive survey of accelerated generation techniques in large language models

Khoshnoodi, M., Jain, V., Gao, M., Srikanth, M., and Chadha, A. A comprehensive survey of accelerated generation techniques in large language models. arXiv preprint arXiv:2405.13019.

[10]

Hope: Hybrid of position embedding for length generalization in vision-language models

Li, H., Qin, Y., Ou, B., Xu, L., and Xu, R. Hope: Hybrid of position embedding for length generalization in vision-language models. arXiv preprint arXiv:2505.20444, 2025a.

Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Proce...

[11]

Lavida: A large diffusion model for vision-language understanding

Li, S., Kallidromitis, K., Bansal, H., Gokul, A., Kato, Y., Kozuka, K., Kuen, J., Lin, Z., Chang, K.-W., and Grover, A. Lavida: A large diffusion model for vision-language understanding. Advances in Neural Information Processing Systems, 2025b.

Li, T., Chen, M., Guo, B., and Shen, Z. A survey on diffusion language models. arXiv preprint arXiv:2508.1087...

[12]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.

Liu, X., Yan, H., An, C., Qiu, X., and Lin, D. Scaling laws of rope-based extrapolation. In International Conference on Learning Representations, volu...

[13]

Large Language Diffusion Models

Nie, S., Zhu, F., Du, C., Pang, T., Liu, Q., Zeng, G., Lin, M., and Li, C. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025a.

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992...

[14]

Instruction Tuning with GPT-4

Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.

[15]

Object hallucination in image captioning

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[16]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[17]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786.

[18]

Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models

Wang, C., Guo, J., Li, H., Tian, Y., Nie, Y., Xu, C., and Han, K. Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models. arXiv preprint arXiv:2505.16416, 2025a.

Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Yan, M., Zhang, J., and Sang, J. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallu...

[19]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

[20]

Diffusion llms can do faster-than-ar inference via discrete diffusion forcing

Wang, W., Yang, J., and Peng, W. Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. In The Thirteenth International Conference on Learning Representations (ICLR), 2025b.

Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. arXiv preprint arXiv:2508.09192.

[21]

Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding

Xin, Y., Qin, Q., Luo, S., Zhu, K., Yan, J., Tai, Y., Lei, J., Cao, Y., Wang, K., Wang, Y., et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025.

[22]

Dream 7B: Diffusion Large Language Models

Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487.

[23]

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

You, Z., Nie, S., Zhang, X., Hu, J., Zhou, J., Lu, Z., Wen, J.-R., and Li, C. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933.

[24]

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

URL https://arxiv.org/abs/2407.12772.

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

[25]

AMBER is an LLM-free hallucination benchmark covering both generative (AMBER-G) ...

Table 6. Evaluation Setup. Evaluation splits, inference steps, and generation length L for each benchmark.
Columns: Dataset, Split, Steps, L
MME, test, 2, 2
Ferret, test, 48, 96
Detai...

[26]

For autoregressive baselines, including LLaVA-One-Vision-7B, Qwen2.5-VL-7B, InternVL3-8B, and LLaVA-1.6, we use the default evaluation setups provided by the same framework. To ensure a rigorous and fair comparison, we evaluate models under identical random seeds whenever reported results are unavailable. Notably, for LaViDa, we conduct a re-evaluation ...

[27]

... introduces a piecewise frequency rescaling scheme that preserves high-frequency components while smoothly extrapolating to longer sequences. Subsequent methods further refine rotary scaling to enhance extrapolation stability and efficiency in LLMs (Ding et al., 2024). While these approaches are effective for extending context length under causal decoding,...

[28]

... and an LLM backbone based on LLaDA-8B or Dream-7B (Ye et al., 2025). In our experiments, we use LaViDa-L only, as it shares the same language backbone as LLaDA-V, enabling a fair comparison. LaViDa introduces a complementary masking strategy during training. Instead of learning from a single masked version of a response, two complementary masked variants...

[29]

For MMaDA (Yang et al., 2025), we setλ= 0.1 , β= 0.4 , k= 3 , η= 8.0 , and τ0 = 0.6

under a consistent evaluation protocol. For MMaDA (Yang et al., 2025), we setλ= 0.1 , β= 0.4 , k= 3 , η= 8.0 , and τ0 = 0.6. For Lumina-DiMOO (Xin et al., 2025), we set λ= 0.1 , β= 0.4 , k= 3 , η= 12.0 , and τ0 = 0.6. In both cases, our method consistently outperforms the corresponding baselines, as shown in Table

work page 2025

[1]

Arif, H. et al. PAINT: Paying attention to INformed tokens to mitigate hallucination in large vision-language models. arXiv preprint arXiv:2501.12835,

[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923,

[3]

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

[4]

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Chen, G. H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., and Wang, B. Allava: Harnessing gpt4v-synthesized data for lite vision-language models. arXiv preprint arXiv:2402.11684,

[5]

Dpad: Efficient diffusion language models with suffix dropout. arXiv preprint arXiv:2508.14148,

Chen, X., Huang, S., Guo, C., Wei, C., He, Y., Zhang, J., Li, H., Chen, Y., et al. Dpad: Efficient diffusion language models with suffix dropout. arXiv preprint arXiv:2508.14148,

[6]

Benchmarking and improving detail image caption. ArXiv, abs/2405.19092, 2024

Dong, H., Li, J., Wu, B., Wang, J., Zhang, Y., and Guo, H. Benchmarking and improving detail image caption. arXiv preprint arXiv:2405.19092,

[7]

Visualwebinstruct: Scaling up multimodal instruction data through web search

Jia, Y., Li, J., Yue, X., Li, B., Nie, P., Zou, K., and Chen, W. Visualwebinstruct: Scaling up multimodal instruction data through web search. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP),

[8]

Referitgame: Referring to objects in photographs of natural scenes

Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,

[9]

A comprehensive survey of accelerated generation techniques in large language models. arXiv preprint arXiv:2405.13019,

Khoshnoodi, M., Jain, V., Gao, M., Srikanth, M., and Chadha, A. A comprehensive survey of accelerated generation techniques in large language models. arXiv preprint arXiv:2405.13019,

[10]

Hope: Hybrid of position embedding for length generalization in vision-language models. arXiv preprint arXiv:2505.20444, 2025a

Li, H., Qin, Y., Ou, B., Xu, L., and Xu, R. Hope: Hybrid of position embedding for length generalization in vision-language models. arXiv preprint arXiv:2505.20444, 2025a. Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Proce...

[11]

Lavida: A large diffusion model for vision-language understanding. Advances in neural information processing systems, 2025b

Li, S., Kallidromitis, K., Bansal, H., Gokul, A., Kato, Y., Kozuka, K., Kuen, J., Lin, Z., Chang, K.-W., and Grover, A. Lavida: A large diffusion model for vision-language understanding. Advances in neural information processing systems, 2025b. Li, T., Chen, M., Guo, B., and Shen, Z. A survey on diffusion language models. arXiv preprint arXiv:2508.1087...

[12]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Liu, X., Yan, H., An, C., Qiu, X., and Lin, D. Scaling laws of rope-based extrapolation. In International Conference on Learning Representations, volu...

[13]

Large Language Diffusion Models

Nie, S., Zhu, F., Du, C., Pang, T., Liu, Q., Zeng, G., Lin, M., and Li, C. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025a. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992...

[14]

Instruction Tuning with GPT-4

Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277,

[15]

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP),

[16]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

[17]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786,

[18]

Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models. arXiv preprint arXiv:2505.16416, 2025a

Wang, C., Guo, J., Li, H., Tian, Y., Nie, Y., Xu, C., and Han, K. Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models. arXiv preprint arXiv:2505.16416, 2025a. Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Yan, M., Zhang, J., and Sang, J. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallu...

[19]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

[20]

Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. arXiv preprint arXiv:2508.09192,

Wang, W., Yang, J., and Peng, W. Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. In The Thirteenth International Conference on Learning Representations (ICLR), 2025b. Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing. arXiv preprint a...

[21]

Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

Xin, Y., Qin, Q., Luo, S., Zhu, K., Yan, J., Tai, Y., Lei, J., Cao, Y., Wang, K., Wang, Y., et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308,

[22]

Dream 7B: Diffusion Large Language Models

Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487,

[23]

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

You, Z., Nie, S., Zhang, X., Hu, J., Zhou, J., Lu, Z., Wen, J.-R., and Li, C. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933,

[24]

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

URL https://arxiv.org/abs/2407.12772. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479,

[25]

AMBER is an LLM-free hallucination benchmark covering both generative (AMBER-G)

Table 6. Evaluation Setup. Evaluation splits, inference steps, and generation length L for each benchmark.

Dataset Split Steps L
MME test 2 2
Ferret test 48 96
Detai...

[26]

For autoregressive baselines, including LLaVA-One-Vision-7B, Qwen2.5-VL-7B, InternVL3-8B, and LLaVA-1.6, we use the default evaluation setups provided by the same framework. To ensure a rigorous and fair comparison, we evaluate models under identical random seeds whenever reported results are unavailable. Notably, for LaViDa, we conduct a re-evaluation ...

[27]

introduces a piecewise frequency rescaling scheme that preserves high-frequency components while smoothly extrapolating to longer sequences. Subsequent methods further refine rotary scaling to enhance extrapolation stability and efficiency in LLMs (Ding et al., 2024). While these approaches are effective for extending context length under causal decoding,...

[28]

and an LLM backbone based on LLaDA-8B or Dream-7B (Ye et al., 2025). In our experiments, we use LaViDa-L only, as it shares the same language backbone as LLaDA-V, enabling a fair comparison. LaViDa introduces a complementary masking strategy during training. Instead of learning from a single masked version of a response, two complementary masked variants...

[29]

under a consistent evaluation protocol. For MMaDA (Yang et al., 2025), we set λ = 0.1, β = 0.4, k = 3, η = 8.0, and τ_0 = 0.6. For Lumina-DiMOO (Xin et al., 2025), we set λ = 0.1, β = 0.4, k = 3, η = 12.0, and τ_0 = 0.6. In both cases, our method consistently outperforms the corresponding baselines, as shown in Table