{"paper":{"title":"Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chanyong Yoon, Seongjae Hwang, Sujung Hong","submitted_at":"2026-05-14T08:11:32Z","abstract_excerpt":"Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior... Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process... 
propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the mask token prior drift and positional attention misalignment are the primary root causes of repetitive generation and degraded grounding rather than downstream symptoms of other training or architectural issues, and that the proposed interventions address them without new side effects.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8436460820c60f209cf615fceae97903cc582e5ccd4e7a7069dec76f1b91d97e"},"source":{"id":"2605.14530","kind":"arxiv","version":1},"verdict":{"id":"e73b8109-fee3-41ba-9921-4eb85e20281d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:12:13.889671Z","strongest_claim":"existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior... Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process... 
propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding.","one_line_summary":"Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the mask token prior drift and positional attention misalignment are the primary root causes of repetitive generation and degraded grounding rather than downstream symptoms of other training or architectural issues, and that the proposed interventions address them without new side effects.","pith_extraction_headline":"Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models."},"references":{"count":29,"sample":[{"doi":"","year":null,"title":"Adaptive retrieval without self-knowledge? 
bringing uncertainty back home. arXiv preprint arXiv:2501.12835","work_id":"2e6e3a4a-1c76-4443-9511-9735ea369815","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":2,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":1901,"title":"D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al","work_id":"9806adeb-7378-4bee-a184-3e98c89988dd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Allava: Harnessing gpt4v-synthesized data for a lite vision-language model","work_id":"4cf5f4e3-c59a-4ccb-a655-563157a9ce74","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Dpad: Efficient diffusion language models with suffix dropout","work_id":"3f0e3292-b812-4d35-9383-8e3959725c6b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":29,"snapshot_sha256":"560f17501a214f9ea47e32a80b7f1e8cf3acf6054c4fc99d0602379a5c91a36c","internal_anchors":9},"formal_canon":{"evidence_count":2,"snapshot_sha256":"11d8405077e15870fe67976d7fde351b9542b6b62d645686512c15ed19343ddc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}