pith:57CUSM3P
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models.
arxiv:2605.14530 v1 · 2026-05-14 · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{57CUSM3PQTKGHJWAOPVAEB5VQ2}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Existing LDVLMs suffer from repetitive generation and degraded visual grounding, and the authors identify two underlying causes. First, repetitive generation originates from a mask token prior... Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process... The authors propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding.
These claims rest on the assumption that mask token prior drift and positional attention misalignment are the primary root causes of repetitive generation and degraded grounding, rather than downstream symptoms of other training or architectural issues, and that the proposed interventions address them without introducing new side effects.
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:39:05.952604Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/57CUSM3PQTKGHJWAOPVAEB5VQ2 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "89fa0445013d0a9ba6bc5594daeaa83bdb309235f76a22a03af3012acafa5375",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T08:11:32Z",
    "title_canon_sha256": "1ca0338dfc2d2f1de00a20a2edefcce6b256168827068424f426204c072fb48d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14530",
    "kind": "arxiv",
    "version": 1
  }
}
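The same check can be run offline against the record listed above. The following is a minimal sketch, assuming the served `canonical_record` is identical to that listing; the helper name `canonical_hash` is hypothetical and not part of any Pith tooling. It mirrors the serialization used in the shell one-liner: keys sorted, compact separators, non-ASCII characters left unescaped, UTF-8 encoded.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the compact, key-sorted JSON form of `record`.

    Hypothetical helper: it follows the serialization in the curl/jq/python
    one-liner, not an official Pith client.
    """
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" listing.
record = {
    "metadata": {
        "abstract_canon_sha256": "89fa0445013d0a9ba6bc5594daeaa83bdb309235f76a22a03af3012acafa5375",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-14T08:11:32Z",
        "title_canon_sha256": "1ca0338dfc2d2f1de00a20a2edefcce6b256168827068424f426204c072fb48d",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14530", "kind": "arxiv", "version": 1},
}

# Published canonical hash from this page.
EXPECTED = "efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510"

digest = canonical_hash(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted and whitespace is stripped before hashing, the indentation and key order of the listing do not affect the digest. A mismatch would suggest that the local copy differs from the record the hash was computed over.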