Pith Number

pith:57CUSM3P

pith:2026:57CUSM3PQTKGHJWAOPVAEB5VQ2
not attested · not anchored · not stored · refs resolved

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Chanyong Yoon, Seongjae Hwang, Sujung Hong

Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models.

arxiv:2605.14530 v1 · 2026-05-14 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{57CUSM3PQTKGHJWAOPVAEB5VQ2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The authors find that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior... Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process... They propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding.

C2 · weakest assumption

That the mask token prior drift and positional attention misalignment are the primary root causes of repetitive generation and degraded grounding rather than downstream symptoms of other training or architectural issues, and that the proposed interventions address them without new side effects.

C3 · one line summary

Mask prior drift and positional attention collapse cause LDVLMs to fail on long generations; the training-free Mask Prior Suppression and Monotonic RoPE Scaling mitigate both.

References

29 extracted · 29 resolved · 9 Pith anchors

[1] Adaptive retrieval without self-knowledge? bringing uncertainty back home. arXiv preprint arXiv:2501.12835
[2] Qwen2.5-VL Technical Report · arXiv:2502.13923
[3] D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al 1901
[4] Allava: Harnessing gpt4v-synthesized data for a lite vision-language model
[5] Dpad: Efficient diffusion language models with suffix dropout

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-17T23:39:05.952604Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510

Aliases

arxiv: 2605.14530 · arxiv_version: 2605.14530v1 · doi: 10.48550/arxiv.2605.14530 · pith_short_12: 57CUSM3PQTKG · pith_short_16: 57CUSM3PQTKGHJWA · pith_short_8: 57CUSM3P
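For this record, the identifiers above appear to be related to the canonical hash by plain Base32: encoding the 32-byte SHA-256 digest with the RFC 4648 alphabet and keeping the first 26 characters reproduces the Pith Number, and the pith_short_* aliases are its 8-, 12-, and 16-character prefixes. The sketch below checks that relationship with Python's standard library. It is an observation about this one record, not a documented derivation rule, and the helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied verbatim from this record.
CANONICAL_HASH = "efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510"
PITH_NUMBER = "57CUSM3PQTKGHJWAOPVAEB5VQ2"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Hypothetical helper: first `length` characters of the RFC 4648
    Base32 encoding of a hex digest (assumed relationship, see lead-in)."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


if __name__ == "__main__":
    # Compare the 26-character prefix with the full Pith Number.
    print(base32_prefix(CANONICAL_HASH, 26) == PITH_NUMBER)
    # Print the prefixes that correspond to the short aliases.
    for n in (8, 12, 16):
        print(n, base32_prefix(CANONICAL_HASH, n))
```

If the relationship holds for other records, a short alias can be checked against a canonical hash without contacting the server. That generalization is untested here.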
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/57CUSM3PQTKGHJWAOPVAEB5VQ2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "89fa0445013d0a9ba6bc5594daeaa83bdb309235f76a22a03af3012acafa5375",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T08:11:32Z",
    "title_canon_sha256": "1ca0338dfc2d2f1de00a20a2edefcce6b256168827068424f426204c072fb48d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14530",
    "kind": "arxiv",
    "version": 1
  }
}
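The shell one-liner above can also be reproduced offline against the canonical record JSON shown here. This is a minimal Python sketch, assuming the record printed on this page is exactly what the API returns under `canonical_record`. The function name `canonical_sha256` is hypothetical, and the canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) mirrors the one-liner rather than any separate specification.

```python
import hashlib
import json

# Expected digest as stated on this page.
EXPECTED = "efc549336f84d463a6c073ea0207b58693bb0e88b10b8e8e0e599412ce8bb510"

# Canonical record copied from this page (assumed identical to the API response).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "89fa0445013d0a9ba6bc5594daeaa83bdb309235f76a22a03af3012acafa5375",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T08:11:32Z",
    "title_canon_sha256": "1ca0338dfc2d2f1de00a20a2edefcce6b256168827068424f426204c072fb48d"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.14530", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the shell one-liner does:
    sorted keys, no whitespace, UTF-8 bytes, then SHA-256."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(json.loads(RECORD_JSON))
    print(digest)
    print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, so a mirror that stores the record in a different layout can still recompute the same value.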