Pith Number

pith:J2A5EGLF

pith:2026:J2A5EGLFRJRKYDAACPYE6ABCTM
not attested · not anchored · not stored · refs resolved

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Dexin Wang, Guanjun Jiang, Hao Li, Lei Lv, Li Wang, Mengyu Zhou, Pascal Poupart, Qi Zhao, Xiaoxi Jiang, Yanting Miao, Yutao Sun

Granular alignment at three levels lets MLLMs generate stable visual latents by fixing decoder-to-input mismatch.

arxiv:2605.12374 v2 · 2026-05-12 · cs.CV · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{J2A5EGLFRJRKYDAACPYE6ABCTM}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

C2 · weakest assumption

The feature-space mismatch between decoder hidden states and input embeddings in pre-norm MLLMs is a primary contributor to instability in existing output-as-input visual-latent methods.

C3 · one-line summary

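GAP introduces three-level alignment for visual latent reasoning in MLLMs. On Qwen2.5-VL 7B it achieves the best mean aggregate perception and reasoning performance among the authors' supervised variants by addressing the feature-space mismatch between decoder hidden states and input embeddings.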

References

30 extracted · 30 resolved · 16 Pith anchors

[1] Imagine while Reasoning in Space: Multimodal Visualization-of-Thought · arXiv:2501.07542
[2] arXiv:2510.24514
[3] Gemma 3 Technical Report · arXiv:2503.19786
[4] OpenAI GPT-5 System Card · arXiv:2601.03267
[5] Qwen3-VL Technical Report · 2025

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:00:43.191487Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

4e81d219658a62ac0c0013f04f00229b27cb8e84219c055bab85f053938dad0f

Aliases

arxiv: 2605.12374 · arxiv_version: 2605.12374v2 · doi: 10.48550/arxiv.2605.12374 · pith_short_12: J2A5EGLFRJRK · pith_short_16: J2A5EGLFRJRKYDAA · pith_short_8: J2A5EGLF
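The short aliases above are prefixes of the full 26-character Pith Number. Judging only from the values on this page, the Pith Number itself also appears to be the unpadded base32 encoding of the first 16 bytes of the canonical hash. That relationship is inferred from this record, not stated by it, so the sketch below is an assumption; the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "4e81d219658a62ac0c0013f04f00229b27cb8e84219c055bab85f053938dad0f"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed derivation: RFC 4648 base32 of the first 16 digest bytes, '=' padding stripped."""
    first_16 = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first_16).decode("ascii").rstrip("=")


pith = pith_number_from_hash(CANONICAL_HASH)

# The pith_short_N aliases are the first N characters of the Pith Number.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}

print(pith)     # J2A5EGLFRJRKYDAACPYE6ABCTM
print(aliases)
```

For this record the derived string matches the Pith Number shown at the top of the page, which supports the assumption but does not confirm it for other records.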
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/J2A5EGLFRJRKYDAACPYE6ABCTM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4e81d219658a62ac0c0013f04f00229b27cb8e84219c055bab85f053938dad0f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "10d2f7dd43887f0a044320abd6aca13826cc64a402c84c55fb30a8ee5b52eaa4",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-12T16:41:09Z",
    "title_canon_sha256": "078b5893e734b337f8900d18e5057ee95382897a63ea6c62ef0eb98bba77b3e1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12374",
    "kind": "arxiv",
    "version": 2
  }
}
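The verification one-liner can also be run offline against the record printed above. This is a minimal sketch that mirrors the serialization used in that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8); the helper name `canonical_sha256` is hypothetical, and whether the digest equals the published canonical hash should be confirmed by running it.

```python
import hashlib
import json

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "10d2f7dd43887f0a044320abd6aca13826cc64a402c84c55fb30a8ee5b52eaa4",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-12T16:41:09Z",
        "title_canon_sha256": "078b5893e734b337f8900d18e5057ee95382897a63ea6c62ef0eb98bba77b3e1",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12374", "kind": "arxiv", "version": 2},
}

# Hash the page lists as the expected result.
EXPECTED = "4e81d219658a62ac0c0013f04f00229b27cb8e84219c055bab85f053938dad0f"


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and no whitespace, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which the record's fields are stored or transmitted does not affect the digest.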