Pith Number

pith:QL3I3ZIG

pith:2026:QL3I3ZIGEJMOY3P5YGHF7IFB7E
not attested · not anchored · not stored · refs resolved

G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Junxian Li, Kai Liu, Renjing Pei, Yulun Zhang, Zhikai Chen, Zhixin Wang, Zizhong Ding

Generation-guided selection from the VAE latent cuts visual tokens by 1.94x in separate-encoder unified multimodal models while preserving both reasoning accuracy and editing quality.

arxiv:2605.12309 v2 · 2026-05-12 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QL3I3ZIGEJMOY3P5YGHF7IFB7E}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
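
The page does not specify the merge algorithm, so the following is only a minimal sketch of what a deterministic merge could look like. The event shape (`id`, `ts`, `field`, `value`) and the function name `merge_state` are hypothetical, and the sketch assumes each `id` uniquely identifies one event's content. The point it illustrates is that deduplicating by id and replaying in a fixed sort order yields the same state no matter which mirror supplied which events, or in what order.

```python
def merge_state(*event_sets):
    """Fold events from any number of mirrors into one state.

    Hypothetical event shape (not taken from the Pith schema):
    {"id": str, "ts": str (ISO 8601, UTC), "field": str, "value": object}
    """
    # Deduplicate by id; assumes an id always maps to the same event content.
    unique = {}
    for events in event_sets:
        for ev in events:
            unique[ev["id"]] = ev

    # Replay in a total order that does not depend on arrival order.
    state = {}
    for ev in sorted(unique.values(), key=lambda e: (e["ts"], e["id"])):
        state[ev["field"]] = ev["value"]
    return state


mirror_a = [
    {"id": "e1", "ts": "2026-05-20T00:00:43Z", "field": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-21T09:00:00Z", "field": "author_claim", "value": "open"},
]
mirror_b = [
    {"id": "e2", "ts": "2026-05-21T09:00:00Z", "field": "author_claim", "value": "open"},
    {"id": "e3", "ts": "2026-05-22T12:00:00Z", "field": "citations", "value": "closed"},
]

# Argument order does not change the result.
print(merge_state(mirror_a, mirror_b) == merge_state(mirror_b, mirror_a))
```

Sorting on `(ts, id)` rather than `ts` alone is what keeps the order total when two events share a timestamp.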

Claims

C1 strongest claim

Experiments on image understanding and editing benchmarks show that G²TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.

C2 weakest assumption

The method assumes that token importance, estimated from consistency with the VAE latent, is a task-agnostic signal that preserves the model's editing and generation capabilities without degradation, even though selection is performed only on the understanding-side tokens.

C3 one line summary

G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
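
The three stages named in the summary (importance scoring, balanced selection, merging) can be sketched roughly as follows. This is not the paper's implementation: the record does not give the formulas, so the sketch assumes cosine similarity as the "consistency" score, equal per-group quotas as the "balanced" rule, and mean-pooling of each dropped token into its most similar kept token as the "merge". The function name `reduce_tokens` and its parameters are illustrative, and the VAE latent is assumed to be already resampled and projected to the same shape as the visual tokens.

```python
import numpy as np


def reduce_tokens(vis, latent, reduction=1.94, groups=4):
    """Toy token reduction. vis, latent: arrays of shape (N, D).

    Assumption: `latent` is aligned token-for-token with `vis`.
    Returns (merged_tokens, kept_indices).
    """
    n = vis.shape[0]
    keep_total = max(groups, round(n / reduction))
    quota = keep_total // groups  # equal share per group

    # 1) Importance (assumed): cosine similarity with the aligned latent.
    num = (vis * latent).sum(axis=1)
    den = np.linalg.norm(vis, axis=1) * np.linalg.norm(latent, axis=1) + 1e-8
    score = num / den

    # 2) Balanced selection (assumed): top `quota` tokens in each
    #    contiguous group, so no region is emptied entirely.
    keep = []
    for idx in np.array_split(np.arange(n), groups):
        top = idx[np.argsort(score[idx])[::-1][:quota]]
        keep.extend(top.tolist())
    keep = np.array(sorted(keep))

    # 3) Merging (assumed): average each dropped token into the kept
    #    token it is most similar to.
    dropped = np.setdiff1d(np.arange(n), keep)
    merged = vis[keep].copy()
    counts = np.ones(len(keep))
    if len(dropped):
        target = (vis[dropped] @ vis[keep].T).argmax(axis=1)
        np.add.at(merged, target, vis[dropped])
        np.add.at(counts, target, 1)
    return merged / counts[:, None], keep


rng = np.random.default_rng(0)
vis = rng.normal(size=(64, 8))
latent = rng.normal(size=(64, 8))
merged, keep = reduce_tokens(vis, latent)
print(merged.shape)  # (32, 8)
```

With 64 input tokens the target of 64 / 1.94 ≈ 33 is rounded down to 8 per group, so this toy run keeps 32 tokens.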

References

44 extracted · 44 resolved · 11 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774
[2] Improving image generation with better captions 2023
[3] An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models 2024
[4] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling 2025 · arXiv:2501.17811
[5] Diffusion models in vision: A survey. TPAMI 2023

Receipt and verification
First computed 2026-05-20T00:00:43.036862Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9

Aliases

arxiv: 2605.12309 · arxiv_version: 2605.12309v2 · doi: 10.48550/arxiv.2605.12309 · pith_short_12: QL3I3ZIGEJMO · pith_short_16: QL3I3ZIGEJMOY3P5 · pith_short_8: QL3I3ZIG
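
The identifiers shown on this page are consistent with a simple derivation: the full Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and the `pith_short_*` aliases are prefixes of it. The schema is not quoted here, so treat this as an observation about these specific values rather than a statement of the specification; the function name `pith_number` is illustrative.

```python
import base64

CANONICAL_HASH = "82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9"


def pith_number(canonical_hash_hex: str) -> str:
    """Base32-encode the first 16 bytes of the hash and drop '=' padding."""
    digest = bytes.fromhex(canonical_hash_hex)
    return base64.b32encode(digest[:16]).decode("ascii").rstrip("=")


full = pith_number(CANONICAL_HASH)
print(full)       # QL3I3ZIGEJMOY3P5YGHF7IFB7E
print(full[:16])  # QL3I3ZIGEJMOY3P5
print(full[:12])  # QL3I3ZIGEJMO
print(full[:8])   # QL3I3ZIG
```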
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QL3I3ZIGEJMOY3P5YGHF7IFB7E \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9
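
The Python step of the pipeline above can be written out as a small function, which makes the canonicalization rules easier to see: keys sorted, no whitespace between separators, non-ASCII characters left unescaped, UTF-8 encoding, then SHA-256. The function name `canonical_sha256` is illustrative; the serialization arguments are the same ones used in the one-liner.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the one-liner above."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no spaces after ',' or ':'
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two dicts with the same content but different key order hash identically.
a = {"source": {"id": "2605.12309", "kind": "arxiv", "version": 2}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"version": 2, "kind": "arxiv", "id": "2605.12309"}}
print(canonical_sha256(a) == canonical_sha256(b))  # True
```

To verify a record, pass the parsed `canonical_record` object to the function and compare the result with the canonical hash listed on this page.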
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7425937d2c1e0b971a6502239a26ff698ee0bcd215c195eb1e6a98d6679ddda8",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-12T15:56:22Z",
    "title_canon_sha256": "39a769729639f87ff4c5387af0d41e3bc3768ae72fd97f8ee86b78d025b233e2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12309",
    "kind": "arxiv",
    "version": 2
  }
}