pith:QL3I3ZIG
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
Generation-guided selection from the VAE latent cuts visual tokens by 1.94x in separate-encoder unified multimodal models while preserving both reasoning accuracy and editing quality.
arxiv:2605.12309 v2 · 2026-05-12 · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QL3I3ZIGEJMOY3P5YGHF7IFB7E}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Experiments on image understanding and editing benchmarks show that G²TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
Token importance estimated from consistency with the VAE latent provides a task-agnostic signal that preserves the model's editing and generation capabilities without degradation, even though selection is performed only on the understanding-side tokens.
G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
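To make the 1.94x figure concrete, the short sketch below converts it into a kept-token fraction. It assumes the ratio is read as (tokens before) / (tokens after); the 1024-token input and the `kept_tokens` helper are hypothetical illustrations, not values or code from the paper.

```python
def kept_tokens(original_tokens: int, reduction_ratio: float = 1.94) -> int:
    """Approximate token count after reduction.

    Assumption: the ratio means (tokens before) / (tokens after).
    """
    return round(original_tokens / reduction_ratio)


# Fraction of visual tokens that remain under a 1.94x reduction.
kept_fraction = 1 / 1.94
print(f"kept fraction: {kept_fraction:.3f}")  # roughly half the tokens remain

# Hypothetical example: an image encoded as 1024 visual tokens.
print(kept_tokens(1024))
```

Under this reading, a 1.94x reduction keeps a little over half of the visual tokens.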
Receipt and verification
| First computed | 2026-05-20T00:00:43.036862Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QL3I3ZIGEJMOY3P5YGHF7IFB7E \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7425937d2c1e0b971a6502239a26ff698ee0bcd215c195eb1e6a98d6679ddda8",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-12T15:56:22Z",
    "title_canon_sha256": "39a769729639f87ff4c5387af0d41e3bc3768ae72fd97f8ee86b78d025b233e2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12309",
    "kind": "arxiv",
    "version": 2
  }
}
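The same check can be run offline against the record printed above. The following is a minimal sketch that mirrors the canonicalization in the one-liner (sorted keys, compact separators, UTF-8 encoding, SHA-256); the function name `canonical_sha256` is an illustrative choice, and whether the digest matches depends on the served record being identical to the JSON shown on this page, which is not verified here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7425937d2c1e0b971a6502239a26ff698ee0bcd215c195eb1e6a98d6679ddda8",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-12T15:56:22Z",
        "title_canon_sha256": "39a769729639f87ff4c5387af0d41e3bc3768ae72fd97f8ee86b78d025b233e2",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12309", "kind": "arxiv", "version": 2},
}

# Expected canonical hash as listed on this page.
EXPECTED = "82f68de5062258ec6dfdc18e5fa0a1f922d2716418558126e73a636cd567f0e9"

digest = canonical_sha256(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the source JSON.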