Announcing grok-1.5. https://x.ai/news/grok-1.5
1 Pith paper cites this work. Polarity classification is still indexing.
citation-role summary
dataset 1
citation-polarity summary
fields: cs.CV 1
years: 2026 1
verdicts: CONDITIONAL 1
roles: dataset 1
polarities: use dataset 1
representative citing papers
- G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.