G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
Show-o: One single transformer to unify multimodal understanding and generation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 2years
2026 2roles
background 1polarities
background 1representative citing papers
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
citing papers explorer
-
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.