Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Improved baselines with visual instruction tuning
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
baseline 1
citation-polarity summary
fields
cs.CV 2representative citing papers
citing papers explorer
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs