MMAR: towards lossless multi-modal auto-regressive probabilistic modeling
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
years: 2025 (3)
verdicts: UNVERDICTED (3)
roles: baseline (1)
polarities: baseline (1)
representative citing papers
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
citing papers explorer
- MARRS: Masked Autoregressive Unit-based Reaction Synthesis
  MARRS synthesizes fine-grained reaction motions via unit-distinguished VAE, masked action-conditioned fusion, mutual unit modulation, and compact MLP diffusion predictors.
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
  Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.