MMAR: towards lossless multi-modal auto-regressive probabilistic modeling

· 2024 · arXiv 2410.10798

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

MARRS: Masked Autoregressive Unit-based Reaction Synthesis

cs.CV · 2025-05-16 · unverdicted · novelty 6.0

MARRS synthesizes fine-grained reaction motions via unit-distinguished VAE, masked action-conditioned fusion, mutual unit modulation, and compact MLP diffusion predictors.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

citing papers explorer

Showing 3 of 3 citing papers.

MARRS: Masked Autoregressive Unit-based Reaction Synthesis cs.CV · 2025-05-16 · unverdicted · none · ref 38
MARRS synthesizes fine-grained reaction motions via unit-distinguished VAE, masked action-conditioned fusion, mutual unit modulation, and compact MLP diffusion predictors.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 90
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 134
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

MMAR: towards lossless multi-modal auto-regressive probabilistic modeling

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer