Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Aniruddha Kembhavi; Christopher Clark; Derek Hoiem; Jiasen Lu; Ryan Marten; Sangho Lee; Savya Khosla; Zichen Zhang

arxiv: 2312.17172 · v1 · pith:R3SLMFZPnew · submitted 2023-12-28 · 💻 cs.CV · cs.AI · cs.CL

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

classification 💻 cs.CV · cs.AI · cs.CL

keywords: model, multimodal, audio, understanding, action, unified-io, autoregressive, diverse

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs (images, text, audio, action, bounding boxes, etc.) into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
cs.CV 2024-06 conditional novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

Semantic Generative Tuning for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

Learning Spatial-Preserving Hierarchical Representations for Digital Pathology
cs.CV 2024-06 unverdicted novelty 6.0

SPAN is a hierarchical attention framework that constructs multi-scale pyramid representations from single-scale patch inputs for WSI classification and segmentation while preserving spatial relationships.

Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV 2024-04 unverdicted novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV 2024-04 accept novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

Semantic Generative Tuning for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 5.0

Semantic Generative Tuning applies segmentation-based generative proxies during post-training to align and improve both understanding and generation in unified multimodal models.

Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.