pith. machine review for the scientific record.

arxiv: 2412.14164 · v1 · submitted 2024-12-18 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal instruction tuning · visual generation · visual understanding · autoregressive models · LLM adaptation · unified multimodal models · MetaMorph · VPiT

The pith

Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Visual-Predictive Instruction Tuning as a way to extend standard visual instruction tuning so that a pretrained LLM learns to output both text tokens and continuous visual tokens from mixed image-text sequences formatted as instructions. It reports that generation capability can be unlocked efficiently once understanding is strengthened, and that data aimed at understanding improves both skills more than data aimed at generation. The resulting MetaMorph model reaches competitive results on understanding benchmarks and on generation tasks by drawing on the LLM's existing world knowledge to avoid failure modes typical of other image generators. These outcomes suggest that the same tuning process can adapt LLMs to handle both directions without separate model branches.

Core claim

Training an LLM to predict discrete text tokens and continuous visual tokens from any instruction-formatted sequence of images and text causes visual generation to appear as a side effect of stronger visual understanding, with understanding data proving more useful for both abilities than generation data; the resulting unified autoregressive model performs competitively on both tasks and draws on pretraining knowledge to reduce common generation errors.

What carries the argument

Visual-Predictive Instruction Tuning (VPiT), which formats multimodal data as instructions and trains the model to autoregressively predict the next text or visual token.
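
A minimal sketch of what a VPiT-style training step could look like, assuming PyTorch, a decoder-only LLM exposing per-position hidden states, text targets as token ids, and continuous visual targets taken from a frozen vision encoder. The names (vpit_loss, text_head, vis_head) and the MSE regression term with weight lambda_vis are illustrative assumptions, not the paper's published implementation.

    import torch
    import torch.nn.functional as F

    def vpit_loss(hidden, text_head, vis_head, text_targets, vis_targets,
                  vis_mask, lambda_vis=1.0):
        # hidden: (B, T, d) next-token-aligned LLM hidden states.
        # text_targets: (B, T) token ids, set to -100 at non-text positions.
        # vis_targets: (B, T, d_vis) frozen-encoder embeddings; vis_mask is a
        # (B, T) boolean marking positions that should emit a visual token.
        logits = text_head(hidden)  # (B, T, vocab)
        loss_text = F.cross_entropy(
            logits.flatten(0, 1), text_targets.flatten(), ignore_index=-100)
        pred_vis = vis_head(hidden)  # (B, T, d_vis)
        loss_vis = F.mse_loss(pred_vis[vis_mask], vis_targets[vis_mask])
        # A single autoregressive objective over mixed text and visual positions.
        return loss_text + lambda_vis * loss_vis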

If this is right

  • Understanding data improves both understanding and generation more effectively than generation data.
  • A relatively small amount of generation data is enough to unlock usable visual output once understanding has advanced.
  • The model can use world knowledge and reasoning from LLM pretraining to avoid common failure modes in image generation.
  • A single autoregressive architecture can handle both visual understanding and generation after this tuning process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method points toward simpler multimodal systems that avoid maintaining separate encoders or generators for each modality.
  • Similar emergence patterns might appear if the same tuning approach is applied to additional modalities such as audio or video.
  • The efficiency observed suggests that scaling the instruction data mixture could further reduce the amount of generation data required.

Load-bearing premise

The specific curated instruction-following multimodal datasets used are sufficient to reveal a general emergence of generation from understanding that transfers beyond the tested models and data mixtures.

What would settle it

Train the same base LLM with only understanding data and measure whether generation quality remains near baseline levels or improves substantially without any generation-specific examples.
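
A sketch of how that experiment could be wired up. Everything below is hypothetical scaffolding: train_model and eval_generation stand in for a real training run and a generation benchmark harness (FID, CLIPScore, or similar); neither name comes from the paper.

    def emergence_test(base_llm, understanding_data, generation_data,
                       train_model, eval_generation):
        # Arm A: understanding data only, zero generation-specific examples.
        model_a = train_model(base_llm, understanding_data)
        # Arm B: the same data plus the usual small generation slice,
        # ideally token-matched to Arm A for a fair comparison.
        model_b = train_model(base_llm, understanding_data + generation_data)
        scores = {
            "understanding_only": eval_generation(model_a),
            "with_generation_data": eval_generation(model_b),
        }
        # Arm A scoring near Arm B (or well above an untuned baseline) would
        # support emergence; Arm A near chance would mean the small
        # generation slice is doing the real work.
        return scores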

read the original abstract

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Visual-Predictive Instruction Tuning (VPiT), a simple extension to visual instruction tuning that trains a pretrained LLM to autoregressively predict both discrete text tokens and continuous visual tokens from instruction-following image-text sequences. The central empirical claims are that visual generation emerges as a natural byproduct of improved visual understanding (and can be unlocked with only a small amount of generation data) and that understanding data contributes more effectively to both capabilities than generation data. The authors train the MetaMorph model on this basis and report competitive performance on visual understanding and generation benchmarks, attributing success to LLMs' strong prior vision capabilities.

Significance. If the empirical claims hold after proper controls, the work would demonstrate an efficient route to unified multimodal autoregressive models that leverage existing LLM pretraining for both understanding and generation, potentially reducing the data and compute needed for high-quality visual synthesis while improving robustness via world knowledge.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and associated ablations: the claim that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' requires controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without these, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or the specific continuous visual token prediction setup. A token-matched mixture construction is sketched after these comments.
  2. [§3 (VPiT Method)] §3 (VPiT Method): the description of the continuous visual token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or other tokenizer) or the exact regression loss used, which is load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.
minor comments (2)
  1. [Abstract] Abstract and §1: the statement of 'competitive performance' should include the specific benchmarks, metrics, and baseline models for immediate context.
  2. [§4 (Experiments)] Figure captions and tables: several result tables lack error bars or run-to-run variance, making it difficult to assess the reliability of the reported gains from adding small generation data.
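
A minimal sketch of the control requested in major comment 1: build data mixtures that vary the understanding-to-generation ratio while spending the same total token budget. The dataset interface and the count_tokens helper are assumptions for illustration, not anything specified in the paper or this report.

    import random

    def token_matched_mixture(understanding, generation, ratio, budget,
                              count_tokens, seed=0):
        # ratio: fraction of the token budget drawn from understanding data.
        rng = random.Random(seed)
        pools = [(list(understanding), ratio * budget),
                 (list(generation), (1.0 - ratio) * budget)]
        mixture, used = [], 0
        for pool, target in pools:
            rng.shuffle(pool)
            spent = 0
            for example in pool:
                n = count_tokens(example)
                if spent + n > target:
                    break
                mixture.append(example)
                spent += n
            used += spent
        rng.shuffle(mixture)
        # `used` stays close to `budget` at every ratio, so runs that vary
        # only `ratio` are comparable at a fixed training-token count.
        return mixture, used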

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments) and associated ablations: the claim that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' requires controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without these, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or the specific continuous visual token prediction setup.

    Authors: We agree that holding total training tokens fixed while varying the data mixture provides a stronger test of the relative value of understanding versus generation data. In the revised manuscript we have added a new set of controlled ablations in §4.3 that keep the total number of training tokens constant across different understanding-to-generation ratios. These experiments show that mixtures with a higher proportion of understanding data yield better results on both understanding and generation benchmarks than equivalent-token mixtures dominated by generation data, consistent with our original observations. We have updated the text and figures to present these controls explicitly. revision: yes

  2. Referee: [§3 (VPiT Method)] §3 (VPiT Method): the description of the continuous visual token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or other tokenizer) or the exact regression loss used, which is load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.

    Authors: We thank the referee for highlighting this omission. The visual tokens are produced by a frozen pretrained VQ-VAE encoder that maps each image patch to a continuous embedding; these embeddings serve as regression targets. The prediction head is a single linear layer on top of the LLM that outputs vectors of the same dimensionality, and training uses mean-squared-error loss between the predicted and target embeddings. We have expanded the method section (§3.2) with these details, a precise loss equation, and a short pseudocode snippet; a sketch in the same spirit follows below. The head remains lightweight, supporting the claim that generation capability emerges with minimal additional parameters once the LLM has been instruction-tuned on understanding data. revision: yes
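
Taking the simulated rebuttal at its word, the head it describes reduces to a single linear projection trained with mean-squared error against frozen-tokenizer embeddings. The sketch below follows that description only; it is not a verified detail of the paper, and the class and function names are invented for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualPredictionHead(nn.Module):
        # Per the rebuttal's claim, the only new parameters on top of the LLM.
        def __init__(self, d_model: int, d_visual: int):
            super().__init__()
            self.proj = nn.Linear(d_model, d_visual)

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            return self.proj(hidden)  # (B, T, d_visual)

    def visual_regression_loss(head, hidden, target_embeddings, vis_mask):
        # Targets come from the frozen pretrained visual tokenizer; the loss
        # is applied only at positions that should emit visual tokens.
        pred = head(hidden)
        return F.mse_loss(pred[vis_mask], target_embeddings[vis_mask])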

Circularity Check

0 steps flagged

No circularity: empirical observations from training runs, not tautological derivation

full rationale

The paper reports empirical results from applying Visual-Predictive Instruction Tuning to pretrained LLMs on curated instruction-following multimodal datasets. Claims that generation emerges as a byproduct of understanding rest on observed performance metrics after joint training, not on any closed mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central result to its own inputs by construction. No equations or uniqueness theorems are invoked that collapse to prior outputs; the work is grounded in external benchmarks through reported training runs and evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pretrained LLMs already encode useful visual priors that instruction tuning can surface; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Pretrained LLMs possess strong prior vision capabilities that can be efficiently adapted to both understanding and generation via instruction tuning.
    Explicitly stated in the final sentence of the abstract.

pith-pipeline@v0.9.0 · 5534 in / 1203 out tokens · 59390 ms · 2026-05-17T07:46:11.714221+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 8.0

    VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

  2. Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.

  3. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  4. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  5. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  6. PhotoFramer: Multi-modal Image Composition Instruction

    cs.CV 2025-11 conditional novelty 6.0

    PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.

  7. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    cs.LG 2025-05 conditional novelty 6.0

    LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

  8. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  9. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  10. WorldVLA: Towards Autoregressive Action World Model

    cs.RO 2025-06 unverdicted novelty 5.0

    WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

  11. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  12. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  13. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  14. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  15. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  16. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  17. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

282 extracted references · 282 canonical work pages · cited by 17 Pith papers · 29 internal anchors

  1. [3]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  2. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  3. [5]

    ICML 2024 Tutorial: Physics of Language Models, 2024

    Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, 2024. Project page: https://physics.allen-zhu.com/

  4. [6]

    Claude, 2024

    Anthropic. Claude, 2024

  5. [7]

    Layer normalization

    Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016

  6. [8]

    Revisiting feature prediction for learning visual representations from video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In TMLR, 2024

  7. [9]

    High fidelity visualization of what your self-supervised representation knows about

    Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. In TMLR, 2022

  8. [10]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023

  9. [12]

    Sharegpt4video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS, 2024 a

  10. [14]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024

  11. [15]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In ICLR, 2024

  12. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  13. [17]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024

  14. [20]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017 a

  15. [21]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017 b

  16. [23]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021

  17. [24]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  18. [25]

    The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024

  19. [26]

    Brave: Broadening the visual encoding of vision-language models

    Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2025

  20. [27]

    Generating images with multimodal language models

    Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. In NeurIPS, 2024

  21. [28]

    Learning action and reasoning-centric image editing from videos and simulations

    Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulations. In NeurIPS, 2024

  22. [29]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024 a

  23. [31]

    A path towards autonomous machine intelligence version 0.9.2

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1--62, 2022

  24. [33]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024 b

  25. [34]

    Return of unconditional generation: A self-supervised representation generation method

    Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. In NeurIPS, 2024 c

  26. [35]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  27. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  28. [37]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024 a

  29. [38]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b

  30. [39]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024 c

  31. [40]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024 d

  32. [41]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  33. [42]

    Unified-io: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022 a

  34. [43]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024

  35. [44]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022 b

  36. [45]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022

  37. [47]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019

  38. [48]

    Autonomous evaluation and refinement of digital agents

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. In COLM, 2024 a

  39. [49]

    Kosmos-g: Generating images in context with multimodal large language models

    Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. In ICLR, 2024 b

  40. [50]

    Diffusion autoencoders: Toward a meaningful and decodable representation

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In CVPR, 2022

  41. [51]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  42. [52]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020

  43. [53]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2019

  44. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  45. [55]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022

  46. [56]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In NeurIPS, 2024

  47. [57]

    Textcaps: a dataset for image captioning with reading comprehension, 2020

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020

  48. [58]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024 a

  49. [59]

    Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. In ICLR, 2024 b

  50. [60]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model, 2023

  51. [62]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024 a

  52. [63]

    Mass-producing failures of multimodal systems with language models

    Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2024 b

  53. [64]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024 c

  54. [65]

    LLaMA 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. 2023

  55. [68]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022 a

  56. [69]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022 b

  57. [71]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024

  58. [73]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  59. [74]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024

  60. [75]

    Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

    Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. ArXiv e-prints, abs/2407.20311, 2024. Full version available at http://arxiv.org/abs/2407.20311

  61. [76]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024 a

  62. [78]

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022

  63. [79]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

  64. [80]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In NeurIPS, 2024

  65. [82]

    Pre-trained language models do not help auto-regressive text-to-image generation

    Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023

  66. [83]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In NeurIPS, 2024 a

  67. [85]

    Video-star: Self-training enables video instruction tuning with any supervision

    Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, and Serena Yeung-levy. Video-star: Self-training enables video instruction tuning with any supervision. arXiv preprint arXiv:2407.06189, 2024

  68. [86]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

  69. [87]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    To see is to believe: Prompting GPT-4V for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023

  70. [88]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023

  71. [89]

    A convnet for the 2020s

    A convnet for the 2020s. In CVPR, 2022

  72. [90]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022

  73. [91]

    Docvqa: A dataset for vqa on document images

    Docvqa: A dataset for vqa on document images. In WACV, 2021

  74. [92]

    Dvqa: Understanding data visualizations via question answering

    Dvqa: Understanding data visualizations via question answering. In CVPR, 2018

  75. [93]

    TallyQA: Answering complex counting questions

    TallyQA: Answering complex counting questions. In AAAI, 2019

  76. [94]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017

  77. [95]

    How many unicorns are in this image? a safety evaluation benchmark for vision llms

    How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101, 2023

  78. [96]

    Vizwiz grand challenge: Answering visual questions from blind people

    Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018

  79. [97]

    Pre-trained language models do not help auto-regressive text-to-image generation

    Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023

  80. [98]

    ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model

    ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024

Showing first 80 references.