pith. machine review for the scientific record.

arxiv: 2412.14164 · v1 · submitted 2024-12-18 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal instruction tuning · visual generation · visual understanding · autoregressive models · LLM adaptation · unified multimodal models · MetaMorph · VPiT

The pith

Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Visual-Predictive Instruction Tuning as a way to extend standard visual instruction tuning so that a pretrained LLM learns to output both text tokens and continuous visual tokens from mixed image-text sequences formatted as instructions. It reports that generation capability can be unlocked efficiently once understanding is strengthened, and that data aimed at understanding improves both skills more than data aimed at generation. The resulting MetaMorph model reaches competitive results on understanding benchmarks and on generation tasks by drawing on the LLM's existing world knowledge to avoid failure modes typical of other image generators. These outcomes suggest that the same tuning process can adapt LLMs to handle both directions without separate model branches.

Core claim

Training an LLM to predict discrete text tokens and continuous visual tokens from any instruction-formatted sequence of images and text causes visual generation to appear as a side effect of stronger visual understanding, with understanding data proving more useful for both abilities than generation data; the resulting unified autoregressive model performs competitively on both tasks and draws on pretraining knowledge to reduce common generation errors.

What carries the argument

Visual-Predictive Instruction Tuning (VPiT), which formats multimodal data as instructions and trains the model to autoregressively predict the next text or visual token.
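
A minimal sketch of what a VPiT-style training step could look like, assuming PyTorch, a decoder-only LLM exposing per-position hidden states, text targets as token ids, and continuous visual targets taken from a frozen vision encoder. The names (vpit_loss, text_head, vis_head) and the MSE regression term with weight lambda_vis are illustrative assumptions, not the paper's published implementation.

    import torch
    import torch.nn.functional as F

    def vpit_loss(hidden, text_head, vis_head, text_targets, vis_targets,
                  vis_mask, lambda_vis=1.0):
        # hidden: (B, T, d) next-token-aligned LLM hidden states.
        # text_targets: (B, T) token ids, set to -100 at non-text positions.
        # vis_targets: (B, T, d_vis) frozen-encoder embeddings; vis_mask is a
        # (B, T) boolean marking positions that should emit a visual token.
        logits = text_head(hidden)  # (B, T, vocab)
        loss_text = F.cross_entropy(
            logits.flatten(0, 1), text_targets.flatten(), ignore_index=-100)
        pred_vis = vis_head(hidden)  # (B, T, d_vis)
        loss_vis = F.mse_loss(pred_vis[vis_mask], vis_targets[vis_mask])
        # A single autoregressive objective over mixed text and visual positions.
        return loss_text + lambda_vis * loss_vis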

If this is right

  • Understanding data improves both understanding and generation more effectively than generation data.
  • A relatively small amount of generation data is enough to unlock usable visual output once understanding has advanced.
  • The model can use world knowledge and reasoning from LLM pretraining to avoid common failure modes in image generation.
  • A single autoregressive architecture can handle both visual understanding and generation after this tuning process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method points toward simpler multimodal systems that avoid maintaining separate encoders or generators for each modality.
  • Similar emergence patterns might appear if the same tuning approach is applied to additional modalities such as audio or video.
  • The efficiency observed suggests that scaling the instruction data mixture could further reduce the amount of generation data required.

Load-bearing premise

The specific curated instruction-following multimodal datasets used are sufficient to reveal a general emergence of generation from understanding that transfers beyond the tested models and data mixtures.

What would settle it

Train the same base LLM with only understanding data and measure whether generation quality remains near baseline levels or improves substantially without any generation-specific examples.
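
A sketch of how that experiment could be wired up. Everything below is hypothetical scaffolding: train_model and eval_generation stand in for a real training run and a generation benchmark harness (FID, CLIPScore, or similar); neither name comes from the paper.

    def emergence_test(base_llm, understanding_data, generation_data,
                       train_model, eval_generation):
        # Arm A: understanding data only, zero generation-specific examples.
        model_a = train_model(base_llm, understanding_data)
        # Arm B: the same data plus the usual small generation slice,
        # ideally token-matched to Arm A for a fair comparison.
        model_b = train_model(base_llm, understanding_data + generation_data)
        scores = {
            "understanding_only": eval_generation(model_a),
            "with_generation_data": eval_generation(model_b),
        }
        # Arm A scoring near Arm B (or well above an untuned baseline) would
        # support emergence; Arm A near chance would mean the small
        # generation slice is doing the real work.
        return scores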

read the original abstract

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Visual-Predictive Instruction Tuning (VPiT), a simple extension to visual instruction tuning that trains a pretrained LLM to autoregressively predict both discrete text tokens and continuous visual tokens from instruction-following image-text sequences. The central empirical claims are that visual generation emerges as a natural byproduct of improved visual understanding (and can be unlocked with only a small amount of generation data) and that understanding data contributes more effectively to both capabilities than generation data. The authors train the MetaMorph model on this basis and report competitive performance on visual understanding and generation benchmarks, attributing success to LLMs' strong prior vision capabilities.

Significance. If the empirical claims hold after proper controls, the work would demonstrate an efficient route to unified multimodal autoregressive models that leverage existing LLM pretraining for both understanding and generation, potentially reducing the data and compute needed for high-quality visual synthesis while improving robustness via world knowledge.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and associated ablations: the claim that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' requires controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without these, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or the specific continuous visual token prediction setup. A token-matched mixture construction is sketched after these comments.
  2. [§3 (VPiT Method)] §3 (VPiT Method): the description of the continuous visual token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or other tokenizer) or the exact regression loss used, which is load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.
minor comments (2)
  1. [Abstract] Abstract and §1: the statement of 'competitive performance' should include the specific benchmarks, metrics, and baseline models for immediate context.
  2. [§4 (Experiments)] Figure captions and tables: several result tables lack error bars or run-to-run variance, making it difficult to assess the reliability of the reported gains from adding small generation data.
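
A minimal sketch of the control requested in major comment 1: build data mixtures that vary the understanding-to-generation ratio while spending the same total token budget. The dataset interface and the count_tokens helper are assumptions for illustration, not anything specified in the paper or this report.

    import random

    def token_matched_mixture(understanding, generation, ratio, budget,
                              count_tokens, seed=0):
        # ratio: fraction of the token budget drawn from understanding data.
        rng = random.Random(seed)
        pools = [(list(understanding), ratio * budget),
                 (list(generation), (1.0 - ratio) * budget)]
        mixture, used = [], 0
        for pool, target in pools:
            rng.shuffle(pool)
            spent = 0
            for example in pool:
                n = count_tokens(example)
                if spent + n > target:
                    break
                mixture.append(example)
                spent += n
            used += spent
        rng.shuffle(mixture)
        # `used` stays close to `budget` at every ratio, so runs that vary
        # only `ratio` are comparable at a fixed training-token count.
        return mixture, used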

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments) and associated ablations: the claim that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' requires controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without these, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or the specific continuous visual token prediction setup.

    Authors: We agree that holding total training tokens fixed while varying the data mixture provides a stronger test of the relative value of understanding versus generation data. In the revised manuscript we have added a new set of controlled ablations in §4.3 that keep the total number of training tokens constant across different understanding-to-generation ratios. These experiments show that mixtures with a higher proportion of understanding data yield better results on both understanding and generation benchmarks than equivalent-token mixtures dominated by generation data, consistent with our original observations. We have updated the text and figures to present these controls explicitly. revision: yes

  2. Referee: [§3 (VPiT Method)] §3 (VPiT Method): the description of the continuous visual token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or other tokenizer) or the exact regression loss used, which is load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.

    Authors: We thank the referee for highlighting this omission. The visual tokens are produced by a frozen pretrained VQ-VAE encoder that maps each image patch to a continuous embedding; these embeddings serve as regression targets. The prediction head is a single linear layer on top of the LLM that outputs vectors of the same dimensionality, and training uses mean-squared-error loss between the predicted and target embeddings. We have expanded the method section (§3.2) with these details, a precise loss equation, and a short pseudocode snippet; a sketch in the same spirit follows below. The head remains lightweight, supporting the claim that generation capability emerges with minimal additional parameters once the LLM has been instruction-tuned on understanding data. revision: yes
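
Taking the simulated rebuttal at its word, the head it describes reduces to a single linear projection trained with mean-squared error against frozen-tokenizer embeddings. The sketch below follows that description only; it is not a verified detail of the paper, and the class and function names are invented for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualPredictionHead(nn.Module):
        # Per the rebuttal's claim, the only new parameters on top of the LLM.
        def __init__(self, d_model: int, d_visual: int):
            super().__init__()
            self.proj = nn.Linear(d_model, d_visual)

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            return self.proj(hidden)  # (B, T, d_visual)

    def visual_regression_loss(head, hidden, target_embeddings, vis_mask):
        # Targets come from the frozen pretrained visual tokenizer; the loss
        # is applied only at positions that should emit visual tokens.
        pred = head(hidden)
        return F.mse_loss(pred[vis_mask], target_embeddings[vis_mask])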

Circularity Check

0 steps flagged

No circularity: empirical observations from training runs, not tautological derivation

full rationale

The paper reports empirical results from applying Visual-Predictive Instruction Tuning to pretrained LLMs on curated instruction-following multimodal datasets. Claims that generation emerges as a byproduct of understanding rest on observed performance metrics after joint training, not on any closed mathematical derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central result to its own inputs by construction. No equations or uniqueness theorems are invoked that collapse to prior outputs; the work is grounded in external benchmarks through reported training runs and evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pretrained LLMs already encode useful visual priors that instruction tuning can surface; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Pretrained LLMs possess strong prior vision capabilities that can be efficiently adapted to both understanding and generation via instruction tuning.
    Explicitly stated in the final sentence of the abstract.

pith-pipeline@v0.9.0 · 5534 in / 1203 out tokens · 59390 ms · 2026-05-17T07:46:11.714221+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 8.0

    VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

  2. Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.

  3. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  4. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  5. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  6. PhotoFramer: Multi-modal Image Composition Instruction

    cs.CV 2025-11 conditional novelty 6.0

    PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.

  7. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    cs.LG 2025-05 conditional novelty 6.0

    LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

  8. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  9. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  10. WorldVLA: Towards Autoregressive Action World Model

    cs.RO 2025-06 unverdicted novelty 5.0

    WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

  11. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  12. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  13. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  14. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  15. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  16. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  17. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

282 extracted references · 282 canonical work pages · cited by 17 Pith papers · 29 internal anchors

  1. [3]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  2. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  3. [5]

    ICML 2024 Tutorial: Physics of Language Models, 2024

    Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, 2024. Project page: https://physics.allen-zhu.com/

  4. [6]

    Claude, 2024

    Anthropic. Claude, 2024

  5. [7]

    Layer normalization

    Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016

  6. [8]

    Revisiting feature prediction for learning visual representations from video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In TMLR, 2024

  7. [9]

    High fidelity visualization of what your self-supervised representation knows about

    Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. In TMLR, 2022

  8. [10]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023

  9. [12]

    Sharegpt4video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS, 2024 a

  10. [14]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024

  11. [15]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In ICLR, 2024

  12. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  13. [17]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024

  14. [20]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017 a

  15. [21]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017 b

  16. [23]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021

  17. [24]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  18. [25]

    The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024

  19. [26]

    Brave: Broadening the visual encoding of vision-language models

    Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2025

  20. [27]

    Generating images with multimodal language models

    Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. In NeurIPS, 2024

  21. [28]

    Learning action and reasoning-centric image editing from videos and simulations

    Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulations. In NeurIPS, 2024

  22. [29]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024 a

  23. [31]

    A path towards autonomous machine intelligence version 0.9.2

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1--62, 2022

  24. [33]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024 b

  25. [34]

    Return of unconditional generation: A self-supervised representation generation method

    Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. In NeurIPS, 2024 c

  26. [35]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  27. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  28. [37]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024 a

  29. [38]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b

  30. [39]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024 c

  31. [40]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024 d

  32. [41]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  33. [42]

    Unified-io: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022 a

  34. [43]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024

  35. [44]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022 b

  36. [45]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022

  37. [47]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019

  38. [48]

    Autonomous evaluation and refinement of digital agents

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. In COLM, 2024 a

  39. [49]

    Kosmos-g: Generating images in context with multimodal large language models

    Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. In ICLR, 2024 b

  40. [50]

    Diffusion autoencoders: Toward a meaningful and decodable representation

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In CVPR, 2022

  41. [51]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  42. [52]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020

  43. [53]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2019

  44. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  45. [55]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022

  46. [56]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In NeurIPS, 2024

  47. [57]

    Textcaps: a dataset for image captioning with reading comprehension, 2020

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020

  48. [58]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024 a

  49. [59]

    Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. In ICLR, 2024 b

  50. [60]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model, 2023

  51. [62]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024 a

  52. [63]

    Mass-producing failures of multimodal systems with language models

    Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2024 b

  53. [64]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024 c

  54. [65]

    LLaMA 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. 2023

  55. [68]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022 a

  56. [69]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022 b

  57. [71]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024

  58. [73]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  59. [74]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024

  60. [75]

    Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

    Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. ArXiv e-prints, abs/2407.20311, 2024. Full version available at http://arxiv.org/abs/2407.20311

  61. [76]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024 a

  62. [78]

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022

  63. [79]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

  64. [80]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In NeurIPS, 2024

  65. [82]

    Pre-trained language models do not help auto-regressive text-to-image generation

    Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023

  66. [83]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In NeurIPS, 2024 a

  67. [85]

    Video-star: Self-training enables video instruction tuning with any supervision

    Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, and Serena Yeung-levy. Video-star: Self-training enables video instruction tuning with any supervision. arXiv preprint arXiv:2407.06189, 2024

  68. [86]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

  69. [87]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    To see is to believe: Prompting GPT-4V for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023

  70. [88]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023

  71. [89]

    A convnet for the 2020s

    A convnet for the 2020s. In CVPR, 2022

  72. [90]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022

  73. [91]

    Docvqa: A dataset for vqa on document images

    Docvqa: A dataset for vqa on document images. In WACV, 2021

  74. [92]

    Dvqa: Understanding data visualizations via question answering

    Dvqa: Understanding data visualizations via question answering. In CVPR, 2018

  75. [93]

    TallyQA: Answering complex counting questions

    TallyQA: Answering complex counting questions. In AAAI, 2019

  76. [94]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017

  77. [95]

    How many unicorns are in this image? a safety evaluation benchmark for vision llms

    How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101, 2023

  78. [96]

    Vizwiz grand challenge: Answering visual questions from blind people

    Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018

  79. [97]

    Pre-trained language models do not help auto-regressive text-to-image generation

    Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023

  80. [98]

    ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model

    ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024

Showing first 80 references.