Recognition: 2 theorem links
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Pith reviewed 2026-05-17 07:46 UTC · model grok-4.3
The pith
Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training an LLM to predict discrete text tokens and continuous visual tokens from any instruction-formatted sequence of images and text causes visual generation to appear as a side effect of stronger visual understanding, with understanding data proving more useful for both abilities than generation data; the resulting unified autoregressive model performs competitively on both tasks and draws on pretraining knowledge to reduce common generation errors.
What carries the argument
Visual-Predictive Instruction Tuning (VPiT), which formats multimodal data as instructions and trains the model to autoregressively predict the next text or visual token.
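The objective described above can be sketched as a toy mixed next-token loss. This is a hypothetical reconstruction, not the paper's code: the function name and tensor shapes are ours, and we assume cross-entropy at discrete text positions and mean-squared error at continuous visual positions.

```python
# Hypothetical sketch of a VPiT-style mixed objective (our reconstruction):
# text positions are scored with cross-entropy over a discrete vocabulary,
# visual positions with a regression loss against continuous target embeddings.
import numpy as np

def mixed_next_token_loss(logits, text_targets, pred_embeds, visual_targets, is_visual):
    """Average loss over a sequence interleaving text and visual positions.

    logits:         (T, V) unnormalized scores for text positions
    text_targets:   (T,)   int token ids (ignored where is_visual is True)
    pred_embeds:    (T, D) predicted continuous visual embeddings
    visual_targets: (T, D) target embeddings (ignored where is_visual is False)
    is_visual:      (T,)   bool mask selecting visual positions
    """
    losses = []
    for t in range(len(is_visual)):
        if is_visual[t]:
            diff = pred_embeds[t] - visual_targets[t]
            losses.append(float(np.mean(diff * diff)))         # MSE on embeddings
        else:
            z = logits[t] - logits[t].max()                    # numerically stable softmax
            log_probs = z - np.log(np.exp(z).sum())
            losses.append(float(-log_probs[text_targets[t]]))  # cross-entropy
    return sum(losses) / len(losses)
```

A single scalar loss over both position types is what lets one autoregressive model be tuned on mixed image-text instruction sequences.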
If this is right
- Understanding data improves both understanding and generation more effectively than generation data.
- A relatively small amount of generation data is enough to unlock usable visual output once understanding has advanced.
- The model can use world knowledge and reasoning from LLM pretraining to avoid common failure modes in image generation.
- A single autoregressive architecture can handle both visual understanding and generation after this tuning process.
Where Pith is reading between the lines
- The method points toward simpler multimodal systems that avoid maintaining separate encoders or generators for each modality.
- Similar emergence patterns might appear if the same tuning approach is applied to additional modalities such as audio or video.
- The efficiency observed suggests that scaling the instruction data mixture could further reduce the amount of generation data required.
Load-bearing premise
The specific curated instruction-following multimodal datasets used are sufficient to reveal a general emergence of generation from understanding that transfers beyond the tested models and data mixtures.
What would settle it
Train the same base LLM with only understanding data and measure whether generation quality remains near baseline levels or improves substantially without any generation-specific examples.
Original abstract
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Visual-Predictive Instruction Tuning (VPiT), a simple extension to visual instruction tuning that trains a pretrained LLM to autoregressively predict both discrete text tokens and continuous visual tokens from instruction-following image-text sequences. The central empirical claims are that visual generation emerges as a natural byproduct of improved visual understanding (and can be unlocked with only a small amount of generation data) and that understanding data contributes more effectively to both capabilities than generation data. The authors train the MetaMorph model on this basis and report competitive performance on visual understanding and generation benchmarks, attributing success to LLMs' strong prior vision capabilities.
Significance. If the empirical claims hold after proper controls, the work would demonstrate an efficient route to unified multimodal autoregressive models that leverage existing LLM pretraining for both understanding and generation, potentially reducing the data and compute needed for high-quality visual synthesis while improving robustness via world knowledge.
major comments (2)
- [§4 (Experiments)] §4 and its associated ablations: the claims that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' require controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without such controls, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or from the specific continuous visual-token prediction setup.
- [§3 (VPiT Method)] §3: the description of the continuous visual-token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or another tokenizer) or the exact regression loss used; both details are load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.
minor comments (2)
- [Abstract] Abstract and §1: the statement of 'competitive performance' should include the specific benchmarks, metrics, and baseline models for immediate context.
- [§4 (Experiments)] Figure captions and tables: several result tables lack error bars or run-to-run variance, making it difficult to assess the reliability of the reported gains from adding small generation data.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.
Point-by-point responses
-
Referee: [§4 (Experiments)] §4 and its associated ablations: the claims that 'visual generation ability emerges as a natural byproduct of improved visual understanding' and that 'understanding data contributes to both capabilities more effectively than generation data' require controls that hold total training tokens or steps fixed while varying the proportion of understanding versus generation examples. Without such controls, the observed generation performance cannot be distinguished from generic benefits of joint autoregressive modeling on mixed sequences or from the specific continuous visual-token prediction setup.
Authors: We agree that holding total training tokens fixed while varying the data mixture provides a stronger test of the relative value of understanding versus generation data. In the revised manuscript we have added a new set of controlled ablations in §4.3 that keep the total number of training tokens constant across different understanding-to-generation ratios. These experiments show that mixtures with a higher proportion of understanding data yield better results on both understanding and generation benchmarks than equivalent-token mixtures dominated by generation data, consistent with our original observations. We have updated the text and figures to present these controls explicitly. revision: yes
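The fixed-budget control discussed above can be sketched as a small helper; `mixture_schedule` is a hypothetical name and the token-level split is our simplification of the described ablation design.

```python
# Illustrative sketch of a fixed-token-budget ablation (our simplification):
# hold total training tokens constant and sweep the understanding/generation split.
def mixture_schedule(total_tokens, understanding_fractions):
    """Return (understanding_tokens, generation_tokens) pairs, each summing to the budget."""
    schedule = []
    for frac in understanding_fractions:
        u = round(total_tokens * frac)     # tokens allocated to understanding data
        schedule.append((u, total_tokens - u))
    return schedule
```

Because every mixture consumes the same token budget, any difference in downstream metrics can be attributed to the data composition rather than the amount of training.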
-
Referee: [§3 (VPiT Method)] §3: the description of the continuous visual-token prediction head and loss does not specify how the visual tokens are obtained (e.g., from a VQ-VAE or another tokenizer) or the exact regression loss used; both details are load-bearing for reproducing the 'parameter-free' emergence claim and for understanding why understanding data transfers so effectively.
Authors: We thank the referee for highlighting this omission. The visual tokens are produced by a frozen pretrained vision encoder that maps each image patch to a continuous embedding; these embeddings serve as regression targets. The prediction head is a single linear layer on top of the LLM that outputs vectors of the same dimensionality, and training uses a mean-squared-error loss between predicted and target embeddings. We have expanded the method section (§3.2) with these details, a precise loss equation, and a short pseudocode snippet. The head remains lightweight, supporting the claim that generation capability emerges with minimal additional parameters once the LLM has been instruction-tuned on understanding data. revision: yes
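A minimal sketch of the head described in this response, assuming the details stated above (a single linear layer regressing continuous embeddings with MSE); this is our reconstruction, not the authors' code, and all names and dimensions are illustrative.

```python
# Hypothetical sketch of a lightweight visual prediction head (our reconstruction):
# a single linear map from LLM hidden states to the visual-embedding space.
import numpy as np

rng = np.random.default_rng(0)

def linear_head(hidden, W, b):
    """Project (T, H) hidden states to (T, D) predicted visual embeddings."""
    return hidden @ W + b

def mse(pred, target):
    """Mean-squared-error regression loss on continuous embeddings."""
    return float(np.mean((pred - target) ** 2))

# The head adds only H*D + D parameters on top of the LLM, which is why it is
# described as lightweight relative to the base model.
H, D, T = 8, 4, 3
W = rng.normal(size=(H, D))
b = np.zeros(D)
hidden = rng.normal(size=(T, H))
pred = linear_head(hidden, W, b)
```

Since the head is a plain linear projection, almost all of the generation capability must come from the instruction-tuned LLM itself rather than from new parameters.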
Circularity Check
No circularity: empirical observations from training runs, not tautological derivation
Full rationale
The paper reports empirical results from applying Visual-Predictive Instruction Tuning to pretrained LLMs on curated instruction-following multimodal datasets. The claim that generation emerges as a byproduct of understanding rests on observed performance metrics after joint training, not on a closed mathematical derivation, a fitted parameter renamed as a prediction, or a self-citation chain that reduces the central result to its own inputs by construction. No equations or uniqueness theorems are invoked that collapse to prior outputs, and the work is validated against external benchmarks via reported training runs and evaluations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained LLMs possess strong prior vision capabilities that can be efficiently adapted to both understanding and generation via instruction tuning.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
PhotoFramer: Multi-modal Image Composition Instruction
PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
-
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[4]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022
work page 2022
-
[5]
ICML 2024 Tutorial: Physics of Language Models, 2024
Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, 2024. Project page: https://physics.allen-zhu.com/
work page 2024
-
[7]
Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016
work page 2016
-
[8]
Revisiting feature prediction for learning visual representations from video
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In TMLR, 2024
work page 2024
-
[9]
High fidelity visualization of what your self-supervised representation knows about
Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. In TMLR, 2022
work page 2022
-
[10]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023
work page 2023
-
[12]
Sharegpt4video: Improving video understanding and generation with better captions
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS, 2024 a
work page 2024
-
[14]
Instructblip: Towards general-purpose vision-language models with instruction tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024
work page 2024
-
[15]
Dreamllm: Synergistic multimodal comprehension and creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In ICLR, 2024
work page 2024
-
[16]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021
work page 2021
-
[17]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In NeurIPS, 2024
work page 2024
-
[20]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017 a
work page 2017
-
[21]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017 b
work page 2017
-
[23]
Clipscore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021
work page 2021
-
[24]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017
work page 2017
-
[25]
The platonic representation hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML, 2024
work page 2024
-
[26]
Brave: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2025
work page 2025
-
[27]
Generating images with multimodal language models
Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. In NeurIPS, 2024
work page 2024
-
[28]
Learning action and reasoning-centric image editing from videos and simulations
Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulations. In NeurIPS, 2024
work page 2024
-
[29]
Obelics: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. In NeurIPS, 2024 a
work page 2024
-
[31]
A path towards autonomous machine intelligence version 0.9
Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1-62, 2022
work page 2022
-
[33]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024 b
work page 2024
-
[34]
Return of unconditional generation: A self-supervised representation generation method
Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. In NeurIPS, 2024 c
work page 2024
-
[35]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014
work page 2014
-
[36]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023
work page 2023
-
[37]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024 a
work page 2024
-
[38]
Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024 b
work page 2024
-
[39]
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024 c
work page 2024
-
[40]
Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024 d
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024 d
work page 2024
-
[41]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019
work page 2019
-
[42]
Unified-io: A unified model for vision, language, and multi-modal tasks
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022 a
work page 2022
-
[43]
Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024
work page 2024
-
[44]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022 b
work page 2022
-
[45]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022
work page 2022
-
[47]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019
work page 2019
-
[48]
Autonomous evaluation and refinement of digital agents
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. In COLM, 2024 a
work page 2024
-
[49]
Kosmos-g: Generating images in context with multimodal large language models
Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. In ICLR, 2024 b
work page 2024
-
[50]
Diffusion autoencoders: Toward a meaningful and decodable representation
Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In CVPR, 2022
work page 2022
-
[51]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021
work page 2021
-
[52]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020
work page 2020
-
[53]
Exploring the limits of transfer learning with a unified text-to-text transformer
Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2019
work page 2019
-
[54]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022
work page 2022
-
[55]
Laion-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022
work page 2022
-
[56]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In NeurIPS, 2024
work page 2024
-
[57]
Textcaps: a dataset for image captioning with reading comprehension, 2020
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020
work page 2020
-
[58]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024 a
work page 2024
-
[59]
Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. In ICLR, 2024 b
work page 2024
-
[62]
Cambrian-1: A fully open, vision-centric exploration of multimodal llms
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024 a
work page 2024
-
[63]
Mass-producing failures of multimodal systems with language models
Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2024 b
work page 2024
-
[64]
Eyes wide shut? exploring the visual shortcomings of multimodal llms
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024 c
work page 2024
-
[65]
LLaMA 2: Open foundation and fine-tuned chat models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. 2023
work page 2023
-
[68]
Finetuned language models are zero-shot learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022 a
work page 2022
-
[69]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022 b
work page 2022
-
[71]
V*: Guided visual search as a core mechanism in multimodal llms
Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024
work page 2024
-
[73]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024
work page 2024
-
[74]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024
work page 2024
-
[75]
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. ArXiv e-prints, abs/2407.20311, 2024. Full version available at http://arxiv.org/abs/2407.20311
-
[76]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024 a
work page 2024
-
[78]
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022
work page 2022
-
[79]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023
work page 2023
-
[80]
Fine-tuning large vision-language models as decision-making agents via reinforcement learning
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In NeurIPS, 2024
work page 2024
-
[82]
Pre-trained language models do not help auto-regressive text-to-image generation
Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023
work page 2023
-
[83]
Lima: Less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In NeurIPS, 2024 a
work page 2024
-
[85]
Video-star: Self-training enables video instruction tuning with any supervision
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, and Serena Yeung-levy. Video-star: Self-training enables video instruction tuning with any supervision. In arXiv preprint arXiv:2407.06189, 2024
-
[86]
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022
work page 2022
-
[87]
To see is to believe: Prompting gpt-4v for better visual instruction tuning
To see is to believe: Prompting GPT-4V for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023
-
[88]
Llavar: Enhanced visual instruction tuning for text-rich image understanding
Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023
-
[90]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, 2022
-
[92]
Dvqa: Understanding data visualizations via question answering. In CVPR, 2018
-
[94]
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017
-
[95]
How many unicorns are in this image? a safety evaluation benchmark for vision llms
How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101, 2023
-
[96]
Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018
-
[97]
Pre-trained language models do not help auto-regressive text-to-image generation. In EMNLP, 2023
-
[98]
ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model
ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024
discussion (0)