OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Pith reviewed 2026-05-14 01:47 UTC · model grok-4.3
pith:4SOQHN6W Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{4SOQHN6W}
Prints a linked pith:4SOQHN6W badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
OpenFlamingo delivers open-source vision-language models that reach 80-89 percent of Flamingo performance across seven datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 and 89 percent of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite, and we share the models and code publicly.
What carries the argument
The OpenFlamingo model family, which uses autoregressive next-token prediction over interleaved image and text sequences to support few-shot multimodal reasoning.
If this is right
- Researchers can now train comparable multimodal models from public resources without starting from proprietary checkpoints.
- The 80-89 percent performance band shows that core few-shot vision-language capabilities transfer to open training pipelines.
- Scaling experiments across the 3B to 9B range supply baselines for studying how size affects multimodal task accuracy.
- The shared evaluation suite enables direct head-to-head comparisons of new open models against the Flamingo reference.
- Public code and weights allow the community to iterate on data mixtures, training schedules, and architectural tweaks.
Where Pith is reading between the lines
- If further open training closes the remaining gap, the results would indicate that Flamingo's gains come mainly from architecture and scale rather than from exclusive data or methods.
- The released framework could serve as a template for open replications of other large closed multimodal systems.
- Extending the same open training recipe to higher parameter counts or additional modalities such as video would test how far the replication approach generalizes.
Load-bearing premise
The reported performance numbers were measured under evaluation conditions comparable to the original Flamingo models and the released code and data suffice for independent reproduction.
What would settle it
Re-running the released code and data on the seven datasets under the same evaluation protocol and obtaining average scores below 70 percent of Flamingo's reported numbers would undermine the replication claim.
read the original abstract
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenFlamingo, an open-source family of autoregressive vision-language models (3B–9B parameters) intended as a replication of DeepMind’s Flamingo. It reports that these models achieve 80–89% of the corresponding Flamingo performance on average across seven vision-language datasets, provides details on architecture, training data, hyperparameters, and evaluation protocols, and releases models and code at the linked GitHub repository.
Significance. If the relative performance numbers hold under matched evaluation conditions, the work is significant as the first public, reproducible alternative to the closed-source Flamingo models. The explicit release of code, models, and training details directly addresses reproducibility concerns in large-scale vision-language research and lowers the barrier for follow-on work.
major comments (3)
- [Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.
- [Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
- [Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
minor comments (2)
- [Abstract] The GitHub URL in the abstract should be accompanied by a permanent archive link (e.g., Zenodo DOI) to guard against repository changes.
- [Model Architecture] Notation for model sizes (3B, 9B) should be defined consistently with parameter counts reported in Table 1 or the model architecture section.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and clarify our evaluation approach, limitations, and the scope of this work as an open-source replication effort.
read point-by-point responses
-
Referee: [Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.
Authors: We agree that a side-by-side protocol comparison would improve transparency. While we cannot access the closed-source Flamingo code for exact verification, our evaluation suite was designed to follow the protocols described in the Flamingo paper as closely as possible, including shot counts, prompt templates, and metrics. The released code at https://github.com/mlfoundations/open_flamingo contains the precise evaluation scripts, data loaders, and preprocessing steps used. In the revision we will add an explicit comparison table in the Evaluation section detailing our choices against the Flamingo paper descriptions, along with any unavoidable differences. revision: yes
-
Referee: [Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
Authors: We acknowledge that error bars and multi-seed statistics would strengthen the presentation. However, the computational cost of repeated full evaluations on 3B–9B models across seven datasets is prohibitive. We followed the single-run reporting convention common in large-scale vision-language papers and observed consistent relative performance across diverse tasks, which suggests the gap is not driven by outlier variance. In the revision we will add a limitations paragraph noting this constraint and the consistency evidence. revision: partial
-
Referee: [Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
Authors: We agree that systematic ablations would be informative. The primary objective of this technical report is to release reproducible models and code that close most of the gap to Flamingo, rather than to perform an exhaustive ablation study that would require orders-of-magnitude more compute. We do describe key hyperparameter choices and data mixtures in the Training section. In the revision we will expand the discussion to highlight the design decisions we found most impactful during development, while noting that a full ablation study remains future work. revision: partial
Circularity Check
No circularity; empirical replication claims rest on external baselines without internal self-definition or fitted predictions
full rationale
The paper is a technical report describing an open-source replication of Flamingo models, including architecture, training data, hyperparameters, and an evaluation suite. The headline result (80-89% relative performance on seven datasets) is an empirical measurement against the closed-source Flamingo baseline on public datasets. No equations, derivations, or first-principles results are presented that reduce to the paper's own inputs by construction. There are no self-definitional quantities, no parameters fitted to a subset and then relabeled as predictions, and no load-bearing self-citations. The comparison assumes protocol equivalence, but this is an external validity concern rather than circularity per the enumerated patterns. The derivation chain consists of standard model training and benchmarking steps that are independently verifiable from the released code and data.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 48 Pith papers
-
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
-
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
-
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
Introduces VietPET-RoI dataset with fine-grained RoI annotations for Vietnamese 3D PET/CT and HiRRA graph framework that improves report generation by modeling region dependencies, claiming large gains over prior models.
-
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
Introduces the first large-scale 3D PET/CT dataset with fine-grained RoI annotations for Vietnamese and a graph-enhanced HiRRA framework that achieves SOTA report generation by modeling RoI dependencies.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
-
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
-
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
-
Phantasia: Context-Adaptive Backdoors in Vision Language Models
Phantasia is a new backdoor attack on VLMs that dynamically aligns malicious outputs with input context to achieve higher stealth and state-of-the-art success rates compared to static-pattern attacks.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Vision-aligned Latent Reasoning for Multi-modal Large Language Model
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
-
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x spe...
-
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
-
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image mo...
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
-
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
CogVLM: Visual Expert for Pretrained Language Models
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
-
Vision-Language Foundation Models as Effective Robot Imitators
RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...
-
Online In-Context Distillation for Low-Resource Vision Language Models
Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
World Model on Million-Length Video And Language With Blockwise RingAttention
Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
-
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
-
[1]
com/blog/continuous-batching-llm-inference
Armen Aghajanyan, Po-Yao (Bernie) Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Na- man Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520 , 2022
-
[2]
Lawrence Zitnick, Devi Parikh, and Dhruv Batra
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Vqa: Visual question 10 answering. International Journal of Computer Vision, 123:4–31, 2015
work page 2015
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35: 23716–23736, 2022
work page 2022
-
[4]
Clip retrieval: Easily compute clip embeddings and build a clip re- trieval system with them
Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip re- trieval system with them. https://github.com/ rom1504/clip-retrieval, 2022
work page 2022
-
[5]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the oppor- tunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Se- bastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled mul- tilingual language-image model. arXiv preprint arXiv:2209.06794, 2022
work page internal anchor Pith review arXiv 2022
-
[7]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ishna Vedantam, Saurabh Gupta, Piotr Doll´ ar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[8]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Haus- man, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Koala: A dialogue model for academic research,
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108 , 2023
-
[10]
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023
-
[11]
Danna Gurari, Qing Li, Abigale Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: An- swering visual questions from blind people. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018
work page 2018
-
[12]
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023
work page internal anchor Pith review arXiv 2023
-
[13]
Perceiver: General perception with iterative at- tention
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative at- tention. In International conference on machine learning, pages 4651–4664. PMLR, 2021
work page 2021
-
[14]
Scaling up vi- sual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up vi- sual and vision-language representation learning with noisy text supervision. In International Con- ference on Machine Learning, pages 4904–4916. PMLR, 2021
work page 2021
-
[15]
The hateful memes challenge: Detecting hate speech in multi- modal memes
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multi- modal memes. arXiv preprint arXiv:2005.04790 , 2020
-
[16]
Grounding language models to images for multimodal gen- eration
Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023
-
[17]
Gokul Karthik Kumar and Karthik Nandakumar. Hate-clipper: Multimodal hateful meme classifi- cation based on cross-modal interaction of clip features. arXiv preprint arXiv:2210.05916 , 2022
-
[18]
Obelisc: An open web-scale filtered dataset of interleaved image-text documents
Hugo Lauren¸ con, Lucile Saulnier, L´ eo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, 11 Thomas Wang, Siddharth Karamcheti, Alexan- der M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelisc: An open web-scale fil- tered dataset of interleaved image-text docu- ments. arXiv preprint arXiv:2306.16527 , 2023
-
[19]
Mimic-it: Multi-modal in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, C. Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruc- tion tuning. arXiv preprint arXiv:2306.05425 , 2023
-
[20]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 , 2023
work page internal anchor Pith review arXiv 2023
-
[21]
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng da Cao, Ji Zhang, Songfang Huang, Feiran Huang, Jingren Zhou, and Luo Si. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022
-
[22]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language under- standing and generation. In International Con- ference on Machine Learning, pages 12888–12900. PMLR, 2022
work page 2022
-
[23]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Pi- otr Doll´ ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Pro- ceedings, Part V 13 , pages 740–755. Springer, 2014
work page 2014
-
[25]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3190–3199, 2019
work page 2019
-
[27]
Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023
MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023
work page 2023
-
[28]
R OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Bryan A Plummer, Liwei Wang, Chris M Cer- vantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collect- ing region-to-phrase correspondences for richer image-to-sentence models. In IEEE international conference on computer vision, pages 2641–2649, 2015
work page 2015
-
[30]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021
work page 2021
-
[31]
arXiv preprint arXiv:2207.07635 , year=
Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a caption worth a thousand images? a controlled study for representation learning. arXiv preprint arXiv:2207.07635, 2022
-
[32]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[33]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 8309–8318, 2019
work page 2019
-
[34]
Thomas Breuel. WebDataset: A high- performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch. Available at: https: //github.com/webdataset/webdataset, 2020. 12
work page 2020
-
[35]
Together.xyz. Releasing 3b and 7b redpajama- incite family of models including base, instruction-tuned & chat models. https://www. together.xyz/blog/redpajama-models-v1, 2023
work page 2023
-
[36]
Lawrence Zitnick, and Devi Parikh
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition , pages 4566–4575, 2014
work page 2014
-
[37]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[38]
An empirical study of gpt-3 for few-shot knowledge-based vqa
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xi- aowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022
work page 2022
-
[39]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large lan- guage models with multimodality. arXiv preprint arXiv:2304.14178, 2023
work page Pith review arXiv 2023
-
[40]
Peter Young, Alice Lai, Micah Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for se- mantic inference over event descriptions. Trans- actions of the Association for Computational Lin- guistics, 2:67–78, 2014
work page 2014
-
[41]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Aojun Zhou, Xi- angfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init at- tention. arXiv preprint arXiv:2303.16199 , 2023
work page Pith review arXiv 2023
-
[42]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Vic- toria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. arXiv preprint arXiv:1909.11059, 2019
-
[45]
Multimodal c4: An open, billion-scale corpus of images interleaved with text
Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939 , 2023. 13 Table 8: Fine-tuned state-of-the-art numbers used in this report. Method Dataset Score...
-
[46]
tell stories, reference real-world entities/events, etc
be creative. tell stories, reference real-world entities/events, etc. The images/sentence can play off each-other in fun ways
-
[47]
be interesting. generate sequences that are cool, fun, compelling and require interesting commonsense reasoning across and between images/sentences
-
[48]
(Image A, Image B, Sentence 1, Image C, Image D, Sentence 2, Image E, Image F, Sentence 3)
make sure the image descriptions are self-contained, and the output format follows the requested format. user(human authored) Generate a creative, interesting sequence of sentences/images with the following format: (image A, sentence 1, image B, sentence 2, image C, sentence 3) assistant(human authored) Sure! Sequence format: (image A, sentence 1, image B...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.