OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

arxiv: 2308.01390 · v2 · pith:4SOQHN6W · submitted 2023-08-02 · 💻 cs.CV · cs.AI · cs.LG

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt

Pith reviewed 2026-05-14 01:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG

keywords OpenFlamingo, vision-language models, autoregressive models, open-source replication, multimodal learning, Flamingo, few-shot evaluation, model benchmarks

The pith

OpenFlamingo delivers open-source vision-language models that reach 80-89 percent of Flamingo performance across seven datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenFlamingo as a public family of autoregressive vision-language models sized from 3 billion to 9 billion parameters. These models replicate an existing closed system by training on openly available data and code. They attain between 80 and 89 percent of the original performance when measured on the same seven vision-language benchmarks. Releasing the full training setup, models, and evaluation details lets other researchers run, verify, and build on the work without access to private resources.

Core claim

We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 and 89 percent of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite, and we share the models and code publicly.

What carries the argument

The OpenFlamingo model family, which uses autoregressive next-token prediction over interleaved image and text sequences to support few-shot multimodal reasoning.
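
As a rough illustration of what an interleaved image and text sequence can look like for few-shot prompting, the Python sketch below builds a prompt string from demonstration pairs plus a query. The marker strings `<image>` and `<|endofchunk|>`, the `Example` class, and the `build_few_shot_prompt` helper are assumptions made for this sketch and are not confirmed by the text above. Pixel data is represented only by an identifier.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker strings; a real model defines its own special tokens.
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


@dataclass
class Example:
    image_id: str  # stand-in for the actual image tensor
    text: str      # caption or answer paired with the image


def build_few_shot_prompt(shots: List[Example], query_prefix: str) -> str:
    """Interleave image markers with text, ending on an open query.

    Each demonstration becomes "<image>text<|endofchunk|>"; the query
    contributes a final image marker and a text prefix that the model
    would continue via next-token prediction.
    """
    parts = [f"{IMAGE_TOKEN}{shot.text}{END_OF_CHUNK}" for shot in shots]
    parts.append(f"{IMAGE_TOKEN}{query_prefix}")
    return "".join(parts)


shots = [
    Example("img_0", "An image of two cats."),
    Example("img_1", "An image of a bathroom sink."),
]
prompt = build_few_shot_prompt(shots, "An image of")
print(prompt)
```

The number of image markers in the prompt equals the number of demonstrations plus one, so the images can be supplied to the model in the same order as the markers appear.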

If this is right

Researchers can now train comparable multimodal models from public resources without starting from proprietary checkpoints.
The 80-89 percent performance band shows that core few-shot vision-language capabilities transfer to open training pipelines.
Scaling experiments across the 3B to 9B range supply baselines for studying how size affects multimodal task accuracy.
The shared evaluation suite enables direct head-to-head comparisons of new open models against the Flamingo reference.
Public code and weights allow the community to iterate on data mixtures, training schedules, and architectural tweaks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If further open training closes the remaining gap, the results would indicate that Flamingo's gains come mainly from architecture and scale rather than from exclusive data or methods.
The released framework could serve as a template for open replications of other large closed multimodal systems.
Extending the same open training recipe to higher parameter counts or additional modalities such as video would test how far the replication approach generalizes.

Load-bearing premise

The reported performance numbers were measured under evaluation conditions comparable to the original Flamingo models and the released code and data suffice for independent reproduction.

What would settle it

Re-running the released code and data on the seven datasets under the same evaluation protocol and obtaining average scores below 70 percent of Flamingo's reported numbers would undermine the replication claim.
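
The test above depends on how "percent of Flamingo performance" is aggregated. One plausible reading, sketched below in Python, divides each replicated score by the reference score on the same dataset and averages the ratios. The dataset names and scores are made-up placeholders rather than results from the paper, and the paper's exact averaging scheme may differ.

```python
from statistics import mean
from typing import Dict


def relative_performance(
    replicated: Dict[str, float], reference: Dict[str, float]
) -> Dict[str, float]:
    """Per-dataset ratio of replicated score to reference score."""
    missing = set(reference) - set(replicated)
    if missing:
        raise ValueError(f"missing replicated scores for: {sorted(missing)}")
    return {name: replicated[name] / reference[name] for name in reference}


# Placeholder numbers for illustration only; not taken from the paper.
reference_scores = {"dataset_a": 80.0, "dataset_b": 50.0, "dataset_c": 60.0}
replicated_scores = {"dataset_a": 72.0, "dataset_b": 40.0, "dataset_c": 51.0}

ratios = relative_performance(replicated_scores, reference_scores)
average_ratio = mean(ratios.values())

# The 70 percent threshold comes from the settling criterion stated above.
undermines_claim = average_ratio < 0.70
print(f"average ratio: {average_ratio:.2%}, undermines claim: {undermines_claim}")
```

Averaging ratios weights every dataset equally regardless of its metric scale, which is why the per-dataset ratios are computed before the mean rather than after.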

Original abstract

We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpenFlamingo, an open-source family of autoregressive vision-language models (3B–9B parameters) intended as a replication of DeepMind’s Flamingo. It reports that these models achieve 80–89% of the corresponding Flamingo performance on average across seven vision-language datasets, provides details on architecture, training data, hyperparameters, and evaluation protocols, and releases models and code at the linked GitHub repository.

Significance. If the relative performance numbers hold under matched evaluation conditions, the work is significant as the first public, reproducible alternative to the closed-source Flamingo models. The explicit release of code, models, and training details directly addresses reproducibility concerns in large-scale vision-language research and lowers the barrier for follow-on work.

major comments (3)

[Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.
[Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
[Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
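
The variance concern in the second major comment can be made concrete with a small sketch: given one score per random seed, report the mean and the sample standard deviation. The seed values and scores below are hypothetical, and `summarize_runs` is a name introduced for this illustration only.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_runs(scores_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    values = list(scores_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(values), stdev(values)


# Hypothetical per-seed scores on a single benchmark.
runs = {0: 71.5, 1: 72.5, 2: 72.0}
avg, spread = summarize_runs(runs)
print(f"{avg:.1f} +/- {spread:.1f} over {len(runs)} seeds")
```

A gap to the reference model that is several times larger than the spread would suggest a stable difference; a gap comparable to the spread would not.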

minor comments (2)

[Abstract] The GitHub URL in the abstract should be accompanied by a permanent archive link (e.g., Zenodo DOI) to guard against repository changes.
[Model Architecture] Notation for model sizes (3B, 9B) should be defined consistently with parameter counts reported in Table 1 or the model architecture section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and clarify our evaluation approach, limitations, and the scope of this work as an open-source replication effort.

Referee: [Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.

Authors: We agree that a side-by-side protocol comparison would improve transparency. While we cannot access the closed-source Flamingo code for exact verification, our evaluation suite was designed to follow the protocols described in the Flamingo paper as closely as possible, including shot counts, prompt templates, and metrics. The released code at https://github.com/mlfoundations/open_flamingo contains the precise evaluation scripts, data loaders, and preprocessing steps used. In the revision we will add an explicit comparison table in the Evaluation section detailing our choices against the Flamingo paper descriptions, along with any unavoidable differences. revision: yes
Referee: [Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.

Authors: We acknowledge that error bars and multi-seed statistics would strengthen the presentation. However, the computational cost of repeated full evaluations on 3B–9B models across seven datasets is prohibitive. We followed the single-run reporting convention common in large-scale vision-language papers and observed consistent relative performance across diverse tasks, which suggests the gap is not driven by outlier variance. In the revision we will add a limitations paragraph noting this constraint and the consistency evidence. revision: partial
Referee: [Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.

Authors: We agree that systematic ablations would be informative. The primary objective of this technical report is to release reproducible models and code that close most of the gap to Flamingo, rather than to perform an exhaustive ablation study that would require orders-of-magnitude more compute. We do describe key hyperparameter choices and data mixtures in the Training section. In the revision we will expand the discussion to highlight the design decisions we found most impactful during development, while noting that a full ablation study remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical replication claims rest on external baselines without internal self-definition or fitted predictions

The paper is a technical report describing an open-source replication of Flamingo models, including architecture, training data, hyperparameters, and an evaluation suite. The headline result (80-89% relative performance on seven datasets) is an empirical measurement against the closed-source Flamingo baseline on public datasets. No equations, derivations, or first-principles results are presented that reduce to the paper's own inputs by construction. There are no self-definitional quantities, no parameters fitted to a subset and then relabeled as predictions, and no load-bearing self-citations. The comparison assumes protocol equivalence, but this is an external validity concern rather than circularity per the enumerated patterns. The derivation chain consists of standard model training and benchmarking steps that are independently verifiable from the released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical engineering report that relies on standard neural-network training practices and public datasets; no new free parameters, axioms, or invented entities are introduced beyond those inherited from the Flamingo architecture.

pith-pipeline@v0.9.0 · 5452 in / 1027 out tokens · 43582 ms · 2026-05-14T01:47:24.782890+00:00 · methodology

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
cs.CV 2026-04 unverdicted novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
cs.CV 2023-10 accept novelty 8.0

MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
cs.RO 2026-05 unverdicted novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
cs.HC 2026-05 unverdicted novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
quant-ph 2026-04 unverdicted novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
cs.CV 2026-04 unverdicted novelty 7.0

Introduces VietPET-RoI dataset with fine-grained RoI annotations for Vietnamese 3D PET/CT and HiRRA graph framework that improves report generation by modeling region dependencies, claiming large gains over prior models.
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework
cs.CV 2026-04 conditional novelty 7.0

Introduces the first large-scale 3D PET/CT dataset with fine-grained RoI annotations for Vietnamese and a graph-enhanced HiRRA framework that achieves SOTA report generation by modeling RoI dependencies.
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
cs.CV 2026-03 unverdicted novelty 7.0

Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG 2026-02 unverdicted novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV 2024-07 unverdicted novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
cs.CV 2024-06 conditional novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
Phantasia: Context-Adaptive Backdoors in Vision Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Phantasia is a new backdoor attack on VLMs that dynamically aligns malicious outputs with input context to achieve higher stealth and state-of-the-art success rates compared to static-pattern attacks.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model
cs.CV 2026-02 unverdicted novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
cs.CV 2025-11 conditional novelty 6.0

ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x spe...
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
cs.HC 2025-06 unverdicted novelty 6.0

UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
cs.LG 2025-05 unverdicted novelty 6.0

Muddit is a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image model with a lightweight text decoder to enable fast parallel generation across text and image mo...
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
cs.AI 2025-04 unverdicted novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
cs.RO 2024-12 accept novelty 6.0

RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
cs.RO 2024-11 unverdicted novelty 6.0

CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
Long Context Transfer from Language to Vision
cs.CV 2024-06 unverdicted novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV 2024-04 unverdicted novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV 2024-04 accept novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
CogVLM: Visual Expert for Pretrained Language Models
cs.CV 2023-11 conditional novelty 6.0

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
Vision-Language Foundation Models as Effective Robot Imitators
cs.RO 2023-11 conditional novelty 6.0

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
cs.CV 2023-05 accept novelty 6.0

OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
cs.CV 2026-05 unverdicted novelty 5.0

LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
Make Your LVLM KV Cache More Lightweight
cs.CV 2026-05 unverdicted novelty 5.0

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
cs.CV 2026-04 unverdicted novelty 5.0

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 5.0

A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...
Online In-Context Distillation for Low-Resource Vision Language Models
cs.CV 2025-10 unverdicted novelty 5.0

Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO 2025-07 unverdicted novelty 5.0

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
World Model on Million-Length Video And Language With Blockwise RingAttention
cs.LG 2024-02 unverdicted novelty 5.0

Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
cs.CL 2023-11 unverdicted novelty 5.0

mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
cs.CV 2026-04 unverdicted novelty 4.0

Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 47 Pith papers · 13 internal anchors

[1]

CM3: A causal masked multimodal model of the internet

Armen Aghajanyan, Po-Yao (Bernie) Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022

work page arXiv 2022
[2]

VQA: Visual question answering

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: Visual question answering. International Journal of Computer Vision, 123:4–31, 2015

work page 2015
[3]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

work page 2022
[4]

Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them

Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them. https://github.com/rom1504/clip-retrieval, 2022

work page 2022
[5]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022

work page internal anchor Pith review arXiv 2022
[7]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[8]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023

work page arXiv 2023
[10]

Multimodal-GPT: A vision and language model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023

work page arXiv 2023
[11]

Danna Gurari, Qing Li, Abigale Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018

work page 2018
[12]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review arXiv 2023
[13]

Perceiver: General perception with iterative attention

Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021

[14]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021

[15]

The hateful memes challenge: Detecting hate speech in multimodal memes

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790, 2020

[16]

Grounding language models to images for multimodal generation

Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023

[17]

Hate-clipper: Multimodal hateful meme classification based on cross-modal interaction of clip features

Gokul Karthik Kumar and Karthik Nandakumar. Hate-clipper: Multimodal hateful meme classification based on cross-modal interaction of clip features. arXiv preprint arXiv:2210.05916, 2022

[18]

Obelisc: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023

[19]

Mimic-it: Multi-modal in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, C. Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023

[20]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

[21]

mplug: Effective and efficient vision-language learning by cross-modal skip-connections

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng da Cao, Ji Zhang, Songfang Huang, Feiran Huang, Jingren Zhou, and Luo Si. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022

[22]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

[23]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

[24]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

[25]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

[26]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3190–3199, 2019

[27]

Introducing mpt-7b: A new standard for open-source, commercially usable llms

MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

[28]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[29]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE international conference on computer vision, pages 2641–2649, 2015

[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[31]

Is a caption worth a thousand images? A controlled study for representation learning

Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a caption worth a thousand images? a controlled study for representation learning. arXiv preprint arXiv:2207.07635, 2022

[32]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

[33]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8309–8318, 2019

[34]

WebDataset: A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch

Thomas Breuel. WebDataset: A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch. Available at: https://github.com/webdataset/webdataset, 2020.

[35]

Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned & chat models

Together.xyz. Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned & chat models. https://www.together.xyz/blog/redpajama-models-v1, 2023

[36]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2014

[37]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

[38]

An empirical study of gpt-3 for few-shot knowledge-based vqa

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

[39]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

[40]

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014

[41]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

[42]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01...

[43]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

[44]

Unified vision-language pre-training for image captioning and vqa

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. arXiv preprint arXiv:1909.11059, 2019

[45]

Multimodal c4: An open, billion-scale corpus of images interleaved with text

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.

Table 8: Fine-tuned state-of-the-art numbers used in this report. Method Dataset Score...

[46]

tell stories, reference real-world entities/events, etc

be creative. tell stories, reference real-world entities/events, etc. The images/sentence can play off each-other in fun ways

[47]

generate sequences that are cool, fun, compelling and require interesting commonsense reasoning across and between images/sentences

be interesting. generate sequences that are cool, fun, compelling and require interesting commonsense reasoning across and between images/sentences

[48]

(Image A, Image B, Sentence 1, Image C, Image D, Sentence 2, Image E, Image F, Sentence 3)

make sure the image descriptions are self-contained, and the output format follows the requested format. user(human authored) Generate a creative, interesting sequence of sentences/images with the following format: (image A, sentence 1, image B, sentence 2, image C, sentence 3) assistant(human authored) Sure! Sequence format: (image A, sentence 1, image B...
