pith. machine review for the scientific record.

arxiv: 2308.01390 · v2 · submitted 2023-08-02 · 💻 cs.CV · cs.AI · cs.LG

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Pith reviewed 2026-05-14 01:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords OpenFlamingo · vision-language models · autoregressive models · open-source replication · multimodal learning · Flamingo · few-shot evaluation · model benchmarks

The pith

OpenFlamingo delivers open-source vision-language models that reach 80-89 percent of Flamingo performance across seven datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenFlamingo as a public family of autoregressive vision-language models sized from 3 billion to 9 billion parameters. These models replicate an existing closed system by training on openly available data and code. They attain between 80 and 89 percent of the original performance when measured on the same seven vision-language benchmarks. Releasing the full training setup, models, and evaluation details lets other researchers run, verify, and build on the work without access to private resources.

Core claim

We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 and 89 percent of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite, and we share the models and code publicly.

What carries the argument

The OpenFlamingo model family, which uses autoregressive next-token prediction over interleaved image and text sequences to support few-shot multimodal reasoning.
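
To make "interleaved image and text sequences" concrete, here is a minimal sketch of how a few-shot prompt might be laid out as text. It assumes the `<image>` and `<|endofchunk|>` special tokens described in the project's README; the helper `build_few_shot_prompt` and the example captions are hypothetical and only illustrate the layout, not the released evaluation code.

```python
from typing import List

IMAGE_TOKEN = "<image>"          # assumed placeholder marking where an image is attended to
END_OF_CHUNK = "<|endofchunk|>"  # assumed separator closing one image-text example


def build_few_shot_prompt(demo_captions: List[str], query_prefix: str) -> str:
    """Lay out captioned demonstrations followed by an open-ended query (hypothetical helper)."""
    demos = "".join(
        f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in demo_captions
    )
    # The final image carries no end-of-chunk marker, so the model continues the text.
    return f"{demos}{IMAGE_TOKEN}{query_prefix}"


prompt = build_few_shot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."],
    "An image of",
)
print(prompt)
```

Each demonstration pairs one image placeholder with its text, and next-token prediction then completes the text after the last placeholder. The actual pixel inputs are passed to the model separately; this sketch covers only the text side.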

If this is right

  • Researchers can now train comparable multimodal models from public resources without starting from proprietary checkpoints.
  • The 80-89 percent performance band shows that core few-shot vision-language capabilities transfer to open training pipelines.
  • Scaling experiments across the 3B to 9B range supply baselines for studying how size affects multimodal task accuracy.
  • The shared evaluation suite enables direct head-to-head comparisons of new open models against the Flamingo reference.
  • Public code and weights allow the community to iterate on data mixtures, training schedules, and architectural tweaks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If further open training closes the remaining gap, the results would indicate that Flamingo's gains come mainly from architecture and scale rather than from exclusive data or methods.
  • The released framework could serve as a template for open replications of other large closed multimodal systems.
  • Extending the same open training recipe to higher parameter counts or additional modalities such as video would test how far the replication approach generalizes.

Load-bearing premise

The reported performance numbers were measured under evaluation conditions comparable to the original Flamingo models and the released code and data suffice for independent reproduction.

What would settle it

Re-running the released code and data on the seven datasets under the same evaluation protocol and obtaining average scores below 70 percent of Flamingo's reported numbers would undermine the replication claim.
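
The check described above can be sketched in a few lines. This assumes one plausible reading of "average relative performance", namely the mean of per-dataset score ratios; the paper may aggregate differently. The function name, dataset names, and all numbers are placeholders for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict


def relative_performance(
    open_scores: Dict[str, float], reference_scores: Dict[str, float]
) -> float:
    """Mean of per-dataset ratios: open model score / reference model score."""
    ratios = [open_scores[name] / reference_scores[name] for name in reference_scores]
    return mean(ratios)


# Placeholder numbers for illustration only; they are NOT results from the paper.
reference = {"dataset_a": 80.0, "dataset_b": 50.0, "dataset_c": 60.0}
replication = {"dataset_a": 68.0, "dataset_b": 42.5, "dataset_c": 48.0}

ratio = relative_performance(replication, reference)
THRESHOLD = 0.70  # the "below 70 percent" bar stated above

print(f"average relative performance: {ratio:.1%}")
if ratio < THRESHOLD:
    print("replication claim undermined")
else:
    print("replication claim survives this check")
```

Averaging ratios rather than raw scores keeps datasets with different metric scales from dominating the comparison, which is why that reading is used here.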

Original abstract

We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpenFlamingo, an open-source family of autoregressive vision-language models (3B–9B parameters) intended as a replication of DeepMind’s Flamingo. It reports that these models achieve 80–89% of the corresponding Flamingo performance on average across seven vision-language datasets, provides details on architecture, training data, hyperparameters, and evaluation protocols, and releases models and code at the linked GitHub repository.

Significance. If the relative performance numbers hold under matched evaluation conditions, the work is significant as the first public, reproducible alternative to the closed-source Flamingo models. The explicit release of code, models, and training details directly addresses reproducibility concerns in large-scale vision-language research and lowers the barrier for follow-on work.

major comments (3)
  1. [Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.
  2. [Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.
  3. [Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.
minor comments (2)
  1. [Abstract] The GitHub URL in the abstract should be accompanied by a permanent archive link (e.g., Zenodo DOI) to guard against repository changes.
  2. [Model Architecture] Notation for model sizes (3B, 9B) should be defined consistently with parameter counts reported in Table 1 or the model architecture section.
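
As an illustration of what major comment 2 asks for, the sketch below summarizes repeated evaluation runs as a mean with a standard error. It is a generic statistics example, not part of the released evaluation suite; the helper `summarize_runs` and the three scores are hypothetical placeholders.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Placeholder scores from three hypothetical seeds; NOT values from the paper.
runs = [74.0, 76.0, 75.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f} ± {sem:.2f} (n={len(runs)})")
```

In few-shot evaluation the seed would typically control which demonstrations are sampled, so the spread across runs indicates how sensitive a reported score is to example selection.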

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and clarify our evaluation approach, limitations, and the scope of this work as an open-source replication effort.

Point-by-point responses
  1. Referee: [Abstract / Evaluation section] Abstract and Evaluation section: the central claim that OpenFlamingo reaches 80–89% of Flamingo performance on seven datasets rests on the unverified assumption that few-shot prompts, example selection, image preprocessing, shot count, and metric computation are identical to those used in the original (closed-source) Flamingo evaluations. The manuscript describes its own suite and shares code, but provides no explicit side-by-side protocol table or verification that would allow independent confirmation of equivalence.

    Authors: We agree that a side-by-side protocol comparison would improve transparency. While we cannot access the closed-source Flamingo code for exact verification, our evaluation suite was designed to follow the protocols described in the Flamingo paper as closely as possible, including shot counts, prompt templates, and metrics. The released code at https://github.com/mlfoundations/open_flamingo contains the precise evaluation scripts, data loaders, and preprocessing steps used. In the revision we will add an explicit comparison table in the Evaluation section detailing our choices against the Flamingo paper descriptions, along with any unavoidable differences. revision: yes

  2. Referee: [Evaluation section] Evaluation section: reported performance figures lack error bars, multiple random seeds, or statistical comparisons against the Flamingo baselines. Without these, it is impossible to determine whether the observed 80–89% range reflects a stable gap or is sensitive to evaluation variance.

    Authors: We acknowledge that error bars and multi-seed statistics would strengthen the presentation. However, the computational cost of repeated full evaluations on 3B–9B models across seven datasets is prohibitive. We followed the single-run reporting convention common in large-scale vision-language papers and observed consistent relative performance across diverse tasks, which suggests the gap is not driven by outlier variance. In the revision we will add a limitations paragraph noting this constraint and the consistency evidence. revision: partial

  3. Referee: [Training section] Training and ablation discussion: the manuscript describes hyperparameters and data but contains no systematic ablations (e.g., effect of vision encoder choice, cross-attention layers, or data mixture ratios). This omission makes it difficult to isolate which design decisions are responsible for closing most of the gap to Flamingo.

    Authors: We agree that systematic ablations would be informative. The primary objective of this technical report is to release reproducible models and code that close most of the gap to Flamingo, rather than to perform an exhaustive ablation study that would require orders-of-magnitude more compute. We do describe key hyperparameter choices and data mixtures in the Training section. In the revision we will expand the discussion to highlight the design decisions we found most impactful during development, while noting that a full ablation study remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical replication claims rest on external baselines without internal self-definition or fitted predictions.

Full rationale

The paper is a technical report describing an open-source replication of Flamingo models, including architecture, training data, hyperparameters, and an evaluation suite. The headline result (80-89% relative performance on seven datasets) is an empirical measurement against the closed-source Flamingo baseline on public datasets. No equations, derivations, or first-principles results are presented that reduce to the paper's own inputs by construction. There are no self-definitional quantities, no parameters fitted to a subset and then relabeled as predictions, and no load-bearing self-citations. The comparison assumes protocol equivalence, but this is an external validity concern rather than circularity per the enumerated patterns. The derivation chain consists of standard model training and benchmarking steps that are independently verifiable from the released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical engineering report that relies on standard neural-network training practices and public datasets; no new free parameters, axioms, or invented entities are introduced beyond those inherited from the Flamingo architecture.

pith-pipeline@v0.9.0 · 5452 in / 1027 out tokens · 43582 ms · 2026-05-14T01:47:24.782890+00:00 · methodology

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  4. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    cs.CV 2023-10 accept novelty 8.0

    MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.

  5. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

    cs.RO 2026-05 unverdicted novelty 7.0

    BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

  6. AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

    cs.HC 2026-05 unverdicted novelty 7.0

    AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...

  7. QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

    quant-ph 2026-04 unverdicted novelty 7.0

    Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

  8. Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces VietPET-RoI dataset with fine-grained RoI annotations for Vietnamese 3D PET/CT and HiRRA graph framework that improves report generation by modeling region dependencies, claiming large gains over prior models.

  9. Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

  10. Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

  11. When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

  12. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  13. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  14. Phantasia: Context-Adaptive Backdoors in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Phantasia is a new backdoor attack on VLMs that dynamically aligns malicious outputs with input context to achieve higher stealth and state-of-the-art success rates compared to static-pattern attacks.

  15. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  16. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  17. Long Context Transfer from Language to Vision

    cs.CV 2024-06 unverdicted novelty 6.0

    Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

  18. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  19. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  20. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  21. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  22. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  23. A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...

  24. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  25. Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks

    cs.CV 2026-04 unverdicted novelty 4.0

    Robust CLIP models amplify vulnerabilities to natural adversarial scenarios while standard CLIP shows large performance drops on natural language-induced adversarial examples in zero-shot classification, segmentation,...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 25 Pith papers · 11 internal anchors

  1. [1]

    arXiv preprint arXiv:2201.07520 , year=

    Armen Aghajanyan, Po-Yao (Bernie) Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Na- man Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520 , 2022

  2. [2]

    Lawrence Zitnick, Devi Parikh, and Dhruv Batra

    Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Vqa: Visual question 10 answering. International Journal of Computer Vision, 123:4–31, 2015

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35: 23716–23736, 2022

  4. [4]

    Clip retrieval: Easily compute clip embeddings and build a clip re- trieval system with them

    Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip re- trieval system with them. https://github.com/ rom1504/clip-retrieval, 2022

  5. [5]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the oppor- tunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  6. [6]

    Pali: A jointly-scaled mul- tilingual language-image model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Se- bastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled mul- tilingual language-image model. arXiv preprint arXiv:2209.06794, 2022

  7. [7]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ishna Vedantam, Saurabh Gupta, Piotr Doll´ ar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  8. [8]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Haus- man, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...

  9. [9]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108 , 2023

  10. [10]

    arXiv preprint arXiv:2305.04790 , year=

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023

  11. [11]

    Danna Gurari, Qing Li, Abigale Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: An- swering visual questions from blind people. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018

  12. [12]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  13. [13]

    Perceiver: General perception with iterative at- tention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative at- tention. In International conference on machine learning, pages 4651–4664. PMLR, 2021

  14. [14]

    Scaling up vi- sual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up vi- sual and vision-language representation learning with noisy text supervision. In International Con- ference on Machine Learning, pages 4904–4916. PMLR, 2021

  15. [15]

    The hateful memes challenge: Detecting hate speech in multi- modal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multi- modal memes. arXiv preprint arXiv:2005.04790 , 2020

  16. [16]

    Grounding language models to images for multimodal generation

    Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023

  17. [17]

    Hate-clipper: Multimodal hateful meme classifi- cation based on cross-modal interaction of clip features

    Gokul Karthik Kumar and Karthik Nandakumar. Hate-clipper: Multimodal hateful meme classifi- cation based on cross-modal interaction of clip features. arXiv preprint arXiv:2210.05916 , 2022

  18. [18]

    Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh

    Hugo Lauren¸ con, Lucile Saulnier, L´ eo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, 11 Thomas Wang, Siddharth Karamcheti, Alexan- der M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelisc: An open web-scale fil- tered dataset of interleaved image-text docu- ments. arXiv preprint arXiv:2306.16527 , 2023

  19. [19]

    Mimic-it: Multi-modal in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, C. Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruc- tion tuning. arXiv preprint arXiv:2306.05425 , 2023

  20. [20]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 , 2023

  21. [21]

    mplug: Effective and efficient vision-language learning by cross-modal skip-connections

    Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng da Cao, Ji Zhang, Songfang Huang, Feiran Huang, Jingren Zhou, and Luo Si. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022

  22. [22]

    Blip: Bootstrapping language-image pre-training for unified vision-language under- standing and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language under- standing and generation. In International Con- ference on Machine Learning, pages 12888–12900. PMLR, 2022

  23. [23]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  24. [24]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Pi- otr Doll´ ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Pro- ceedings, Part V 13 , pages 740–755. Springer, 2014

  25. [25]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  26. [26]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3190–3199, 2019

  27. [27]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

    MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

  28. [28]

    GPT-4 Technical Report

    R OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  29. [29]

    Flickr30k entities: Collect- ing region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cer- vantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collect- ing region-to-phrase correspondences for richer image-to-sentence models. In IEEE international conference on computer vision, pages 2641–2649, 2015

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

  31. [31]

    Is a caption worth a thousand images? a controlled study for representation learning

    Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a caption worth a thousand images? a controlled study for representation learning. arXiv preprint arXiv:2207.07635, 2022

  32. [32]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  33. [33]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 8309–8318, 2019

  34. [34]

    WebDataset: A high- performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch

    Thomas Breuel. WebDataset: A high- performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch. Available at: https: //github.com/webdataset/webdataset, 2020. 12

  35. [35]

    Releasing 3b and 7b redpajama- incite family of models including base, instruction-tuned & chat models

    Together.xyz. Releasing 3b and 7b redpajama- incite family of models including base, instruction-tuned & chat models. https://www. together.xyz/blog/redpajama-models-v1, 2023

  36. [36]

    Lawrence Zitnick, and Devi Parikh

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition , pages 4566–4575, 2014

  37. [37]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  38. [38]

    An empirical study of gpt-3 for few-shot knowledge-based vqa

    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

  39. [39]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  40. [40]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

    Peter Young, Alice Lai, Micah Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014

  41. [41]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  42. [42]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01...

  43. [43]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

  44. [44]

    Unified vision-language pre-training for image captioning and vqa

    Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. arXiv preprint arXiv:1909.11059, 2019

  45. [45]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023

    Table 8: Fine-tuned state-of-the-art numbers used in this report. Method Dataset Score...

    be creative. tell stories, reference real-world entities/events, etc. The images/sentence can play off each-other in fun ways

    be interesting. generate sequences that are cool, fun, compelling and require interesting commonsense reasoning across and between images/sentences

    (Image A, Image B, Sentence 1, Image C, Image D, Sentence 2, Image E, Image F, Sentence 3)

    make sure the image descriptions are self-contained, and the output format follows the requested format.

    user(human authored) Generate a creative, interesting sequence of sentences/images with the following format: (image A, sentence 1, image B, sentence 2, image C, sentence 3)

    assistant(human authored) Sure! Sequence format: (image A, sentence 1, image B...