LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Aojun Zhou; Conghui He; Hongsheng Li; Jiaming Han; Pan Lu; Peng Gao; Renrui Zhang; Shijie Geng; Wei Zhang; Xiangyu Yue

arXiv: 2304.15010 · v1 · pith:C6RUBASQnew · submitted 2023-04-28 · cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

Pith reviewed 2026-05-15 08:36 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

keywords LLaMA-Adapter · parameter-efficient tuning · visual instruction following · multi-modal LLM · early fusion · joint training · instruction tuning

The pith

LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt large language models for multi-modal reasoning without full retraining. It augments the original LLaMA-Adapter by unlocking extra learnable parameters such as norms, biases, and scales so instruction following spreads through the whole model. Visual tokens are fed only into early layers via an early-fusion step, and image-text alignment and instruction data are trained jointly on separate parameter groups to stop one task from harming the other. Expert models for captioning or OCR can be added at inference time with no added training cost. The outcome is stronger open-ended visual instruction performance and even better language-only chat ability than the first version, all while using modest data scales.

Core claim

LLaMA-Adapter V2 shows that three changes together produce open-ended multi-modal instruction following with only 14 million parameters added to LLaMA: unlocking additional parameters across the base model, feeding visual tokens into the early layers only, and training on image-text pairs and instruction data with disjoint parameter sets.
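The "unlocked bias and scale" idea can be made concrete with a small sketch. It assumes one plausible form, y = scale * (W x + bias), where W stays frozen, bias starts at zero and scale starts at one; this summary does not spell out the exact form, so the class name `ScaledBiasedLinear` and the formula are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScaledBiasedLinear:
    """Frozen weight matrix plus a trainable per-output bias and scale.

    Assumed form (hypothetical): y = scale * (W @ x + bias).
    bias starts at 0 and scale at 1, so the layer initially behaves
    exactly like the frozen layer it wraps.
    """

    weight: List[List[float]]                         # frozen, shape (out, in)
    bias: List[float] = field(default_factory=list)   # trainable
    scale: List[float] = field(default_factory=list)  # trainable

    def __post_init__(self) -> None:
        out_features = len(self.weight)
        if not self.bias:
            self.bias = [0.0] * out_features
        if not self.scale:
            self.scale = [1.0] * out_features

    def forward(self, x: List[float]) -> List[float]:
        out = []
        for row, b, s in zip(self.weight, self.bias, self.scale):
            pre = sum(w * xi for w, xi in zip(row, x))
            out.append(s * (pre + b))
        return out

    def trainable_count(self) -> int:
        # Only bias and scale would receive gradient updates.
        return len(self.bias) + len(self.scale)

    def frozen_count(self) -> int:
        return sum(len(row) for row in self.weight)


layer = ScaledBiasedLinear(weight=[[1.0, 2.0], [3.0, 4.0]])
print(layer.forward([1.0, 1.0]))                      # same as the frozen layer at init
print(layer.trainable_count(), layer.frozen_count())
```

The point of the design is visible in the two counts: the trainable part grows with the number of output features, while the frozen part grows with the full weight matrix.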

What carries the argument

Early fusion of visual tokens into initial LLM layers together with disjoint-parameter joint training on image-text alignment and instruction data.
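A rough sketch of the early-fusion schedule follows. It only shows which layers see the visual tokens; how the tokens are actually combined with the text stream (prefix, addition, gated attention) is not specified in this summary, so the prefix used here, the function name `build_layer_inputs`, and the parameter `num_early_layers` are all hypothetical.

```python
from typing import List, Sequence


def build_layer_inputs(
    text_tokens: Sequence[str],
    visual_tokens: Sequence[str],
    num_layers: int,
    num_early_layers: int,
) -> List[List[str]]:
    """Return, for each layer, the token sequence that layer would see.

    Visual tokens are prepended only for the first `num_early_layers`
    layers; later layers see text tokens alone.
    """
    if not 0 <= num_early_layers <= num_layers:
        raise ValueError("num_early_layers must lie in [0, num_layers]")
    plans = []
    for layer_idx in range(num_layers):
        if layer_idx < num_early_layers:
            plans.append(list(visual_tokens) + list(text_tokens))
        else:
            plans.append(list(text_tokens))
    return plans


plans = build_layer_inputs(["t0", "t1"], ["v0"], num_layers=4, num_early_layers=1)
for idx, tokens in enumerate(plans):
    print(idx, tokens)
```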

If this is right

Open-ended multi-modal instructions become possible with far fewer added parameters than full fine-tuning.
Language-only instruction following improves as a side effect of the same training.
Expert vision modules can be swapped in at test time without retraining the core model.
Small-scale image-text and instruction datasets suffice for competitive multi-modal reasoning.
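The test-time expert swap in the list above can be pictured as plain prompt assembly: each expert turns the image into text, and that text is placed in front of the instruction. The sketch below is an assumption about how this could look; the prompt template, the `build_prompt` helper, and the stub captioner are hypothetical and stand in for real captioning or OCR systems.

```python
from typing import Callable, Dict


def build_prompt(
    instruction: str,
    image_path: str,
    experts: Dict[str, Callable[[str], str]],
) -> str:
    """Prepend the text output of each expert to the instruction.

    Experts are plain callables (image path -> text), so swapping one
    for another touches no trained parameter.
    """
    context_lines = []
    for name, expert in experts.items():
        text = expert(image_path).strip()
        if text:
            context_lines.append(f"[{name}] {text}")
    if context_lines:
        context = "\n".join(context_lines)
        return f"Context:\n{context}\n\nInstruction: {instruction}"
    return f"Instruction: {instruction}"


def stub_captioner(image_path: str) -> str:
    # Placeholder for a real captioning model; returns a fixed string.
    return "a dog on a beach"


print(build_prompt("What is the animal doing?", "photo.jpg", {"caption": stub_captioner}))
```

Because the experts only contribute text, replacing the stub with a stronger captioner or adding an OCR callable changes the prompt but not the model.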

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unlocked-parameter and early-fusion pattern may transfer directly to other base language models beyond LLaMA.
Scaling the base model size could reduce the relative parameter overhead even further while preserving the same training separation.
The approach suggests a route to multi-modal capability that avoids the need for massive paired training corpora.
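The scaling remark above is simple arithmetic, sketched here. The 14M figure comes from the abstract; the nominal base sizes are assumptions chosen for illustration, and holding the added count fixed across sizes is also an assumption, since per-layer norm, bias, and scale parameters would normally grow with model width and depth.

```python
ADDED_PARAMS = 14_000_000  # from the abstract: "merely introducing 14M parameters"

# Nominal base-model sizes, assumed for illustration only.
NOMINAL_BASE_SIZES = {
    "7B": 7_000_000_000,
    "13B": 13_000_000_000,
    "65B": 65_000_000_000,
}


def overhead_percent(added: int, base: int) -> float:
    """Added parameters as a percentage of the frozen base model."""
    return 100.0 * added / base


for name, base in NOMINAL_BASE_SIZES.items():
    print(f"{name}: {overhead_percent(ADDED_PARAMS, base):.3f}% added")
```

At a nominal 7B base the overhead is about 0.2 percent, and it shrinks as the base grows only to the extent that the added count really stays near 14M.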

Load-bearing premise

Early fusion and disjoint parameter groups will keep preventing interference between alignment and instruction tasks even when the data distribution shifts or larger base models are used.
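What "disjoint parameter groups" means operationally can be sketched with scalar stand-ins for parameters: each batch type updates only its own group, so a gradient from one task never moves the other task's weights. The group membership and parameter names below are hypothetical; this summary does not list which parameters belong to which group.

```python
from typing import Dict

# Hypothetical assignment of parameter names to tasks.
PARAM_GROUPS: Dict[str, frozenset] = {
    "image_text": frozenset({"visual_projection", "early_gate"}),
    "instruction": frozenset({"late_adapter", "norm", "bias", "scale"}),
}
# The whole scheme rests on the two groups sharing no parameter.
assert PARAM_GROUPS["image_text"].isdisjoint(PARAM_GROUPS["instruction"])


def sgd_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    batch_kind: str,
    lr: float = 0.1,
) -> Dict[str, float]:
    """Apply a plain SGD update to the group matching `batch_kind` only."""
    active = PARAM_GROUPS[batch_kind]
    updated = {}
    for name, value in params.items():
        if name in active:
            updated[name] = value - lr * grads.get(name, 0.0)
        else:
            updated[name] = value  # other task's parameters stay untouched
    return updated


params = {name: 1.0 for group in PARAM_GROUPS.values() for name in group}
grads = {name: 1.0 for name in params}
params = sgd_step(params, grads, "image_text")
print(params)
```

After an image-text step only `visual_projection` and `early_gate` have moved. Whether that separation is what prevents interference in practice is exactly the premise stated above.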

What would settle it

Performance collapse on a new open-ended visual instruction benchmark that shows clear task interference between image-text alignment and instruction following would disprove the central claim.

Original abstract

How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLaMA-Adapter V2 adds early visual fusion, unlocked norms/biases/scales, and disjoint-parameter joint training to the original adapter, delivering open-ended multimodal instructions with 14M extra parameters, but the interference-reduction claim lacks a direct ablation.

The three changes are straightforward engineering moves that keep the added cost low while spreading instruction ability across the base model. Early fusion gets visual tokens in sooner, which is a reasonable way to improve incorporation without extra parameters. The code release lets others verify the numbers directly, and the reported lift on language-only instructions is a useful side observation. These details make the work easy to adopt for groups already running adapter-style fine-tuning.

The soft spot sits in the joint-training setup. The abstract claims the disjoint parameter groups reduce interference between image-text alignment and instruction following, yet no ablation compares shared versus separate optimization on the same mixture, and no diagnostic such as gradient overlap or per-task curves is shown. Without that measurement it is hard to credit the split specifically rather than the other two modifications. The assumption that the split will continue to prevent task conflict under shifted data or larger base models is therefore the least supported part.

This paper is for teams that want a low-parameter recipe to add vision to LLaMA-style models. Readers already working with adapters will get concrete implementation details they can test quickly. I would send it to peer review because the modifications are clearly described, the claims are modest and testable with the released code, and the practical angle is relevant even if one ablation is missing.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLaMA-Adapter V2 as a parameter-efficient extension of the original LLaMA-Adapter for visual instruction following. It augments the base model by unlocking additional learnable parameters (norms, biases, scales) across LLaMA layers, introduces an early-fusion strategy that injects visual tokens only into early LLM layers, proposes joint training on image-text pairs and instruction data using disjoint groups of learnable parameters to reduce task interference, and incorporates external expert models (captioning/OCR) at inference time. The central claim is that these changes enable open-ended multi-modal instruction following with only 14M added parameters over LLaMA while also strengthening language-only capabilities.

Significance. If the empirical results hold, the work demonstrates a lightweight route to multi-modal instruction models that avoids full fine-tuning of large LLMs. The combination of parameter unlocking, early fusion, and disjoint-parameter joint training addresses a practical bottleneck in scaling visual instruction tuning, and the inference-time expert integration provides a zero-training-cost way to boost image understanding. These elements could influence subsequent adapter-based multi-modal systems if the interference-reduction mechanism is shown to generalize.

major comments (2)

[Abstract and §3] Joint training description: the claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture, nor any diagnostic (gradient similarity, per-task loss curves, or performance delta), is reported. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.
[§4] Experimental results: the headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; the absence of error bars, statistical significance tests, or per-task breakdowns makes it impossible to assess whether the reported gains are robust or driven by particular data subsets.
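One of the diagnostics named in the first major comment, gradient similarity, is cheap to compute once per-task gradients over any shared parameters are available as flat vectors. The sketch below shows only the cosine-similarity step; how the gradients would be collected is left out, and the example vectors are placeholders, not measurements from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors.

    Values near +1 suggest the two tasks pull shared parameters the same
    way; values near -1 suggest they conflict. Returns 0.0 when either
    vector is all zeros.
    """
    if len(a) != len(b):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for per-task gradients.
grad_alignment = [0.5, -1.0, 0.25]
grad_instruction = [-0.5, 1.0, -0.25]
print(cosine_similarity(grad_alignment, grad_instruction))
```

Tracking this value over training for a shared-parameter baseline, and comparing final task metrics against the disjoint setup, would be one way to supply the missing measurement.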

minor comments (2)

[§2] The exact count and placement of the unlocked norm/bias/scale parameters should be stated explicitly (e.g., which layers and which modules) rather than left as 'e.g., norm, bias and scale'.
[Figures] Figure captions and axis labels in the (referenced) ablation and comparison figures should include the precise training data sizes and hyper-parameter settings used for each variant to allow direct reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of our contributions.

Referee: [Abstract and §3] Joint training description: the claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture, nor any diagnostic (gradient similarity, per-task loss curves, or performance delta), is reported. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.

Authors: We agree that an explicit ablation isolating the disjoint-parameter strategy would strengthen the claim. Our joint training results demonstrate that optimizing disjoint parameter groups on image-text pairs and instruction data yields strong multi-modal performance with only 14M added parameters, outperforming the original LLaMA-Adapter. In the revision we will add an ablation comparing shared versus disjoint optimization on the same data mixture, including performance deltas and task-specific metrics to isolate the interference-reduction effect.

Revision: yes

Referee: [§4] Experimental results: the headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; the absence of error bars, statistical significance tests, or per-task breakdowns makes it impossible to assess whether the reported gains are robust or driven by particular data subsets.

Authors: The full manuscript contains Tables 1-4 with quantitative comparisons on VQA, captioning, and instruction-following benchmarks. We acknowledge that error bars, significance tests, and per-task breakdowns would improve assessment of robustness. In the revision we will add error bars for multi-seed experiments, report statistical significance where applicable, and include expanded per-task breakdowns in the main text or supplementary material.

Revision: partial
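The multi-seed error bars promised in the second response amount to a mean with a spread estimate per benchmark. A minimal sketch follows, using the sample standard deviation and the standard error of the mean; the scores are made-up placeholders, not results from the paper, and the helper name `summarize` is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)        # sample (n - 1) standard deviation
    sem = std / math.sqrt(len(scores))
    return mean, std, sem


# Placeholder scores from three hypothetical seeds; not real measurements.
placeholder_scores = [70.0, 72.0, 74.0]
mean, std, sem = summarize(placeholder_scores)
print(f"{mean:.1f} ± {std:.1f} (SEM {sem:.2f}, n={len(placeholder_scores)})")
```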

Circularity Check

0 steps flagged

No circularity: empirical architecture and training choices with external benchmarks

The paper describes three design choices (unlocking norm/bias/scale parameters, early visual fusion, and disjoint-parameter joint training on image-text plus instruction data) followed by empirical evaluation against the prior LLaMA-Adapter and external models. No equations, first-principles derivations, or predictions appear in the provided text. The 14M-parameter claim is a direct count of added learnable weights, not a fitted or self-referential quantity. Self-citation to the original LLaMA-Adapter is present but functions only as a baseline comparison, not as load-bearing justification for any uniqueness theorem or ansatz. The joint-training interference claim is asserted without an internal ablation, but this is an evidence gap rather than circularity; the reported results remain falsifiable against held-out benchmarks. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard transformer architecture and training assumptions already present in the LLaMA literature; no new physical or mathematical entities are postulated.

axioms (2)

standard math: Transformer layers with standard attention and feed-forward blocks can be adapted via low-rank or partial-parameter updates while preserving core language capabilities.
Invoked implicitly when the authors unlock norm, bias, and scale parameters across the full model.
domain assumption: Visual tokens can be treated as additional sequence elements that the language model can attend to without architectural redesign.
Underlying the early-fusion strategy.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks"

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "we propose an early fusion strategy to feed visual tokens only into the early LLM layers"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
cs.CV 2026-05 unverdicted novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
Selective LoRA for Visual Tokens and Attention Heads
cs.CV 2025-12 unverdicted novelty 7.0

Image-LoRA selectively adapts only visual tokens and chosen attention heads in VLMs, matching standard LoRA performance with lower parameter count and FLOPs.
Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
cs.CV 2025-03 unverdicted novelty 7.0

Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
cs.AI 2024-07 accept novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
cs.CV 2024-06 unverdicted novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
cs.CV 2023-12 conditional novelty 7.0

Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Multimodal LLMs under Pairwise Modalities
cs.CV 2026-05 unverdicted novelty 6.0

A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
Latent Denoising Improves Visual Alignment in Large Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
cs.RO 2025-12 conditional novelty 6.0

ImagineNav++ achieves SOTA mapless visual navigation by prompting VLMs to select imagined future views generated from a human-preference-distilled module and maintained via selective foveation memory.
GD-FPS: Growth-Driven Feedforward Parameter Selection for Efficient Fine-Tuning
cs.CV 2025-10 unverdicted novelty 6.0

GD-FPS is a gradient-free, forward-pass-only parameter selection method for PEFT that identifies important weights by scaling magnitudes with relative activation growth against a pre-training anchor, matching or beati...
Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation
cs.CV 2025-09 unverdicted novelty 6.0

Introduces progressive task-specific multi-task adaptation for vision transformers, sharing adapters early and specializing later with gradient-based task allocation, outperforming prior methods on PASCAL and NYUD-v2 ...
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
cs.CV 2025-03 conditional novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
cs.CV 2025-03 unverdicted novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
E5-V: Universal Embeddings with Multimodal Large Language Models
cs.CL 2024-07 unverdicted novelty 6.0

E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.
An Embodied Generalist Agent in 3D World
cs.CV 2023-11 unverdicted novelty 6.0

LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
cs.CV 2023-08 unverdicted novelty 6.0

IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Large Language Models are not Fair Evaluators
cs.CL 2023-05 conditional novelty 6.0

LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better matc...
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization
cs.RO 2026-04 unverdicted novelty 5.0

Adaptor uses few-shot learning with trajectory perturbation and vision-language conditioning to achieve robust cross-operator intent recognition and higher success rates in assistive teleoperation.
Grounding Everything in Tokens for Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 5.0

GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
cs.CL 2025-12 unverdicted novelty 5.0

PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
cs.CV 2025-01 unverdicted novelty 5.0

LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
cs.CV 2024-01 unverdicted novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
cs.CL 2023-11 unverdicted novelty 5.0

mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets
cs.LG 2023-08 unverdicted novelty 5.0

MM-LIMA uses proposed quality metrics and a trainable selector to pick 200 high-quality multimodal instruction examples and outperforms MiniGPT-4 on evaluations.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
cs.AI 2024-08 unverdicted novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
cs.LG 2024-03 accept novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
cs.CV 2023-09 conditional novelty 4.0

InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.
A Survey on Hallucination in Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
A Comprehensive Overview of Large Language Models
cs.CL 2023-07 unverdicted novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 46 Pith papers · 19 internal anchors

[1]

https://sharegpt.com/

Sharegpt: Share your wildest chatgpt conversations with one click. https://sharegpt.com/. 1, 3, 7, 8

work page
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,

work page
[3]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 6077–6086, 2018. 3

work page 2018
[4]

Lan- guage models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural in- formation processing systems, 33:1877–1901, 2020. 3

work page 1901
[5]

Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 3, 8

work page 2021
[6]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 3, 4, 6, 7, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality, March 2023. 1, 3, 8
[8]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3
[9]

Meshed-memory transformer for image captioning

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10578–10587,
[10]

Using lora for efficient stable diffusion fine-tuning

Pedro Cuenca and Sayak Paul. Using lora for efficient stable diffusion fine-tuning. https://huggingface.co/blog/lora, January 2023. 3
[11]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019. 3
[12]

Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904, 2022. 3
[13]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3
[14]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 4
[15]

Clip-adapter: Better vision-language models with feature adapters

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021. 3

Figure 10. A Chatting Example using 65B LLaMA-Adapter V2.
[16]

Dynamic fusion with intra-and inter-modality attention flow for visual question answering

Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6639–6648, 2019. 3
[17]

Making pre-trained language models better few-shot learners

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In ACL-IJCNLP, 2021. 3
[18]

OpenAGI: When LLM Meets Domain Experts

Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370,
[19]

Koala: A dialogue model for academic research

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. 1
[20]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3
[21]

Visual programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022. 4
[22]

Towards a unified view of parameter-efficient transfer learning

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022. 3

Figure 11. A Chatting Example using 7B LLaMA-Adapter V2.
[23]

A comprehensive survey of deep learning for image captioning

MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1–36, 2019. 3
[24]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. 3
[25]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 5
[26]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 3
[27]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Lea...
[28]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 709–727. Springer, 2022. 3, 5
[29]

Vilt: Vision-and-language transformer without convolution or region supervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5583–5594, 2021. 3
[30]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 3, 8
[31]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP,
[32]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 3, 8

[33]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 8
[34]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. 3
[35]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021. 3
[36]

Scaling & shifting your features: A new baseline for efficient model tuning

Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. arXiv preprint arXiv:2210.08823,
[37]

Text2motion: From natural language instructions to feasible plans

Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, and Jeannette Bohg. Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023. 4
[38]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

[39]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS), pages 13–23, 2019. 3
[40]

Knowing when to look: Adaptive attention via a visual sentinel for image captioning

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017. 3
[41]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 3, 4

[42]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842,
[43]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. 3
[44]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 10, 11
[45]

Film: Following instructions in language with modular methods

So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. ArXiv, abs/2110.07342, 2021. 3
[46]

Clipcap: Clip prefix for image captioning

Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021. 8
[47]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,

[48]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 1, 3
[49]

Peft: State-of-the-art parameter-efficient fine-tuning methods

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022. 3
[50]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 1, 3, 5, 7, 8

[51]

Tool Learning with Foundation Models

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023. 4

[52]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4
[53]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3
[54]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 3
[55]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 3
[56]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3, 8
[57]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3, 8
[58]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023. 4

[59]

Lst: Ladder side-tuning for parameter and memory efficient transfer learning

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. arXiv preprint arXiv:2206.06522, 2022. 3
[60]

Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237,
[61]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 4
[62]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1, 3
[63]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 3, 4
[64]

Show and tell: Lessons learned from the 2015 mscoco image captioning challenge

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663,
[65]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022. 3

[66]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. 3
[67]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 4

[68]

Baize: An open-source chat model with parameter-efficient tuning on self-chat data

Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196,
[69]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 4

[70]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199,
[71]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022. 3
[72]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,
[73]

Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, and Peng Gao. Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR 2023, 2023. 3
[74]

Tip-adapter: Training-free adaption of clip for few-shot classification

Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, 2022. 3
[75]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1
[76]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 3
[77]

Unified vision-language pre-training for image captioning and vqa

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13041–13049, 2020. 3
[78]

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. 2023. 2, 3, 6, 7
[79]

Pointclip v2: Adapting clip for powerful 3d open-world learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyao Zeng, Shanghang Zhang, and Peng Gao. Pointclip v2: Adapting clip for powerful 3d open-world learning. arXiv preprint arXiv:2211.11682, 2022. 4

[1] [1]

https://sharegpt.com/

Sharegpt: Share your wildest chatgpt conversations with one click. https://sharegpt.com/. 1, 3, 7, 8

work page

[2] [2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,

work page

[3] [3]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 6077–6086, 2018. 3

work page 2018

[4] [4]

Lan- guage models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural in- formation processing systems, 33:1877–1901, 2020. 3

work page 1901

[5] [5]

Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 3, 8

work page 2021

[6] [6]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 3, 4, 6, 7, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality, March 2023. 1, 3, 8

work page 2023

[8] [8]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-ﬁnetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[9] [9]

Meshed-memory transformer for image cap- tioning

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image cap- tioning. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 10578–10587,

work page

[10] [10]

Using lora for efﬁcient sta- ble diffusion ﬁne-tuning

Pedro Cuenca and Sayak Paul. Using lora for efﬁcient sta- ble diffusion ﬁne-tuning. https://huggingface.co/ blog/lora, January 2023. 3

work page 2023

[11] [11]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171–4186, 2019. 3

work page 2019

[12] [12]

Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models.arXiv preprint arXiv:2203.06904, 2022

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zong- han Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi- Min Chan, Weize Chen, et al. Delta tuning: A comprehen- sive study of parameter efﬁcient methods for pre-trained lan- guage models. arXiv preprint arXiv:2203.06904, 2022. 3

work page arXiv 2022

[13] [13]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representa- tions, 2021. 3

work page 2021

[14] [14]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm- e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Clip-adapter: Better vision-language models with feature adapters,

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 11 Figure 10. A Chatting Example using 65B LLaMA-Adapter V2. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021. 3

work page arXiv 2021

[16] [16]

Dy- namic fusion with intra-and inter-modality attention ﬂow for visual question answering

Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dy- namic fusion with intra-and inter-modality attention ﬂow for visual question answering. In The IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 6639– 6648, 2019. 3

work page 2019

[17] [17]

Making pre- trained language models better few-shot learners

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre- trained language models better few-shot learners. In ACL- IJCNLP, 2021. 3

work page 2021

[18] [18]

OpenAGI: When LLM Meets Domain Experts,

Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370 ,

work page arXiv

[19] [19]

Koala: A di- alogue model for academic research

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A di- alogue model for academic research. Blog post, April 2023. 1

work page 2023

[20] [20]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3

work page 2017

[21] [21]

Visual programming: Compositional visual reason- ing without training

Tanmay Gupta and Aniruddha Kembhavi. Visual pro- gramming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022. 4

work page arXiv 2022

[22] [22]

Towards a uniﬁed view of parameter-efﬁcient transfer learning

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg- Kirkpatrick, and Graham Neubig. Towards a uniﬁed view of parameter-efﬁcient transfer learning. In International Con- 12 Figure 11. A Chatting Example using 7B LLaMA-Adapter V2. ference on Learning Representations, 2022. 3

work page 2022

[23] [23]

A comprehensive survey of deep learn- ing for image captioning

MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratud- din, and Hamid Laga. A comprehensive survey of deep learn- ing for image captioning. ACM Computing Surveys (CsUR), 51(6):1–36, 2019. 3

work page 2019

[24] [24]

Parameter-efﬁcient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efﬁcient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. 3

work page 2019

[25] [25]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 5

work page internal anchor Pith review Pith/arXiv arXiv 2021

[26] [26]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations , 2022. 3

work page 2022

[27] [27]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Lea...

work page 2022

[28] [28]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 709–727. Springer, 2022. 3, 5

work page 2022

[29] [29]

Vilt: Vision-and-language transformer without convolution or region supervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5583–5594, 2021. 3

work page 2021

[30] [30]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 3, 8

work page 2017

[31] [31]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP,

work page

[32] [32]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 8

work page 2022

[34] [34]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. 3

work page internal anchor Pith review arXiv 1908

[35] [35]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021. 3

work page 2021

[36] [36]

Scaling & shifting your features: A new baseline for efficient model tuning

Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. arXiv preprint arXiv:2210.08823,

work page arXiv

[37] [37]

Text2motion: From natural language instructions to feasible plans

Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, and Jeannette Bohg. Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023. 4

work page arXiv 2023

[38] [38]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS), pages 13–23, 2019. 3

work page 2019

[40] [40]

Knowing when to look: Adaptive attention via a visual sentinel for image captioning

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017. 3

work page 2017

[41] [41]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 3, 4

work page 2022

[42] [42]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842,

work page arXiv

[43] [43]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. 3

work page 2021

[44] [44]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 10, 11

work page 2021

[45] [45]

Film: Following instructions in language with modular methods

So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. ArXiv, abs/2110.07342, 2021. 3

work page arXiv 2021

[46] [46]

Clipcap: Clip prefix for image captioning

Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021. 8

work page arXiv 2021

[47] [47]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 1, 3

work page 2022

[49] [49]

Peft: State-of-the-art parameter-efficient fine-tuning methods

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022. 3

work page 2022

[50] [50]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 1, 3, 5, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Tool Learning with Foundation Models

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023. 4

work page internal anchor Pith review arXiv 2023

[52] [52]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4

work page 2021

[53] [53]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3

work page 2018

[54] [54]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 3

work page 2019

[55] [55]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 3

work page 2022

[56] [56]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2021

[57] [57]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3, 8

work page 2018

[58] [58]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Lst: Ladder side-tuning for parameter and memory efficient transfer learning

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. arXiv preprint arXiv:2206.06522, 2022. 3

work page arXiv 2022

[60] [60]

Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237,

work page

[61] [61]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 4

work page internal anchor Pith review arXiv 2023

[62] [62]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1, 3

work page 2023

[63] [63]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

Show and tell: Lessons learned from the 2015 mscoco image captioning challenge

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663,

work page 2015

[65] [65]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[66] [66]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. 3

work page 2022

[67] [67]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

Baize: An open-source chat model with parameter-efficient tuning on self-chat data

Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196,

work page arXiv

[69] [69]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[70] [70]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199,

work page arXiv

[71] [71]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022. 3

work page 2022

[72] [72]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,

work page internal anchor Pith review Pith/arXiv arXiv

[73] [73]

Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, and Peng Gao. Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR 2023, 2023. 3

work page 2023

[74] [74]

Tip-adapter: Training-free adaption of clip for few-shot classification

Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, 2022. 3

work page 2022

[75] [75]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[76] [76]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 3

work page 2022

[77] [77]

Unified vision-language pre-training for image captioning and vqa

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13041–13049, 2020. 3

work page 2020

[78] [78]

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. 2023. 2, 3, 6, 7

work page 2023

[79] [79]

Pointclip v2: Adapting clip for powerful 3d open-world learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyao Zeng, Shanghang Zhang, and Peng Gao. Pointclip v2: Adapting clip for powerful 3d open-world learning. arXiv preprint arXiv:2211.11682, 2022. 4

work page arXiv 2022