
arxiv: 2304.15010 · v1 · pith:C6RUBASQnew · submitted 2023-04-28 · 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Pith reviewed 2026-05-15 08:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG · cs.MM
keywords LLaMA-Adapter · parameter-efficient tuning · visual instruction following · multi-modal LLM · early fusion · joint training · instruction tuning

The pith

LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt large language models for multi-modal reasoning without full retraining. It augments the original LLaMA-Adapter by unlocking extra learnable parameters (norms, biases, and scales) so that instruction-following ability is distributed across the whole model rather than confined to the adapters. Visual tokens are fed only into the early layers through an early-fusion step, and image-text alignment and instruction following are trained jointly on disjoint parameter groups to reduce interference between the two tasks. Expert models for captioning or OCR can be added at inference time with no added training cost. The result is stronger open-ended visual instruction following and better language-only chat ability than the first version, using only small-scale image-text and instruction data.
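The parameter-unlocking step can be pictured as a name-based filter over the model's weights. The following is a minimal sketch, not the authors' code: the parameter names, shapes, and keyword list are hypothetical, chosen only to show how a small subset of weights might be marked trainable while the large matrices stay frozen.

```python
from math import prod

# Hypothetical parameter table: name -> shape. Names and sizes are
# illustrative assumptions, not taken from the LLaMA-Adapter V2 code.
PARAMS = {
    "layers.0.attention.wq.weight": (4096, 4096),
    "layers.0.attention.wq.bias": (4096,),
    "layers.0.attention.wq.scale": (4096,),
    "layers.0.attention_norm.weight": (4096,),
    "layers.0.feed_forward.w1.weight": (11008, 4096),
    "layers.0.feed_forward.w1.bias": (11008,),
    "layers.0.ffn_norm.weight": (4096,),
}

# Substrings that mark a parameter as unlocked (assumed naming convention).
UNLOCK_KEYWORDS = ("norm", "bias", "scale")


def is_unlocked(name: str) -> bool:
    """Return True if the parameter would receive gradient updates."""
    return any(key in name for key in UNLOCK_KEYWORDS)


def count(names) -> int:
    """Total number of scalar weights across the given parameter names."""
    return sum(prod(PARAMS[n]) for n in names)


trainable = [n for n in PARAMS if is_unlocked(n)]
frozen = [n for n in PARAMS if not is_unlocked(n)]

trainable_count = count(trainable)
frozen_count = count(frozen)
fraction = trainable_count / (trainable_count + frozen_count)
print(f"trainable={trainable_count} frozen={frozen_count} fraction={fraction:.5f}")
```

In this toy table the unlocked vectors are a tiny fraction of the total, which is the general shape of the efficiency argument; the actual count and placement in the paper are not given in the text above.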

Core claim

LLaMA-Adapter V2 shows that three changes together produce open-ended multi-modal instruction following with 14 million extra parameters over LLaMA: unlocking additional parameters across the base LLaMA model, placing visual tokens in early layers only, and training on image-text pairs and instruction data with disjoint parameter sets.
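Disjoint-parameter joint training can be sketched as an update rule that moves only the group tied to the current batch type. This is a toy illustration under stated assumptions: the parameter names, the group assignment, and the constant gradients are all hypothetical and stand in for whatever split the paper actually uses.

```python
# Two disjoint parameter groups, one per data source. The membership shown
# here is a hypothetical illustration, not the paper's actual assignment.
GROUPS = {
    "caption": ("visual_projection", "early_gate"),
    "instruction": ("late_adapter", "norm_scale"),
}


def sgd_step(params, grads, task, lr=0.1):
    """Return a copy of `params` with SGD applied only to `task`'s group."""
    updated = dict(params)
    for name in GROUPS[task]:
        updated[name] = params[name] - lr * grads[name]
    return updated


params = {
    "visual_projection": 0.0,
    "early_gate": 0.0,
    "late_adapter": 0.0,
    "norm_scale": 1.0,
}

# Made-up gradients: every parameter receives one, but only the active
# group is allowed to move.
grads = {name: 1.0 for name in params}

after_caption = sgd_step(params, grads, "caption")
after_both = sgd_step(after_caption, grads, "instruction")
print(after_caption)
print(after_both)
```

Because the groups share no parameters, a caption batch cannot overwrite what an instruction batch learned, and vice versa. Whether that is sufficient to remove interference in practice is the empirical question the referee report raises.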

What carries the argument

Early fusion of visual tokens into initial LLM layers together with disjoint-parameter joint training on image-text alignment and instruction data.
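A rough way to picture early fusion is a forward loop in which the visual feature is injected only while the layer index is below a cut-off. The sketch below is a deliberate simplification: the layer count, the cut-off value, the additive injection, and the `toy_layer` stand-in are all assumptions made for illustration, not details from the paper.

```python
NUM_LAYERS = 8
EARLY_LAYERS = 2  # hypothetical cut-off; the text above does not state a value


def toy_layer(hidden, layer_idx):
    """Stand-in for a transformer block: a simple elementwise shift."""
    return [h + 0.01 * (layer_idx + 1) for h in hidden]


def forward(text_hidden, visual_feature):
    """Add the visual feature to the hidden state in early layers only."""
    hidden = list(text_hidden)
    fused_layers = []
    for idx in range(NUM_LAYERS):
        if idx < EARLY_LAYERS:
            # Visual information enters here and nowhere later.
            hidden = [h + v for h, v in zip(hidden, visual_feature)]
            fused_layers.append(idx)
        hidden = toy_layer(hidden, idx)
    return hidden, fused_layers


out_text_only, _ = forward([0.0, 0.0], [0.0, 0.0])
out_with_image, fused = forward([0.0, 0.0], [1.0, 2.0])
print("layers that saw visual input:", fused)
print("effect of the image:", [a - b for a, b in zip(out_with_image, out_text_only)])
```

Later layers still see the image indirectly, through the hidden state carried forward, but receive no fresh visual input of their own.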

If this is right

  • Open-ended multi-modal instructions become possible with far fewer added parameters than full fine-tuning.
  • Language-only instruction following improves as a side effect of the same training.
  • Expert vision modules can be swapped in at test time without retraining the core model.
  • Small-scale image-text and instruction datasets suffice for competitive multi-modal reasoning.
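The test-time expert idea can be sketched as plain prompt construction: an external captioner or OCR system produces text, and that text is placed in front of the instruction. Everything in this example is hypothetical, including the `caption_expert` stub, its fixed output, and the prompt template; the paper's actual integration may differ.

```python
def caption_expert(image_path: str) -> str:
    """Hypothetical stand-in for an external captioning model."""
    return "a dog catching a frisbee in a park"


def build_prompt(instruction: str, image_path: str, expert=caption_expert) -> str:
    """Prepend the expert's text output as context for the language model."""
    description = expert(image_path)
    return (
        f"Image description: {description}\n"
        f"Instruction: {instruction}\n"
        "Response:"
    )


prompt = build_prompt("What is the dog doing?", "dog.jpg")
print(prompt)

# Swapping the expert requires no retraining, only a different callable.
ocr_prompt = build_prompt(
    "Read the sign.", "sign.jpg", expert=lambda path: "text in image: STOP"
)
print(ocr_prompt)
```

Since the expert is just a callable passed at inference time, replacing it leaves the trained parameters untouched, which is the sense in which the swap is free of training cost.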

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unlocked-parameter and early-fusion pattern may transfer directly to other base language models beyond LLaMA.
  • Scaling the base model size could reduce the relative parameter overhead even further while preserving the same training separation.
  • The approach suggests a route to multi-modal capability that avoids the need for massive paired training corpora.
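The scaling point above is simple arithmetic and can be checked directly. The sketch below assumes round base sizes of 7B and 65B parameters and, more importantly, assumes the added count stays fixed at 14M for a larger base model; in practice the adapter size would probably change with model width and depth, so the second figure is illustrative only.

```python
ADDED_PARAMS = 14_000_000  # figure quoted in the abstract

# Approximate base sizes, used only for illustration (assumption).
BASE_SIZES = {"7B": 7_000_000_000, "65B": 65_000_000_000}


def overhead_percent(base_params: int, added: int = ADDED_PARAMS) -> float:
    """Added parameters as a percentage of the base model size."""
    return 100.0 * added / base_params


overheads = {label: overhead_percent(size) for label, size in BASE_SIZES.items()}
for label, pct in overheads.items():
    print(f"{label}: {pct:.3f}% added parameters")
```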

Load-bearing premise

Early fusion and disjoint parameter groups will keep preventing interference between alignment and instruction tasks even when the data distribution shifts or larger base models are used.

What would settle it

Performance collapse on a new open-ended visual instruction benchmark that shows clear task interference between image-text alignment and instruction following would disprove the central claim.

Original abstract

How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLaMA-Adapter V2 as a parameter-efficient extension of the original LLaMA-Adapter for visual instruction following. It augments the base model by unlocking additional learnable parameters (norms, biases, scales) across LLaMA layers, introduces an early-fusion strategy that injects visual tokens only into early LLM layers, proposes joint training on image-text pairs and instruction data using disjoint groups of learnable parameters to reduce task interference, and incorporates external expert models (captioning/OCR) at inference time. The central claim is that these changes enable open-ended multi-modal instruction following with only 14M added parameters over LLaMA while also strengthening language-only capabilities.

Significance. If the empirical results hold, the work demonstrates a lightweight route to multi-modal instruction models that avoids full fine-tuning of large LLMs. The combination of parameter unlocking, early fusion, and disjoint-parameter joint training addresses a practical bottleneck in scaling visual instruction tuning, and the inference-time expert integration provides a zero-training-cost way to boost image understanding. These elements could influence subsequent adapter-based multi-modal systems if the interference-reduction mechanism is shown to generalize.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (joint training description): the claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture, nor any diagnostic (gradient similarity, per-task loss curves, or performance delta), is reported. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.
  2. [§4] §4 (experimental results): the headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; the absence of error bars, statistical significance tests, or per-task breakdowns makes it impossible to assess whether the reported gains are robust or driven by particular data subsets.
minor comments (2)
  1. [§2] The exact count and placement of the unlocked norm/bias/scale parameters should be stated explicitly (e.g., which layers and which modules) rather than left as 'e.g., norm, bias and scale'.
  2. [Figures] Figure captions and axis labels in the (referenced) ablation and comparison figures should include the precise training data sizes and hyper-parameter settings used for each variant to allow direct reproduction.
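One of the diagnostics named in the first major comment, gradient similarity, is cheap to compute once per-task gradients are available. The sketch below shows the cosine-similarity calculation on flattened gradient vectors; the two example vectors are invented numbers, not measurements from the paper.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention for a zero gradient
    return dot / (norm1 * norm2)


# Invented gradients of a shared parameter under each task's loss.
grad_alignment = [0.5, -1.0, 0.25]
grad_instruction = [-0.4, 0.9, 0.1]

similarity = cosine_similarity(grad_alignment, grad_instruction)
print(f"cosine similarity: {similarity:.3f}")
```

A value near -1 on shared parameters would indicate that the two tasks pull in opposite directions, which is the kind of evidence that would motivate a disjoint split; values near zero or above would weaken that motivation.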

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (joint training description): the claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture, nor any diagnostic (gradient similarity, per-task loss curves, or performance delta), is reported. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.

    Authors: We agree that an explicit ablation isolating the disjoint-parameter strategy would strengthen the claim. Our joint training results demonstrate that optimizing disjoint parameter groups on image-text pairs and instruction data yields strong multi-modal performance with only 14M added parameters, outperforming the original LLaMA-Adapter. In the revision we will add an ablation comparing shared versus disjoint optimization on the same data mixture, including performance deltas and task-specific metrics to isolate the interference-reduction effect. revision: yes

  2. Referee: [§4] §4 (experimental results): the headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; the absence of error bars, statistical significance tests, or per-task breakdowns makes it impossible to assess whether the reported gains are robust or driven by particular data subsets.

    Authors: The full manuscript contains Tables 1-4 with quantitative comparisons on VQA, captioning, and instruction-following benchmarks. We acknowledge that error bars, significance tests, and per-task breakdowns would improve assessment of robustness. In the revision we will add error bars for multi-seed experiments, report statistical significance where applicable, and include expanded per-task breakdowns in the main text or supplementary material. revision: partial
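The multi-seed error bars promised in the second response amount to a mean and a spread over repeated runs. A minimal sketch of that summary follows, using Python's standard `statistics` module; the scores are placeholder numbers invented for the example and do not come from the paper.

```python
import statistics


def summarize(scores):
    """Return mean, sample standard deviation, and standard error."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample (n - 1) standard deviation
    sem = std / len(scores) ** 0.5
    return mean, std, sem


# Placeholder scores from three hypothetical seeds of each variant.
runs = {
    "disjoint": [61.0, 62.0, 63.0],
    "shared": [59.0, 62.0, 59.5],
}

for variant, scores in runs.items():
    mean, std, sem = summarize(scores)
    print(f"{variant}: {mean:.2f} ± {std:.2f} (SEM {sem:.2f}, n={len(scores)})")
```

With only a handful of seeds the intervals are wide, so overlapping ranges between variants would mean the comparison is inconclusive rather than a tie.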

Circularity Check

0 steps flagged

No circularity: empirical architecture and training choices with external benchmarks

full rationale

The paper describes three design choices (unlocking norm/bias/scale parameters, early visual fusion, and disjoint-parameter joint training on image-text plus instruction data) followed by empirical evaluation against the prior LLaMA-Adapter and external models. No equations, first-principles derivations, or predictions appear in the provided text. The 14M-parameter claim is a direct count of added learnable weights, not a fitted or self-referential quantity. Self-citation to the original LLaMA-Adapter is present but functions only as a baseline comparison, not as load-bearing justification for any uniqueness theorem or ansatz. The joint-training interference claim is asserted without an internal ablation, but this is an evidence gap rather than circularity; the reported results remain falsifiable against held-out benchmarks. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard transformer architecture and training assumptions already present in the LLaMA literature; no new physical or mathematical entities are postulated.

axioms (2)
  • standard math: Transformer layers with standard attention and feed-forward blocks can be adapted via low-rank or partial-parameter updates while preserving core language capabilities.
    Invoked implicitly when the authors unlock norm, bias, and scale parameters across the full model.
  • domain assumption: Visual tokens can be treated as additional sequence elements that the language model can attend to without architectural redesign.
    Underlying the early-fusion strategy.

pith-pipeline@v0.9.0 · 5652 in / 1338 out tokens · 28046 ms · 2026-05-15T08:36:50.707814+00:00 · methodology



Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  3. Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

  4. Selective LoRA for Visual Tokens and Attention Heads

    cs.CV 2025-12 unverdicted novelty 7.0

    Image-LoRA selectively adapts only visual tokens and chosen attention heads in VLMs, matching standard LoRA performance with lower parameter count and FLOPs.

  5. Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

    cs.CV 2025-03 unverdicted novelty 7.0

    Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.

  6. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    cs.AI 2024-07 accept novelty 7.0

    WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

  7. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  8. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  9. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    cs.CV 2023-12 conditional novelty 7.0

    Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.

  10. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  11. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  12. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  13. Multimodal LLMs under Pairwise Modalities

    cs.CV 2026-05 unverdicted novelty 6.0

    A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.

  14. Latent Denoising Improves Visual Alignment in Large Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

  15. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  16. Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    cs.CV 2026-03 unverdicted novelty 6.0

    Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...

  17. ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

    cs.RO 2025-12 conditional novelty 6.0

    ImagineNav++ achieves SOTA mapless visual navigation by prompting VLMs to select imagined future views generated from a human-preference-distilled module and maintained via selective foveation memory.

  18. GD-FPS: Growth-Driven Feedforward Parameter Selection for Efficient Fine-Tuning

    cs.CV 2025-10 unverdicted novelty 6.0

    GD-FPS is a gradient-free, forward-pass-only parameter selection method for PEFT that identifies important weights by scaling magnitudes with relative activation growth against a pre-training anchor, matching or beati...

  19. Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

    cs.CV 2025-09 unverdicted novelty 6.0

    Introduces progressive task-specific multi-task adaptation for vision transformers, sharing adapters early and specializing later with gradient-based task allocation, outperforming prior methods on PASCAL and NYUD-v2 ...

  20. OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    cs.CV 2025-03 conditional novelty 6.0

    Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

  21. MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

    cs.CV 2025-03 unverdicted novelty 6.0

    MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

  22. E5-V: Universal Embeddings with Multimodal Large Language Models

    cs.CL 2024-07 unverdicted novelty 6.0

    E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.

  23. An Embodied Generalist Agent in 3D World

    cs.CV 2023-11 unverdicted novelty 6.0

    LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...

  24. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  25. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    cs.CV 2023-08 unverdicted novelty 6.0

    DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

  26. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  27. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  28. Large Language Models are not Fair Evaluators

    cs.CL 2023-05 conditional novelty 6.0

    LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better matc...

  29. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  30. Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization

    cs.RO 2026-04 unverdicted novelty 5.0

    Adaptor uses few-shot learning with trajectory perturbation and vision-language conditioning to achieve robust cross-operator intent recognition and higher success rates in assistive teleoperation.

  31. Grounding Everything in Tokens for Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 5.0

    GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.

  32. PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.

  33. LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

    cs.CV 2025-01 unverdicted novelty 5.0

    LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.

  34. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  35. InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    cs.CV 2024-01 unverdicted novelty 5.0

    InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

  36. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  37. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

    cs.CL 2023-11 unverdicted novelty 5.0

    mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.

  38. MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

    cs.LG 2023-08 unverdicted novelty 5.0

    MM-LIMA uses proposed quality metrics and a trainable selector to pick 200 high-quality multimodal instruction examples and outperforms MiniGPT-4 on evaluations.

  39. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  40. AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

    cs.AI 2024-08 unverdicted novelty 4.0

    The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.

  41. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  42. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  43. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

    cs.CV 2023-09 conditional novelty 4.0

    InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

  44. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

  45. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

  46. A Comprehensive Overview of Large Language Models

    cs.CL 2023-07 unverdicted novelty 2.0

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 46 Pith papers · 19 internal anchors

  1. [1]

    https://sharegpt.com/

    Sharegpt: Share your wildest chatgpt conversations with one click. https://sharegpt.com/. 1, 3, 7, 8

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736,

  3. [3]

    Bottom-up and top-down attention for image captioning and visual question answering

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 6077–6086, 2018. 3

  4. [4]

    Lan- guage models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural in- formation processing systems, 33:1877–1901, 2020. 3

  5. [5]

    Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 3, 8

  6. [6]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 3, 4, 6, 7, 8, 10

  7. [7]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality, March 2023. 1, 3, 8

  8. [8]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3

  9. [9]

    Meshed-memory transformer for image cap- tioning

    Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image cap- tioning. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 10578–10587,

  10. [10]

    Using lora for efficient sta- ble diffusion fine-tuning

    Pedro Cuenca and Sayak Paul. Using lora for efficient sta- ble diffusion fine-tuning. https://huggingface.co/ blog/lora, January 2023. 3

  11. [11]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171–4186, 2019. 3

  12. [12]

    Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models.arXiv preprint arXiv:2203.06904, 2022

    Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zong- han Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi- Min Chan, Weize Chen, et al. Delta tuning: A comprehen- sive study of parameter efficient methods for pre-trained lan- guage models. arXiv preprint arXiv:2203.06904, 2022. 3

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3

  14. [14]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 4

  15. [15]

    Clip-adapter: Better vision-language models with feature adapters

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021. 3

  16. [16]

    Dynamic fusion with intra- and inter-modality attention flow for visual question answering

    Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6639–6648, 2019. 3

  17. [17]

    Making pre-trained language models better few-shot learners

    Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In ACL-IJCNLP, 2021. 3

  18. [18]

    OpenAGI: When LLM Meets Domain Experts

    Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370,

  19. [19]

    Koala: A dialogue model for academic research

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. 1

  20. [20]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3

  21. [21]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022. 4

  22. [22]

    Towards a unified view of parameter-efficient transfer learning

    Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022. 3

  23. [23]

    A comprehensive survey of deep learning for image captioning

    MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1–36, 2019. 3

  24. [24]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. 3

  25. [25]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 5

  26. [26]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 3

  27. [27]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Lea...

  28. [28]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pages 709–727. Springer, 2022. 3, 5

  29. [29]

    Vilt: Vision-and-language transformer without convolution or region supervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5583–5594, 2021. 3

  30. [30]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 3, 8

  31. [31]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP,

  32. [32]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 3, 8

  33. [33]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 8

  34. [34]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. 3

  35. [35]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021. 3

  36. [36]

    Scaling & shifting your features: A new baseline for efficient model tuning

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. arXiv preprint arXiv:2210.08823,

  37. [37]

    Text2motion: From natural language instructions to feasible plans

    Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, and Jeannette Bohg. Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023. 4

  38. [38]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

  39. [39]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS), pages 13–23, 2019. 3

  40. [40]

    Knowing when to look: Adaptive attention via a visual sentinel for image captioning

    Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017. 3

  41. [41]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 3, 4

  42. [42]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842,

  43. [43]

    Compacter: Efficient low-rank hypercomplex adapter layers

    Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. 3

  44. [44]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 10, 11

  45. [45]

    Film: Following instructions in language with modular methods

    So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. ArXiv, abs/2110.07342, 2021. 3

  46. [46]

    Clipcap: Clip prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021. 8

  47. [47]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,

  48. [48]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 1, 3

  49. [49]

    Peft: State-of-the-art parameter-efficient fine-tuning methods

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022. 3

  50. [50]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023. 1, 3, 5, 7, 8

  51. [51]

    Tool Learning with Foundation Models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023. 4

  52. [52]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4

  53. [53]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3

  54. [54]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 3

  55. [55]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 3

  56. [56]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3, 8

  57. [57]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3, 8

  58. [58]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023. 4

  59. [59]

    Lst: Ladder side-tuning for parameter and memory efficient transfer learning

    Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. arXiv preprint arXiv:2206.06522, 2022. 3

  60. [60]

    Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks

    Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237,

  61. [61]

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 4

  62. [62]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1, 3

  63. [63]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 3, 4

  64. [64]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663,

  65. [65]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022. 3

  66. [66]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. 3

  67. [67]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 4

  68. [68]

    Baize: An open-source chat model with parameter-efficient tuning on self-chat data

    Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196,

  69. [69]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 4

  70. [70]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199,

  71. [71]

    Pointclip: Point cloud understanding by clip

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022. 3

  72. [72]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,

  73. [73]

    Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

    Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, and Peng Gao. Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR 2023, 2023. 3

  74. [74]

    Tip-adapter: Training-free adaption of clip for few-shot classification

    Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, 2022. 3

  75. [75]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 1

  76. [76]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, pages 1–12, 2022. 3

  77. [77]

    Unified vision-language pre-training for image captioning and vqa

    Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13041–13049, 2020. 3

  78. [78]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. 2023. 2, 3, 6, 7

  79. [79]

    Pointclip v2: Adapting clip for powerful 3d open-world learning

    Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyao Zeng, Shanghang Zhang, and Peng Gao. Pointclip v2: Adapting clip for powerful 3d open-world learning. arXiv preprint arXiv:2211.11682, 2022. 4