arxiv: 2303.16199 · v3 · submitted 2023-03-28 · cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao

Pith reviewed 2026-05-14 23:01 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG · cs.MM

keywords LLaMA-Adapter · parameter-efficient fine-tuning · instruction following · zero-init attention · language models · multi-modal adaptation · adapter methods

The pith

LLaMA-Adapter adapts frozen LLaMA to follow instructions using only 1.2 million added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaMA-Adapter as a lightweight way to turn the LLaMA language model into an instruction follower. It adds 1.2 million learnable parameters to the frozen 7 billion parameter model and trains on 52,000 self-instruct examples in under an hour on eight A100 GPUs. Learnable prompts are prepended at higher transformer layers, and a zero-initialized attention mechanism with zero gating blends the new cues into the model without erasing its original knowledge. This produces responses of similar quality to fully fine-tuned models such as Alpaca. The same structure also supports image-conditioned instructions and applies to fine-tuning other models like ViT and RoBERTa.

Core claim

LLaMA-Adapter prepends a set of learnable adaptation prompts to word tokens at higher transformer layers of the frozen LLaMA 7B model. A zero-initialized attention mechanism with zero gating adaptively injects the instructional cues while preserving pre-trained knowledge. After training on 52K demonstrations, the resulting model generates high-quality instruction-following responses comparable to Alpaca, which requires full fine-tuning of all parameters.
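The gating idea in the core claim can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the function name `gated_prompt_attention`, the separate softmax over prompt scores, and the `tanh` on the gate are assumptions made here for illustration, and the tensors are random toy data. The sketch only shows that a gate initialized at zero leaves the frozen model's attention output unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows containing -inf entries still sum to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # 0 on and below the diagonal, -inf strictly above it.
    return np.triu(np.full((n, n), -np.inf), k=1)

def gated_prompt_attention(q, k_tok, v_tok, k_prm, v_prm, gate):
    """Token attention plus a separately normalized, gated prompt branch.

    Assumption for illustration: prompt scores get their own softmax and are
    scaled by tanh(gate), so gate = 0 switches the prompt branch off entirely.
    """
    d = q.shape[-1]
    s_tok = q @ k_tok.T / np.sqrt(d) + causal_mask(q.shape[0])
    s_prm = q @ k_prm.T / np.sqrt(d)   # every token may attend to every prompt
    a_tok = softmax(s_tok)
    a_prm = np.tanh(gate) * softmax(s_prm)
    return a_tok @ v_tok + a_prm @ v_prm

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
n_tok, n_prm, d = 5, 3, 8
q, k_tok, v_tok = (rng.normal(size=(n_tok, d)) for _ in range(3))
k_prm, v_prm = (rng.normal(size=(n_prm, d)) for _ in range(2))

# Plain causal attention without any prompts, i.e. the frozen model's behavior.
baseline = softmax(q @ k_tok.T / np.sqrt(d) + causal_mask(n_tok)) @ v_tok

frozen = gated_prompt_attention(q, k_tok, v_tok, k_prm, v_prm, gate=0.0)
adapted = gated_prompt_attention(q, k_tok, v_tok, k_prm, v_prm, gate=0.5)
```

With `gate=0.0` the output `frozen` coincides with `baseline`; any non-zero gate, as in `adapted`, mixes prompt information in. In a real adapter the gate would be a learned parameter, typically one per layer or per head.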

What carries the argument

Zero-initialized attention with zero gating that adaptively injects new instructional cues into higher layers of the frozen model.

Load-bearing premise

The zero-initialized attention with zero gating can selectively add instructional information without disrupting the model's pre-trained knowledge.

What would settle it

Train LLaMA-Adapter on the same 52K demonstrations and compare its responses to Alpaca's on a held-out set of instructions; if the Adapter outputs are consistently lower quality by human judgment or automatic metrics, the comparability claim fails.
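A minimal sketch of how such a head-to-head comparison might be tallied, assuming each held-out instruction receives one pairwise judgment. The label names and the example list are hypothetical placeholders, not results from the paper.

```python
from collections import Counter

def win_rate(judgments):
    """Share of 'adapter' wins, 'alpaca' wins, and ties over pairwise judgments."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    return {label: counts.get(label, 0) / total
            for label in ("adapter", "alpaca", "tie")}

# Hypothetical labels for illustration only.
example = ["adapter", "alpaca", "tie", "adapter",
           "alpaca", "alpaca", "tie", "adapter"]
rates = win_rate(example)

# Positive margin means Alpaca is preferred more often than the adapter.
margin = rates["alpaca"] - rates["adapter"]
```

A consistently large positive `margin` across judges or metrics would count against the comparability claim; a margin near zero would support it. A real evaluation would also need a significance test and a fixed judging protocol.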

Original abstract

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLaMA-Adapter gets close to Alpaca performance on LLaMA-7B with 1.2M parameters via learnable prompts and zero-init attention, but the mechanism's specific contribution lacks direct ablation support.

The punchline is that this adapter gets you Alpaca-comparable instruction following on LLaMA-7B by training only 1.2M parameters in under an hour, using learnable prompts at higher layers and a zero-init attention gate to blend them in.

The work does a few things cleanly. It freezes the base model entirely, adds a small set of adaptation prompts, and uses the zero-initialized attention with zero gating to control how much new instructional signal gets injected at each layer. That design choice avoids overwriting pre-trained weights, which is the point. They train on 52K self-instruct examples and report results comparable to fully fine-tuned Alpaca on language tasks. The extension to multi-modal instructions, where they condition on images for ScienceQA and COCO captioning, shows the same mechanism works beyond text. They also apply the zero-init attention to ViT and RoBERTa on classic vision and language benchmarks, which gives some evidence that the trick isn't tied only to LLaMA.

The soft spot is the validation of the zero-init mechanism itself. The abstract and description claim it adaptively injects cues while preserving knowledge, but without controlled ablations (prompts alone versus prompts plus zero-init attention, or zero-init versus random init), the necessity of that specific design isn't fully pinned down. It could be that the learnable prompts plus frozen base are doing most of the work. The multi-modal results look good, but more error analysis or layer-wise studies would strengthen the case for why higher layers and this gating work.

This paper is for practitioners who need fast, low-resource ways to adapt large models for instructions or vision-language tasks. Readers working on efficient fine-tuning or adapter methods will find the concrete numbers and code release useful. The approach is simple enough that it deserves a serious referee to check the experiments and see if the zero-init idea holds up under closer scrutiny.

Recommendation: send it out for review. The efficiency claims are concrete and the method is easy to reproduce, even if the ablation depth could be better.

Referee Report

1 major / 2 minor

Summary. The paper introduces LLaMA-Adapter, a lightweight adaptation method for fine-tuning the frozen LLaMA 7B model into an instruction-following model. It uses 52K self-instruct demonstrations to train only 1.2M learnable parameters consisting of adaptation prompts prepended at higher transformer layers, combined with a zero-initialized attention mechanism and zero gating that is claimed to adaptively inject instructional cues while preserving pre-trained knowledge. Training completes in under one hour on 8 A100 GPUs. The resulting model generates responses claimed to be comparable to fully fine-tuned Alpaca, and the approach extends to multi-modal image-conditioned instructions with strong results on ScienceQA and COCO Caption; the zero-init attention is also evaluated on ViT and RoBERTa for standard vision/language tasks. Code is released.

Significance. If the central performance claims hold and the zero-init mechanism proves necessary, the work would be significant for enabling efficient, low-parameter adaptation of large pre-trained models with minimal compute, lowering barriers to instruction tuning. The dramatic reduction to 1.2M parameters, rapid training time, multi-modal extension, and cross-architecture generalization tests are concrete strengths. Reproducibility via code release further supports impact. However, significance is tempered by the need to confirm the proposed mechanism drives the gains rather than the adaptation prompts and frozen backbone alone.

major comments (1)

[Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.

minor comments (2)

[Abstract and Experiments (§4)] Abstract and §4: The claim of 'comparable to Alpaca with fully fine-tuned 7B parameters' should specify the exact evaluation benchmarks, metrics (e.g., GPT-4 win rates or human scores), and Alpaca baseline details for direct comparison.
[Method figures] Figure 2 or method diagram: Clarify the exact placement of adaptation prompts and the mathematical formulation of the zero-gating scalar to avoid ambiguity in replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point-by-point below and have incorporated revisions to strengthen the manuscript.

Referee: [Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.

Authors: We agree that the original experiments lacked direct controlled ablations isolating the contribution of zero-initialized attention and zero gating from the adaptation prompts alone. To address this, we have added new ablation studies in the revised manuscript (new Section 3.4 and Table 3). These include: (1) a prompt-only baseline without the gated attention, (2) random initialization of the attention weights instead of zero-init, and (3) non-zero gating variants. Results show that random initialization leads to training instability and ~12% lower performance on instruction-following benchmarks compared to zero-init, while the prompt-only baseline underperforms the full model by a noticeable margin. Non-zero gating also yields suboptimal results. We have updated the text in §3 to better motivate the zero-init design based on these findings, confirming it enables adaptive injection while preserving pre-trained knowledge. These additions substantiate the mechanism's necessity.
revision: yes

Circularity Check

0 steps flagged

No circularity: method defined by new components and trained on external data

The paper defines LLaMA-Adapter via learnable adaptation prompts prepended at higher layers plus a zero-initialized attention with zero gating, then trains the 1.2M parameters on 52K external self-instruct demonstrations. Reported performance (Alpaca-comparable responses, ScienceQA/COCO gains) follows from this training rather than any equation that reduces the output to the same fitted quantities by construction. No self-citations are load-bearing for the central claims, no uniqueness theorems are imported from prior author work, and no known empirical patterns are merely renamed. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the standard transformer architecture plus two new introduced elements: learnable prompts and the zero-init attention variant. No external benchmarks or proofs are invoked beyond the empirical training run.

free parameters (1)

learnable adaptation prompts
1.2M parameters introduced and optimized on the 52K self-instruct demonstrations.
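As a rough sanity check on the 1.2M figure, the count can be reproduced under assumed settings. The prompt length of 10 and the 30 adapted layers below are assumptions not stated on this page; 4096 is the hidden size of LLaMA 7B. Any small per-layer gating scalars are ignored.

```python
def prompt_param_count(prompt_len, num_adapted_layers, hidden_dim):
    # One learnable vector of size hidden_dim per prompt position per adapted layer.
    return prompt_len * num_adapted_layers * hidden_dim

# Assumed configuration for illustration; see the lead-in for what is assumed.
n = prompt_param_count(prompt_len=10, num_adapted_layers=30, hidden_dim=4096)

# Share of a 7-billion-parameter backbone taken up by the added prompts.
share = n / 7_000_000_000
```

Under these assumptions `n` is 1,228,800, which rounds to 1.2M and amounts to under 0.02 percent of the frozen 7B backbone.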

axioms (1)

domain assumption: Pre-trained LLaMA weights contain useful general knowledge that should remain largely unchanged during adaptation.
Invoked to justify the zero-initialization strategy that starts with no effect on existing representations.
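One way to write down "starts with no effect" is sketched here. The notation is introduced for illustration and is not taken from the paper: $S_l^{\mathrm{tok}}, V_l^{\mathrm{tok}}$ are the attention scores and values over word tokens at layer $l$, $S_l^{\mathrm{prm}}, V_l^{\mathrm{prm}}$ are those over the adaptation prompts, and $g_l$ is a learnable gate. The $\tanh$ and the separate softmax are assumptions.

```latex
\begin{aligned}
h_l(g_l) &= \operatorname{softmax}\!\big(S_l^{\mathrm{tok}}\big)\,V_l^{\mathrm{tok}}
          \;+\; \tanh(g_l)\,\operatorname{softmax}\!\big(S_l^{\mathrm{prm}}\big)\,V_l^{\mathrm{prm}}, \\
h_l(0)   &= \operatorname{softmax}\!\big(S_l^{\mathrm{tok}}\big)\,V_l^{\mathrm{tok}}
          \quad\text{(identical to the frozen model)}, \\
\left.\frac{\partial h_l}{\partial g_l}\right|_{g_l = 0}
         &= \operatorname{softmax}\!\big(S_l^{\mathrm{prm}}\big)\,V_l^{\mathrm{prm}}
          \quad\text{since } \tanh'(0) = 1 .
\end{aligned}
```

Under this reading, initializing $g_l = 0$ reproduces the pre-trained output exactly at the start of training, while the non-zero derivative with respect to $g_l$ means the gate can still move away from zero as training proceeds.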

invented entities (1)

zero-initialized attention mechanism with zero gating (no independent evidence)
purpose: To allow new instructional cues to be injected gradually without initially disrupting pre-trained behavior.
New mechanism proposed in the paper; no independent evidence outside the training results is provided.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
cs.CV 2023-08 unverdicted novelty 6.0

IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
Multimodal Chain-of-Thought Reasoning in Language Models
cs.CL 2023-02 accept novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
cs.AI 2026-05 unverdicted novelty 5.0

Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...
MAny: Merge Anything for Multimodal Continual Instruction Tuning
cs.LG 2026-04 unverdicted novelty 5.0

MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
cs.CV 2023-04 conditional novelty 5.0

LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 4.0

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
cs.LG 2024-03 accept novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
cs.CV 2023-08 unverdicted novelty 4.0

OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

278 extracted references · 278 canonical work pages · cited by 23 Pith papers · 38 internal anchors

[1]

https://github.com/tloen/alpaca-lora, 2023

Alpaca-lora. https://github.com/tloen/alpaca-lora, 2023

work page 2023
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 0 23716--23736, 2022

work page 2022
[4]

Open llm leaderboard

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, and Lewis Tunstalland Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

work page 2023
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[6]

Introduction to the conll-2004 shared task: Semantic role labeling

Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2004 shared task: Semantic role labeling. In Proceedings of the eighth conference on computational natural language learning (CoNLL-2004) at HLT-NAACL 2004, pp.\ 89--97, 2004

work page 2004
[7]

Introduction to the conll-2005 shared task: Semantic role labeling

Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005), pp.\ 152--164, 2005

work page 2005
[8]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ https://lmsys.org/blog/2023-03-30-vicuna/, March 2023

work page 2023
[9]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

work page 2018
[11]

Using lora for efficient stable diffusion fine-tuning

Pedro Cuenca and Sayak Paul. Using lora for efficient stable diffusion fine-tuning. https://huggingface.co/blog/lora, January 2023

work page 2023
[12]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a

work page 2023
[13]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 248--255, 2009

work page 2009
[18]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004

work page 2004
[19]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems

Andrea Gesmundo and Jeff Dean. munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems. arXiv preprint arXiv:2205.10937, 2022

work page arXiv 2022
[23]

Multimodal-gpt: A vision and language model for dialogue with humans, 2023

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023

work page 2023
[24]

Google. Bard. https://bard.google.com/, 2023

work page 2023
[25]

Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains

Koustava Goswami, Lukas Lange, Jun Araki, and Heike Adel. Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains. arXiv preprint arXiv:2302.06868, 2023

work page arXiv 2023
[26]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

work page 2017
[29]

Structured pruning adapters

Lukas Hedegaard, Aman Alok, Juby Jose, and Alexandros Iosifidis. Structured pruning adapters. arXiv preprint arXiv:2211.10155, 2022

work page arXiv 2022
[30]

Measuring massive multitask language understanding, 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

work page 2021
[31]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799. PMLR, 2019

work page 2019
[33]

Language is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

work page arXiv 2023
[34]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

work page 2021
[35]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pp.\ 709--727. Springer, 2022

work page 2022
[36]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34: 0 1022--1035, 2021

work page 2021
[37]

Unifiedqa: Crossing format boundaries with a single qa system

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. In Findings of the Association for Computational Linguistics (EMNLP), pp.\ 1896--1907, 2020

work page 1907
[38]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. arXiv preprint arXiv:2210.03117, 2022

work page arXiv 2022
[40]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023 a

work page internal anchor Pith review arXiv 2023
[43]

What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 5265--5275, 2020

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 5265--5275, 2020

work page 2020
[44]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10965--10975, 2022

work page 2022
[46]

Attention-guided unified network for panoptic segmentation

Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7026--7035, 2019 b

work page 2019
[48]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 d

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Truthfulqa: Measuring how models mimic human falsehoods, 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

work page 2022
[50]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.\ 740--755. Springer, 2014

work page 2014
[52]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023 b

[54]

P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021a

[55]

GPT understands, too

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021b

[56]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

[57]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c

[58]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

[59]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008

[60]

OpenAI. ChatGPT. https://chat.openai.com, 2023a

[61]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023b

[63]

Peft: State-of-the-art parameter-efficient fine-tuning methods

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

[64]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

[65]

Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint conference on EMNLP and CoNLL-shared task, pp. 1–40, 2012

[66]

E2E NLG challenge: Neural models vs. templates

Yevgeniy Puzikov and Iryna Gurevych. E2E NLG challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation, pp. 463–471, 2018

[67]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

[68]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021

[69]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

[70]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

[71]

Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003

[72]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

[73]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023

[74]

Stanford Alpaca: An instruction-following LLaMA model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[76]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

[77]

Self-instruct: Aligning language model with self generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022a

[78]

Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing...

[80]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023

[81]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

[82]

Improving visual prompt tuning for self-supervised vision transformers

Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, and Sungroh Yoon. Improving visual prompt tuning for self-supervised vision transformers. arXiv preprint arXiv:2306.05067, 2023

[83]

Deep modular co-attention networks for visual question answering

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6281–6290, 2019

[84]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1–9, 2022

[85]

Hellaswag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019

[86]

A large-scale study of representation learning with the visual task adaptation benchmark

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

[87]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022

[88]

Transfer visual prompt generator across llms

Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. CoRR, abs/2305.01278, 2023a. URL https://doi.org/10.48550/arXiv.2305.01278

[89]

Side-tuning: a baseline for network adaptation via additive side networks

Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 698–714. Springer, 2020

[90]

What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models

Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Xin Wen, and Bingchen Zhao. What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023b

[91]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023c

[92]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023d

[97]

Zero initialization: Initializing residual networks with only zeros and ones

Jiawei Zhao, Florian Tobias Schaefer, and Anima Anandkumar. Zero initialization: Initializing residual networks with only zeros and ones. 2021

[98]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022a

[99]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022b

[100]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022c

[101]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

[102]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

