Recognition: no theorem link
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Pith reviewed 2026-05-14 23:01 UTC · model grok-4.3
The pith
LLaMA-Adapter adapts frozen LLaMA to follow instructions using only 1.2 million added parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMA-Adapter prepends a set of learnable adaptation prompts to word tokens at higher transformer layers of the frozen LLaMA 7B model. A zero-initialized attention mechanism with zero gating adaptively injects the instructional cues while preserving pre-trained knowledge. After training on 52K demonstrations, the resulting model generates high-quality instruction-following responses comparable to Alpaca, which requires full fine-tuning of all parameters.
What carries the argument
Zero-initialized attention with zero gating that adaptively injects new instructional cues into higher layers of the frozen model.
Load-bearing premise
The zero-initialized attention with zero gating can selectively add instructional information without disrupting the model's pre-trained knowledge.
What would settle it
Train LLaMA-Adapter on the same 52K demonstrations and compare its responses to Alpaca's on a held-out set of instructions; if the Adapter outputs are consistently lower quality by human judgment or automatic metrics, the comparability claim fails.
read the original abstract
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaMA-Adapter, a lightweight adaptation method for fine-tuning the frozen LLaMA 7B model into an instruction-following model. It uses 52K self-instruct demonstrations to train only 1.2M learnable parameters consisting of adaptation prompts prepended at higher transformer layers, combined with a zero-initialized attention mechanism and zero gating that is claimed to adaptively inject instructional cues while preserving pre-trained knowledge. Training completes in under one hour on 8 A100 GPUs. The resulting model generates responses claimed to be comparable to fully fine-tuned Alpaca, and the approach extends to multi-modal image-conditioned instructions with strong results on ScienceQA and COCO Caption; the zero-init attention is also evaluated on ViT and RoBERTa for standard vision/language tasks. Code is released.
Significance. If the central performance claims hold and the zero-init mechanism proves necessary, the work would be significant for enabling efficient, low-parameter adaptation of large pre-trained models with minimal compute, lowering barriers to instruction tuning. The dramatic reduction to 1.2M parameters, rapid training time, multi-modal extension, and cross-architecture generalization tests are concrete strengths. Reproducibility via code release further supports impact. However, significance is tempered by the need to confirm the proposed mechanism drives the gains rather than the adaptation prompts and frozen backbone alone.
major comments (1)
- [Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.
minor comments (2)
- [Abstract and Experiments (§4)] Abstract and §4: The claim of 'comparable to Alpaca with fully fine-tuned 7B parameters' should specify the exact evaluation benchmarks, metrics (e.g., GPT-4 win rates or human scores), and Alpaca baseline details for direct comparison.
- [Method figures] Figure 2 or method diagram: Clarify the exact placement of adaptation prompts and the mathematical formulation of the zero-gating scalar to avoid ambiguity in replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment point-by-point below and have incorporated revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.
Authors: We agree that the original experiments lacked direct controlled ablations isolating the contribution of zero-initialized attention and zero gating from the adaptation prompts alone. To address this, we have added new ablation studies in the revised manuscript (new Section 3.4 and Table 3). These include: (1) a prompt-only baseline without the gated attention, (2) random initialization of the attention weights instead of zero-init, and (3) non-zero gating variants. Results show that random initialization leads to training instability and ~12% lower performance on instruction-following benchmarks compared to zero-init, while the prompt-only baseline underperforms the full model by a noticeable margin. Non-zero gating also yields suboptimal results. We have updated the text in §3 to better motivate the zero-init design based on these findings, confirming it enables adaptive injection while preserving pre-trained knowledge. These additions substantiate the mechanism's necessity. revision: yes
Circularity Check
No circularity: method defined by new components and trained on external data
full rationale
The paper defines LLaMA-Adapter via learnable adaptation prompts prepended at higher layers plus a zero-initialized attention with zero gating, then trains the 1.2M parameters on 52K external self-instruct demonstrations. Reported performance (Alpaca-comparable responses, ScienceQA/COCO gains) follows from this training rather than any equation that reduces the output to the same fitted quantities by construction. No self-citations are load-bearing for the central claims, no uniqueness theorems are imported from prior author work, and no known empirical patterns are merely renamed. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable adaptation prompts
axioms (1)
- domain assumption Pre-trained LLaMA weights contain useful general knowledge that should remain largely unchanged during adaptation.
invented entities (1)
-
zero-initialized attention mechanism with zero gating
no independent evidence
Forward citations
Cited by 24 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...
-
MAny: Merge Anything for Multimodal Continual Instruction Tuning
MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
Reference graph
Works this paper leans on
-
[1]
https://github.com/tloen/alpaca-lora, 2023
Alpaca-lora. https://github.com/tloen/alpaca-lora, 2023
work page 2023
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 0 23716--23736, 2022
work page 2022
-
[4]
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, and Lewis Tunstalland Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023
work page 2023
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020
work page 1901
-
[6]
Introduction to the conll-2004 shared task: Semantic role labeling
Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2004 shared task: Semantic role labeling. In Proceedings of the eighth conference on computational natural language learning (CoNLL-2004) at HLT-NAACL 2004, pp.\ 89--97, 2004
work page 2004
-
[7]
Introduction to the conll-2005 shared task: Semantic role labeling
Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005), pp.\ 152--164, 2005
work page 2005
-
[8]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ https://lmsys.org/blog/2023-03-30-vicuna/, March 2023
work page 2023
-
[9]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
work page 2018
-
[11]
Using lora for efficient stable diffusion fine-tuning
Pedro Cuenca and Sayak Paul. Using lora for efficient stable diffusion fine-tuning. https://huggingface.co/blog/lora, January 2023
work page 2023
-
[12]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a
work page 2023
-
[13]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023 b
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 248--255, 2009
work page 2009
-
[18]
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004
work page 2004
-
[19]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems
Andrea Gesmundo and Jeff Dean. munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems. arXiv preprint arXiv:2205.10937, 2022
-
[23]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023
work page 2023
-
[24]
Google. Bard. https://bard.google.com/, 2023
work page 2023
-
[25]
Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains
Koustava Goswami, Lukas Lange, Jun Araki, and Heike Adel. Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains. arXiv preprint arXiv:2302.06868, 2023
-
[26]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017
work page 2017
-
[29]
Lukas Hedegaard, Aman Alok, Juby Jose, and Alexandros Iosifidis. Structured pruning adapters. arXiv preprint arXiv:2211.10155, 2022
-
[30]
Measuring massive multitask language understanding, 2021
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021
work page 2021
-
[31]
Parameter-efficient transfer learning for nlp
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799. PMLR, 2019
work page 2019
-
[33]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023
-
[34]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021
work page 2021
-
[35]
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pp.\ 709--727. Springer, 2022
work page 2022
-
[36]
Compacter: Efficient low-rank hypercomplex adapter layers
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34: 0 1022--1035, 2021
work page 2021
-
[37]
Unifiedqa: Crossing format boundaries with a single qa system
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. In Findings of the Association for Computational Linguistics (EMNLP), pp.\ 1896--1907, 2020
work page 1907
-
[38]
Maple: Multi-modal prompt learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. arXiv preprint arXiv:2210.03117, 2022
-
[40]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023 a
work page internal anchor Pith review arXiv 2023
-
[43]
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 5265--5275, 2020
work page 2020
-
[44]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10965--10975, 2022
work page 2022
-
[46]
Attention-guided unified network for panoptic segmentation
Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7026--7035, 2019 b
work page 2019
-
[48]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 d
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Truthfulqa: Measuring how models mimic human falsehoods, 2022
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022
work page 2022
-
[50]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.\ 740--755. Springer, 2014
work page 2014
-
[52]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023 a
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023 b
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021 a
-
[55]
arXiv preprint arXiv:2103.10385 , year=
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021 b
-
[56]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[57]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023 c
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[59]
Automated flower classification over a large number of classes
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp.\ 722--729. IEEE, 2008
work page 2008
-
[60]
OpenAI. Chatgpt. https://chat.openai.com, 2023 a
work page 2023
-
[61]
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023 b
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
Peft: State-of-the-art parameter-efficient fine-tuning methods
Sourab Mangrulkar; Sylvain Gugger; Lysandre Debut; Younes Belkada; Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022
work page 2022
-
[64]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint conference on EMNLP and CoNLL-shared task, pp.\ 1--40, 2012
work page 2012
-
[66]
E2e nlg challenge: Neural models vs
Yevgeniy Puzikov and Iryna Gurevych. E2e nlg challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation, pp.\ 463--471, 2018
work page 2018
-
[67]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019
work page 2019
-
[68]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021
work page 2021
-
[69]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21 0 (1): 0 5485--5551, 2020
work page 2020
-
[70]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[71]
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[72]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[73]
Vipergpt: Visual inference via python execution for reasoning
D \' dac Sur \' s, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023
- [74]
-
[76]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[77]
Smith, Daniel Khashabi, and Hannaneh Hajishirzi
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022 a
work page 2022
-
[78]
Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing...
work page 2022
-
[80]
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023
-
[81]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[82]
Improving visual prompt tuning for self-supervised vision transformers
Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, and Sungroh Yoon. Improving visual prompt tuning for self-supervised vision transformers. arXiv preprint arXiv:2306.05067, 2023
-
[83]
Deep modular co-attention networks for visual question answering
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 6281--6290, 2019
work page 2019
-
[84]
Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models
Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 1--9, 2022
work page 2022
-
[85]
Hellaswag: Can a machine really finish your sentence?, 2019
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019
work page 2019
-
[86]
A large-scale study of representation learning with the visual task adaptation benchmark
Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019
-
[87]
Lit: Zero-shot transfer with locked-image text tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18123--18133, 2022
work page 2022
-
[88]
Transfer visual prompt generator across llms
Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. CoRR, abs/23045.01278, 2023 a . URL https://doi.org/10.48550/arXiv.2305.01278
-
[89]
Side-tuning: a baseline for network adaptation via additive side networks
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16, pp.\ 698--714. Springer, 2020
work page 2020
-
[90]
What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models
Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Xin Wen, and Bingchen Zhao. What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023 b
work page 2023
-
[91]
Adding conditional control to text-to-image diffusion models, 2023 c
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023 c
work page 2023
-
[92]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023 d
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[97]
Zero initialization: Initializing residual networks with only zeros and ones
Jiawei Zhao, Florian Tobias Schaefer, and Anima Anandkumar. Zero initialization: Initializing residual networks with only zeros and ones. 2021
work page 2021
-
[98]
Conditional prompt learning for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16816--16825, 2022 a
work page 2022
-
[99]
Conditional prompt learning for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16816--16825, 2022 b
work page 2022
-
[100]
Learning to prompt for vision-language models
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 c
work page 2022
-
[101]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[102]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.