pith. machine review for the scientific record. sign in

arxiv: 2303.16199 · v3 · submitted 2023-03-28 · 💻 cs.CV · cs.AI· cs.CL· cs.LG· cs.MM

Recognition: no theorem link

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Authors on Pith no claims yet

Pith reviewed 2026-05-14 23:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.LGcs.MM
keywords LLaMA-Adapterparameter-efficient fine-tuninginstruction followingzero-init attentionlanguage modelsmulti-modal adaptationadapter methods
0
0 comments X

The pith

LLaMA-Adapter adapts frozen LLaMA to follow instructions using only 1.2 million added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaMA-Adapter as a lightweight way to turn the LLaMA language model into an instruction follower. It adds 1.2 million learnable parameters to the frozen 7 billion parameter model and trains on 52,000 self-instruct examples in under an hour on eight A100 GPUs. Learnable prompts are prepended at higher transformer layers, and a zero-initialized attention mechanism with zero gating blends the new cues into the model without erasing its original knowledge. This produces responses of similar quality to fully fine-tuned models such as Alpaca. The same structure also supports image-conditioned instructions and applies to fine-tuning other models like ViT and RoBERTa.

Core claim

LLaMA-Adapter prepends a set of learnable adaptation prompts to word tokens at higher transformer layers of the frozen LLaMA 7B model. A zero-initialized attention mechanism with zero gating adaptively injects the instructional cues while preserving pre-trained knowledge. After training on 52K demonstrations, the resulting model generates high-quality instruction-following responses comparable to Alpaca, which requires full fine-tuning of all parameters.

What carries the argument

Zero-initialized attention with zero gating that adaptively injects new instructional cues into higher layers of the frozen model.

Load-bearing premise

The zero-initialized attention with zero gating can selectively add instructional information without disrupting the model's pre-trained knowledge.

What would settle it

Train LLaMA-Adapter on the same 52K demonstrations and compare its responses to Alpaca's on a held-out set of instructions; if the Adapter outputs are consistently lower quality by human judgment or automatic metrics, the comparability claim fails.

read the original abstract

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces LLaMA-Adapter, a lightweight adaptation method for fine-tuning the frozen LLaMA 7B model into an instruction-following model. It uses 52K self-instruct demonstrations to train only 1.2M learnable parameters consisting of adaptation prompts prepended at higher transformer layers, combined with a zero-initialized attention mechanism and zero gating that is claimed to adaptively inject instructional cues while preserving pre-trained knowledge. Training completes in under one hour on 8 A100 GPUs. The resulting model generates responses claimed to be comparable to fully fine-tuned Alpaca, and the approach extends to multi-modal image-conditioned instructions with strong results on ScienceQA and COCO Caption; the zero-init attention is also evaluated on ViT and RoBERTa for standard vision/language tasks. Code is released.

Significance. If the central performance claims hold and the zero-init mechanism proves necessary, the work would be significant for enabling efficient, low-parameter adaptation of large pre-trained models with minimal compute, lowering barriers to instruction tuning. The dramatic reduction to 1.2M parameters, rapid training time, multi-modal extension, and cross-architecture generalization tests are concrete strengths. Reproducibility via code release further supports impact. However, significance is tempered by the need to confirm the proposed mechanism drives the gains rather than the adaptation prompts and frozen backbone alone.

major comments (1)
  1. [Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.
minor comments (2)
  1. [Abstract and Experiments (§4)] Abstract and §4: The claim of 'comparable to Alpaca with fully fine-tuned 7B parameters' should specify the exact evaluation benchmarks, metrics (e.g., GPT-4 win rates or human scores), and Alpaca baseline details for direct comparison.
  2. [Method figures] Figure 2 or method diagram: Clarify the exact placement of adaptation prompts and the mathematical formulation of the zero-gating scalar to avoid ambiguity in replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point-by-point below and have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Method (§3) and Experiments] §3 (zero-initialized attention with zero gating): The central claim that this mechanism 'adaptively injects the new instructional cues into LLaMA while effectively preserves its pre-trained knowledge' is load-bearing for both the efficiency and quality assertions, yet the experiments provide no controlled ablation (e.g., random-init attention on identical prompts, prompt-only baseline without gated attention, or non-zero gating). Without such comparisons, it remains possible that gains derive primarily from the learnable prompts and frozen LLaMA rather than the zero-init design, weakening the necessity of the proposed technique.

    Authors: We agree that the original experiments lacked direct controlled ablations isolating the contribution of zero-initialized attention and zero gating from the adaptation prompts alone. To address this, we have added new ablation studies in the revised manuscript (new Section 3.4 and Table 3). These include: (1) a prompt-only baseline without the gated attention, (2) random initialization of the attention weights instead of zero-init, and (3) non-zero gating variants. Results show that random initialization leads to training instability and ~12% lower performance on instruction-following benchmarks compared to zero-init, while the prompt-only baseline underperforms the full model by a noticeable margin. Non-zero gating also yields suboptimal results. We have updated the text in §3 to better motivate the zero-init design based on these findings, confirming it enables adaptive injection while preserving pre-trained knowledge. These additions substantiate the mechanism's necessity. revision: yes

Circularity Check

0 steps flagged

No circularity: method defined by new components and trained on external data

full rationale

The paper defines LLaMA-Adapter via learnable adaptation prompts prepended at higher layers plus a zero-initialized attention with zero gating, then trains the 1.2M parameters on 52K external self-instruct demonstrations. Reported performance (Alpaca-comparable responses, ScienceQA/COCO gains) follows from this training rather than any equation that reduces the output to the same fitted quantities by construction. No self-citations are load-bearing for the central claims, no uniqueness theorems are imported from prior author work, and no known empirical patterns are merely renamed. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the standard transformer architecture plus two new introduced elements: learnable prompts and the zero-init attention variant. No external benchmarks or proofs are invoked beyond the empirical training run.

free parameters (1)
  • learnable adaptation prompts
    1.2M parameters introduced and optimized on the 52K self-instruct demonstrations.
axioms (1)
  • domain assumption Pre-trained LLaMA weights contain useful general knowledge that should remain largely unchanged during adaptation.
    Invoked to justify the zero-initialization strategy that starts with no effect on existing representations.
invented entities (1)
  • zero-initialized attention mechanism with zero gating no independent evidence
    purpose: To allow new instructional cues to be injected gradually without initially disrupting pre-trained behavior.
    New mechanism proposed in the paper; no independent evidence outside the training results is provided.

pith-pipeline@v0.9.0 · 5576 in / 1304 out tokens · 54777 ms · 2026-05-14T23:01:25.060113+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  3. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  4. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  5. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  6. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  7. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  8. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  9. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  10. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  11. $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  12. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  13. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  14. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  15. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  16. Multimodal Chain-of-Thought Reasoning in Language Models

    cs.CL 2023-02 accept novelty 6.0

    Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

  17. Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection

    cs.AI 2026-05 unverdicted novelty 5.0

    Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and...

  18. MAny: Merge Anything for Multimodal Continual Instruction Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.

  19. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  20. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  21. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  22. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  23. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  24. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

278 extracted references · 278 canonical work pages · cited by 23 Pith papers · 38 internal anchors

  1. [1]

    https://github.com/tloen/alpaca-lora, 2023

    Alpaca-lora. https://github.com/tloen/alpaca-lora, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 0 23716--23736, 2022

  3. [4]

    Open llm leaderboard

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, and Lewis Tunstalland Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  4. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  5. [6]

    Introduction to the conll-2004 shared task: Semantic role labeling

    Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2004 shared task: Semantic role labeling. In Proceedings of the eighth conference on computational natural language learning (CoNLL-2004) at HLT-NAACL 2004, pp.\ 89--97, 2004

  6. [7]

    Introduction to the conll-2005 shared task: Semantic role labeling

    Xavier Carreras and Llu \' s M \`a rquez. Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005), pp.\ 152--164, 2005

  7. [8]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ https://lmsys.org/blog/2023-03-30-vicuna/, March 2023

  8. [9]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  9. [10]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  10. [11]

    Using lora for efficient stable diffusion fine-tuning

    Pedro Cuenca and Sayak Paul. Using lora for efficient stable diffusion fine-tuning. https://huggingface.co/blog/lora, January 2023

  11. [12]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023 a

  12. [13]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv, abs/2305.06500, 2023 b

  13. [15]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 248--255, 2009

  14. [18]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004

  15. [19]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  16. [22]

    munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems

    Andrea Gesmundo and Jeff Dean. munet: Evolving pretrained deep neural networks into scalable auto-tuning multitask systems. arXiv preprint arXiv:2205.10937, 2022

  17. [23]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023

  18. [24]

    Google. Bard. https://bard.google.com/, 2023

  19. [25]

    Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains

    Koustava Goswami, Lukas Lange, Jun Araki, and Heike Adel. Switchprompt: Learning domain-specific gated soft prompts for classification in low-resource domains. arXiv preprint arXiv:2302.06868, 2023

  20. [26]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

  21. [29]

    Structured pruning adapters

    Lukas Hedegaard, Aman Alok, Juby Jose, and Alexandros Iosifidis. Structured pruning adapters. arXiv preprint arXiv:2211.10155, 2022

  22. [30]

    Measuring massive multitask language understanding, 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

  23. [31]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799. PMLR, 2019

  24. [33]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  25. [34]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

  26. [35]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pp.\ 709--727. Springer, 2022

  27. [36]

    Compacter: Efficient low-rank hypercomplex adapter layers

    Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34: 0 1022--1035, 2021

  28. [37]

    Unifiedqa: Crossing format boundaries with a single qa system

    Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. In Findings of the Association for Computational Linguistics (EMNLP), pp.\ 1896--1907, 2020

  29. [38]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. arXiv preprint arXiv:2210.03117, 2022

  30. [40]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023 a

  31. [43]

    What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 5265--5275, 2020

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 5265--5275, 2020

  32. [44]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10965--10975, 2022

  33. [46]

    Attention-guided unified network for panoptic segmentation

    Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 7026--7035, 2019 b

  34. [48]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 d

  35. [49]

    Truthfulqa: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022

  36. [50]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.\ 740--755. Springer, 2014

  37. [52]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023 a

  38. [53]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023 b

  39. [54]

    CoRR , volume =

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021 a

  40. [55]

    arXiv preprint arXiv:2103.10385 , year=

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021 b

  41. [56]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  42. [57]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023 c

  43. [58]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  44. [59]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp.\ 722--729. IEEE, 2008

  45. [60]

    OpenAI. Chatgpt. https://chat.openai.com, 2023 a

  46. [61]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023 b

  47. [63]

    Peft: State-of-the-art parameter-efficient fine-tuning methods

    Sourab Mangrulkar; Sylvain Gugger; Lysandre Debut; Younes Belkada; Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

  48. [64]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

  49. [65]

    Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes

    Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint conference on EMNLP and CoNLL-shared task, pp.\ 1--40, 2012

  50. [66]

    E2e nlg challenge: Neural models vs

    Yevgeniy Puzikov and Iryna Gurevych. E2e nlg challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation, pp.\ 463--471, 2018

  51. [67]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  52. [68]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  53. [69]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21 0 (1): 0 5485--5551, 2020

  54. [70]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  55. [71]

    Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

    Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003

  56. [72]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  57. [73]

    Vipergpt: Visual inference via python execution for reasoning

    D \' dac Sur \' s, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023

  58. [74]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  59. [76]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  60. [77]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022 a

  61. [78]

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing...

  62. [80]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023

  63. [81]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  64. [82]

    Improving visual prompt tuning for self-supervised vision transformers

    Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, and Sungroh Yoon. Improving visual prompt tuning for self-supervised vision transformers. arXiv preprint arXiv:2306.05067, 2023

  65. [83]

    Deep modular co-attention networks for visual question answering

    Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 6281--6290, 2019

  66. [84]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 1--9, 2022

  67. [85]

    Hellaswag: Can a machine really finish your sentence?, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019

  68. [86]

    A large-scale study of representation learning with the visual task adaptation benchmark

    Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

  69. [87]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18123--18133, 2022

  70. [88]

    Transfer visual prompt generator across llms

    Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. CoRR, abs/23045.01278, 2023 a . URL https://doi.org/10.48550/arXiv.2305.01278

  71. [89]

    Side-tuning: a baseline for network adaptation via additive side networks

    Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16, pp.\ 698--714. Springer, 2020

  72. [90]

    What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models

    Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Xin Wen, and Bingchen Zhao. What if the tv was off? examining counterfactual reasoning abilities of multi-modal language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023 b

  73. [91]

    Adding conditional control to text-to-image diffusion models, 2023 c

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023 c

  74. [92]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023 d

  75. [97]

    Zero initialization: Initializing residual networks with only zeros and ones

    Jiawei Zhao, Florian Tobias Schaefer, and Anima Anandkumar. Zero initialization: Initializing residual networks with only zeros and ones. 2021

  76. [98]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16816--16825, 2022 a

  77. [99]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16816--16825, 2022 b

  78. [100]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 c

  79. [101]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  80. [102]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

Showing first 80 references.