MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation

Haniyeh Ehsani Oskouie; Mahnoosh Alizadeh; Ramtin Pedarsani; Sajjad Ghiasvand

arxiv: 2602.21397 · v2 · pith:6DENYAWHnew · submitted 2026-02-24 · 💻 cs.CV · cs.LG

Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani

classification 💻 cs.CV cs.LG

keywords mmlop · low-rank · methods · parameters · prompts · multi-modal · prompt · prompting


Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization that constrains prompts to a compact subspace, providing parameter efficiency while motivating the need for our complementary regularization components. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization. Code is available at https://github.com/sajjad-ucsb/MMLoP.
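
The low-rank parameterization and drift correction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that each layer's prompt is the product of a per-layer down factor and a single up-projection shared by both modalities, that both encoders use the same width, and that "uniform drift" means the average shift of the class embeddings away from their zero-shot values. All function names and the toy sizes are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def build_prompts(down_factors, shared_up):
    """Return one (n_tokens x width) prompt per layer as down_factor @ shared_up.

    Assumption: shared_up (rank x width) is reused by every layer and modality.
    """
    return [matmul(down, shared_up) for down in down_factors]


def full_prompt_params(n_modalities, n_layers, n_tokens, width):
    """Trainable scalars when every prompt token is learned directly."""
    return n_modalities * n_layers * n_tokens * width


def low_rank_prompt_params(n_modalities, n_layers, n_tokens, width, rank):
    """Trainable scalars for per-layer down factors plus one shared up-projection."""
    down = n_modalities * n_layers * n_tokens * rank
    shared_up = rank * width
    return down + shared_up


def remove_uniform_drift(prompted, zero_shot):
    """Subtract the mean shift (prompted - zero_shot) from every prompted embedding.

    Assumption: one plausible reading of "uniform drift correction".
    """
    n = len(prompted)
    dim = len(prompted[0])
    shift = [
        sum(p[j] - z[j] for p, z in zip(prompted, zero_shot)) / n
        for j in range(dim)
    ]
    return [[p[j] - shift[j] for j in range(dim)] for p in prompted]


# Toy sizes chosen for readability, not taken from the paper.
print(full_prompt_params(2, 3, 4, 8))         # 192
print(low_rank_prompt_params(2, 3, 4, 8, 2))  # 64
```

In this toy setting the factorized form needs a third of the parameters of directly learned prompts, and the saving grows as the encoder width increases relative to the rank. The sketch makes no attempt to reproduce the 11.5K figure reported above, since the actual ranks, prompt lengths, and layer counts are not given in the abstract.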

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models
cs.CL · 2026-06 · unverdicted · novelty 5.0

MLLMs generate verbose, comprehensive, and repetitive aesthetic critiques unlike selective human ones, and reference-based metrics fail to detect this because they capture model house style instead of image-specific content.