arxiv: 2502.04501 · v3 · submitted 2025-02-06 · 💻 cs.CL

Ultra-Low-Dimensional Prompt Tuning via Random Projection

Pith reviewed 2026-05-23 03:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords prompt tuning · parameter-efficient fine-tuning · random projection · low-dimensional optimization · large language models · natural language processing

The pith

Prompts learned in 2D space and lifted by a frozen random matrix match full prompt tuning performance with 98 percent fewer trainable parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that learns prompt vectors in an extremely low-dimensional space rather than tying them to the full hidden size of the language model. A single fixed random matrix then projects these short vectors up to the required embedding dimension. This yields a 98 percent drop in the number of parameters that must be trained and stored. Experiments across more than twenty NLP tasks show accuracy stays comparable to standard prompt tuning and exceeds other recent efficient-tuning baselines that still use more parameters.

Core claim

ULPT optimizes prompt embeddings inside a low-dimensional space such as two dimensions and multiplies them by a frozen random matrix to reach the model's hidden dimension, thereby cutting trainable parameters by 98 percent while preserving downstream performance on more than twenty NLP tasks.

What carries the argument

Frozen random up-projection matrix that maps low-dimensional prompt vectors to the model's full hidden dimensionality.
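
The up-projection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes (100 prompt tokens, hidden size 768), the Gaussian draw and its scaling, the variable names, and the order in which scale and shift are applied are all choices of this sketch. The page itself only states that low-dimensional vectors are multiplied by a frozen random matrix and, in the figure captions, that learnable shift and scale embeddings are added.

```python
import numpy as np

def lift_prompt(low_dim, projection, shift, scale):
    """Map (n_tokens, rank) vectors to (n_tokens, hidden_size) prompt embeddings."""
    return (low_dim @ projection) * scale + shift

n_tokens, hidden_size, rank = 100, 768, 2   # assumed sizes, not from the page
rng = np.random.default_rng(0)

# Frozen: drawn once and never updated (the 1/sqrt(hidden_size) scaling is an assumption).
projection = rng.standard_normal((rank, hidden_size)) / np.sqrt(hidden_size)

# Trainable: the low-dimensional prompt plus one shift and one scale vector.
low_dim = 0.01 * rng.standard_normal((n_tokens, rank))
shift = np.zeros(hidden_size)
scale = np.ones(hidden_size)

prompt = lift_prompt(low_dim, projection, shift, scale)

# With shift at zero and scale at one, every prompt row lies in a rank-2 subspace.
lifted_rank = np.linalg.matrix_rank(low_dim @ projection)
print(prompt.shape, lifted_rank)
```

Only `low_dim`, `shift`, and `scale` would receive gradients in training; `projection` stays fixed, which is why it adds nothing to the per-task parameter count.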

If this is right

  • ULPT requires far fewer trainable parameters than other recent parameter-efficient tuning techniques.
  • Performance on more than twenty NLP tasks remains comparable to vanilla prompt tuning.
  • The approach enables storage of many more task-specific prompts for the same memory budget.
  • Prompt optimization becomes feasible in spaces as small as two dimensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The storage reduction could let a single device hold separate prompts for thousands of downstream tasks.
  • Similar random-projection compression might be applied to other adapter-based tuning methods.
  • Task-specific prompts could be transmitted or updated with very small communication cost.
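
One way to read the storage and transmission points above: if the frozen matrix is regenerated from a seed, only the seed and the small trainable tensors need to be kept per task. The sketch below assumes float32 storage, 100 prompt tokens, hidden size 768, and a shift and scale vector per task; the function names and the seed value are hypothetical.

```python
import numpy as np

def make_projection(seed, rank, hidden_size):
    """Rebuild the frozen projection from a seed instead of storing it."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((rank, hidden_size)).astype(np.float32)

def task_payload_bytes(n_tokens, rank, hidden_size, dtype=np.float32):
    """Bytes per task: low-dimensional vectors plus one shift and one scale vector."""
    itemsize = np.dtype(dtype).itemsize
    return (n_tokens * rank + 2 * hidden_size) * itemsize

seed, n_tokens, rank, hidden_size = 1234, 100, 2, 768   # assumed values

# Two independent rebuilds from the same seed give identical matrices.
same = np.array_equal(
    make_projection(seed, rank, hidden_size),
    make_projection(seed, rank, hidden_size),
)

ulpt_bytes = task_payload_bytes(n_tokens, rank, hidden_size)
vanilla_bytes = n_tokens * hidden_size * np.dtype(np.float32).itemsize

# How many task prompts fit in 1 GiB under each scheme.
tasks_per_gib_ulpt = (1 << 30) // ulpt_bytes
tasks_per_gib_vanilla = (1 << 30) // vanilla_bytes
print(same, ulpt_bytes, vanilla_bytes, tasks_per_gib_ulpt, tasks_per_gib_vanilla)
```

Seed-based regeneration is deterministic within one NumPy installation; whether the same stream is produced across library versions is something a deployment would need to pin down rather than assume.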

Load-bearing premise

A fixed random matrix from the low-dimensional prompt space to the model's hidden dimension keeps enough task information for performance to stay close to full prompt tuning.

What would settle it

A controlled test on additional tasks in which ULPT accuracy falls markedly below standard prompt tuning while using the same low dimension would show the random projection loses critical information.

Figures

Figures reproduced from arXiv: 2502.04501 by Lili Mou, Yongchang Hao, Zijun Wu.

Figure 1: Overview of our approach. (a) ULPT up-projects ultra-low-dimensional embeddings with a random but fixed matrix. (b) ULPT can significantly reduce parameters storage for LLMs customization.
Figure 2: Distribution of prompt embedding values over …
Figure 3: Results with controlled numbers of trainable …
Figure 5: Left: Training loss curves comparing ULPT with no alignment (dotted), with learnable shift only (dashed), and with both shift and scale (solid). Right: Evaluation accuracy curves for ULPT at r = 2. Adding shift significantly improves optimization and accuracy, while adding scale yields further gains. Trends are consistent across ranks.
Figure 6: Left: Shift embeddings learned with different ranks are highly similar, suggesting a general alignment role. Right: Scale embeddings vary significantly, indicating their dependence on frozen random projections.
Figure 7: Distribution of randomly selected dimensions …
Original abstract

Large language models achieve state-of-the-art performance but are increasingly costly to fine-tune. Prompt tuning is a parameter-efficient fine-tuning method that addresses parameter-efficiency by learning prompt embeddings, but these embeddings are typically tied to the model's hidden dimensionality, limiting parameter saving. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), a simple yet effective method that optimizes prompts in a low-dimensional space (e.g., 2D) and uses a frozen random matrix for up-projection. ULPT can achieve 98% reduction in the training parameters compared to vanilla prompt tuning while preserving performance. Our extensive experiments across over 20 NLP tasks demonstrate that ULPT consistently outperforms recent parameter-efficient tuning methods using significantly fewer parameters, making it well-suited as a storage-efficient framework for massive LLM customization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Ultra-Low-Dimensional Prompt Tuning (ULPT), which learns prompt vectors in a low-dimensional space (e.g., dimension 2) and maps them to the LLM hidden dimension via a fixed random projection matrix. It claims this yields a 98% reduction in trainable parameters relative to standard prompt tuning while matching or exceeding performance on more than 20 NLP tasks and outperforming other PEFT baselines.

Significance. If the empirical results are robust, the work would be significant for storage-efficient LLM adaptation, as the drastic parameter reduction enables maintaining large numbers of task-specific prompts. The simplicity of the fixed random up-projection is a methodological strength, and the scale of evaluation across 20+ tasks provides a broad empirical test.

major comments (3)
  1. [§3] §3 (Method): The central modeling assumption—that a fixed random 2D subspace suffices for near-optimal prompts on arbitrary tasks—is stated without a supporting probabilistic argument, Johnson-Lindenstrauss-style bound in the up-projection direction, or comparison to a data-driven basis; this assumption is load-bearing for the 98% reduction claim.
  2. [§4] §4 (Experiments): Results are reported for a single random projection matrix per task with no ablation over multiple random seeds or variance statistics; without this, it is impossible to determine whether reported gains are stable or depend on fortunate random draws.
  3. [Table 2] Table 2 (main results): ULPT is compared only against other PEFT methods but not against a learned low-rank projection or PCA-based basis of the same dimension; this omission leaves open whether randomness itself is essential or merely convenient.
minor comments (2)
  1. [Abstract] The abstract states 'consistently outperforms' but the text does not specify the exact statistical test or multiple-comparison correction used across 20+ tasks.
  2. [§3, §4] Notation for the random matrix R (shape and initialization) is introduced in §3 but not restated when results are discussed in §4, making cross-reference cumbersome.
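
The reporting protocol asked for in major comments 2 and 3 (several independent random matrices, mean and standard deviation, and a data-driven basis of the same dimension) can be illustrated on a synthetic problem. The sketch below is only a stand-in: it measures how well a random matrix can be reconstructed inside a rank-2 subspace, which is not the paper's experiment and says nothing about downstream accuracy. All function names, sizes, and seeds are assumptions of this sketch.

```python
import numpy as np

def subspace_fit_error(target, basis):
    """Relative error of the best least-squares fit of target rows in span(basis).

    target: (n, hidden_size) matrix to approximate; basis: (rank, hidden_size) rows.
    """
    # Solve min_Z ||Z @ basis - target||_F through the transposed system.
    coeffs, *_ = np.linalg.lstsq(basis.T, target.T, rcond=None)
    residual = target - coeffs.T @ basis
    return np.linalg.norm(residual) / np.linalg.norm(target)

def seed_sweep(target, rank, seeds):
    """Repeat the fit with independent random bases; return mean and sample std."""
    hidden_size = target.shape[1]
    errors = [
        subspace_fit_error(
            target, np.random.default_rng(s).standard_normal((rank, hidden_size))
        )
        for s in seeds
    ]
    return float(np.mean(errors)), float(np.std(errors, ddof=1))

def pca_basis(target, rank):
    """Top right singular vectors of the (uncentered) target, as a data-driven basis."""
    _, _, vt = np.linalg.svd(target, full_matrices=False)
    return vt[:rank]

# Synthetic stand-in for a trained prompt: 100 tokens, hidden size 768.
target = np.random.default_rng(0).standard_normal((100, 768))
mean_err, std_err = seed_sweep(target, rank=2, seeds=range(5))
pca_err = subspace_fit_error(target, pca_basis(target, 2))
print(f"random bases: {mean_err:.4f} +/- {std_err:.4f}, PCA basis: {pca_err:.4f}")
```

In a real study, `subspace_fit_error` would be replaced by a full training and evaluation run per seed; the point of the sketch is the shape of the report, a mean with a spread next to a same-rank data-driven baseline.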

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central modeling assumption—that a fixed random 2D subspace suffices for near-optimal prompts on arbitrary tasks—is stated without a supporting probabilistic argument, Johnson-Lindenstrauss-style bound in the up-projection direction, or comparison to a data-driven basis; this assumption is load-bearing for the 98% reduction claim.

    Authors: We acknowledge the value of a formal bound. The standard Johnson-Lindenstrauss lemma already guarantees that random projections approximately preserve distances and norms when mapping from high to low (or vice versa) dimensions, with a target dimension logarithmic in the number of points being embedded. Our method relies on this known property for the up-projection step. We will add a short discussion paragraph in §3 explicitly connecting ULPT to the JL lemma and clarifying that the sufficiency claim is primarily empirical, supported by results across more than 20 tasks. A data-driven basis comparison is not included because it would require per-task storage of the basis vectors, undermining the storage-efficiency goal that enables the 98% reduction. revision: partial

  2. Referee: [§4] §4 (Experiments): Results are reported for a single random projection matrix per task with no ablation over multiple random seeds or variance statistics; without this, it is impossible to determine whether reported gains are stable or depend on fortunate random draws.

    Authors: We agree that reporting variance over random seeds strengthens the empirical claims. In the revised version we will rerun the main experiments on a representative subset of tasks using at least five independent random projection matrices per task and report mean performance together with standard deviation. revision: yes

  3. Referee: [Table 2] Table 2 (main results): ULPT is compared only against other PEFT methods but not against a learned low-rank projection or PCA-based basis of the same dimension; this omission leaves open whether randomness itself is essential or merely convenient.

    Authors: We maintain that the relevant baselines are existing PEFT methods, as these are the methods practitioners would otherwise use. The core advantage of the fixed random projection is that it incurs zero additional per-task storage for the up-projection matrix itself; any learned or PCA-derived basis would need to be stored (or recomputed) per task, eroding the storage benefit that allows maintaining thousands of task-specific prompts. Randomness is therefore not merely convenient but essential to the storage-efficiency claim. Accordingly, we do not plan to add such comparisons. revision: no
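
The first response appeals to norm preservation in the up-projection direction. A small numeric check of that property, for a Gaussian matrix with entries of variance 1/hidden_size, might look as follows. This is a sketch under assumptions (Gaussian entries, rank 2, hidden size 768, 1000 random test points), and it is not a proof: it checks that lengths survive the lift, not that the lifted subspace contains good prompts.

```python
import numpy as np

def norm_ratios(rank, hidden_size, n_points, seed):
    """Ratio ||x P|| / ||x|| for random x in R^rank under a Gaussian up-projection P."""
    rng = np.random.default_rng(seed)
    # Entries ~ N(0, 1/hidden_size), so squared norms are preserved in expectation.
    projection = rng.standard_normal((rank, hidden_size)) / np.sqrt(hidden_size)
    points = rng.standard_normal((n_points, rank))
    lifted = points @ projection
    return np.linalg.norm(lifted, axis=1) / np.linalg.norm(points, axis=1)

ratios = norm_ratios(rank=2, hidden_size=768, n_points=1000, seed=0)
print(ratios.min(), ratios.max())
```

Because the map is linear and the source space is only two-dimensional, every ratio is governed by the same 2 × 2 Gram matrix of the projection, so the check amounts to asking how close that matrix is to the identity.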

Circularity Check

0 steps flagged

No circularity: empirical proposal with no self-referential derivation

full rationale

The paper introduces ULPT as a direct empirical method: optimize a low-dimensional prompt vector and up-project via a fixed random matrix. No equations, theorems, or claims reduce the performance result to a fitted quantity defined by the method itself, nor rely on self-citation chains for load-bearing uniqueness or ansatzes. The central claim rests on experimental results across tasks rather than any closed-loop derivation. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the untested premise that random projection from 2D preserves task performance; no free parameters are explicitly fitted beyond the low dimension choice itself.

free parameters (1)
  • prompt dimension d
    Chosen as a small integer (example 2) to achieve the reported parameter reduction; its value directly controls the claimed savings.
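
How the low dimension controls the savings can be worked out directly. The sketch below assumes 100 prompt tokens, hidden size 768, and one shift plus one scale vector of hidden size each; none of these counts is stated on this page. It calls the ledger's "prompt dimension d" `rank`, matching the r used in the figure captions, and reserves `hidden_size` for the model dimension.

```python
def ulpt_param_count(n_tokens, rank, hidden_size, with_shift_scale=True):
    """Trainable parameters: n_tokens * rank entries, plus optional shift and scale."""
    extra = 2 * hidden_size if with_shift_scale else 0
    return n_tokens * rank + extra

n_tokens, hidden_size = 100, 768      # assumed sizes
vanilla = n_tokens * hidden_size      # vanilla prompt tuning trains every entry

# Ranks 2, 16, 64 and 256 are the ones named in the figure captions.
savings = {
    rank: 1 - ulpt_param_count(n_tokens, rank, hidden_size) / vanilla
    for rank in (2, 16, 64, 256)
}
for rank, saved in savings.items():
    print(f"rank={rank:>3}: {saved:.1%} fewer trainable parameters")
```

Under these assumed sizes, rank 2 gives 1,736 trainable parameters against 76,800, roughly a 97.7 percent reduction, which is in line with the 98 percent figure quoted above; the saving shrinks as the rank grows.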


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 4 internal anchors

  1. [1]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319--7328, 2021. URL https://aclanthology.org/2...

  2. [2]

    ATTEMPT : Parameter-efficient multi-task tuning via attentional mixtures of soft prompts

    Akari Asai, Mohammadreza Salehi, Matthew Peters, and Hannaneh Hajishirzi. ATTEMPT : Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6655--6672, 2022. URL https://aclanthology.org/2022.emnlp-main.446

  3. [3]

    B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. B it F it: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 1--9, 2022. URL https://aclanthology.org/2022.acl-short.1

  4. [4]

    Random projection in dimensionality reduction: applications to image and text data

    Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 245–250, 2001. URL https://doi.org/10.1145/502512.502546

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  6. [6]

    S em E val-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, I \ n igo Lopez-Gazpio, and Lucia Specia. S em E val-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 1--14, 2017. URL https://aclanthology.org/S17-2001/

  7. [7]

    SM o P : Towards efficient and effective prompt tuning with sparse mixture-of-prompts

    Joon-Young Choi, Junho Kim, Jun-Hyung Park, Wing-Lam Mok, and SangKeun Lee. SM o P : Towards efficient and effective prompt tuning with sparse mixture-of-prompts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14306--14316, 2023. URL https://aclanthology.org/2023.emnlp-main.884

  8. [8]

    B ool Q : Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. B ool Q : Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 2924--2936, 2019. URL ...

  9. [9]

    An elementary proof of a theorem of J ohnson and L indenstrauss

    Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of J ohnson and L indenstrauss. Random Structures & Algorithms, 22 0 (1): 0 60--65, 2003. URL https://doi.org/10.1002/rsa.10073

  10. [10]

    The commitmentbank: Investigating projection in naturally occurring discourse

    Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The commitmentbank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107--124, 2019. URL https://semanticsarchive.net/Archive/Tg3ZGI2M/Marneffe.pdf

  11. [11]

    Transforming Question Answering Datasets Into Natural Language Inference Datasets

    Dorottya Demszky, Kelvin Guu, and Percy Liang. Transforming question answering datasets into natural language inference datasets, 2018. URL https://arxiv.org/abs/1809.02922

  12. [12]

    Dolan and Chris Brockett

    William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, 2005. URL https://aclanthology.org/I05-5002/

  13. [13]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107--1128, 2024. URL https://aclanthology.org/2024.emnlp-main.64

  14. [14]

    SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

    Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Search QA : A new Q & A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017. URL https://arxiv.org/abs/1704.05179

  15. [15]

    MRQA 2019 shared task: Evaluating generalization in reading comprehension

    Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1--13, 2019. URL https://aclanthology.org/D19-5801

  16. [16]

    The third PASCAL recognizing textual entailment challenge

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL - PASCAL Workshop on Textual Entailment and Paraphrasing , pages 1--9, 2007. URL https://aclanthology.org/W07-1401

  17. [17]

    Lo PT : Low-rank prompt tuning for parameter efficient language models, 2024

    Shouchang Guo, Sonam Damani, and Keng hao Chang. Lo PT : Low-rank prompt tuning for parameter efficient language models, 2024. URL https://arxiv.org/abs/2406.19486

  18. [18]

    Flora: Low-rank adapters are secretly gradient compressors

    Yongchang Hao, Yanshuai Cao, and Lili Mou. Flora: Low-rank adapters are secretly gradient compressors. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://proceedings.mlr.press/v235/hao24a.html

  19. [19]

    Lo RA +: Efficient low rank adaptation of large models

    Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lo RA +: Efficient low rank adaptation of large models. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://proceedings.mlr.press/v235/hayou24a.html

  20. [20]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP . In Proceedings of the 36th International Conference on Machine Learning, pages 2790--2799, 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html

  21. [21]

    Lo RA : Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  22. [22]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604--613, 1998. URL https://dl.acm.org/doi/10.1145/276698.276876

  23. [23]

    Hyperdecoders: Instance-specific decoders for multi-task NLP

    Hamish Ivison and Matthew Peters. Hyperdecoders: Instance-specific decoders for multi-task NLP . In Findings of the Association for Computational Linguistics: EMNLP, pages 1715--1730, 2022. URL https://aclanthology.org/2022.findings-emnlp.124

  24. [24]

    Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks

    Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 565--576, 2021. UR...

  25. [25]

    Looking beyond the surface: A challenge set for reading comprehension over multiple sentences

    Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 252--262, 2018. URL https:...

  26. [26]

    Scitail: A textual entailment dataset from science question answering

    Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 2018. URL https://ojs.aaai.org/index.php/AAAI/article/view/12022

  27. [27]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, pages 22199--22213, 2022. URL https://openreview.net/pdf?id=e2TBb5y0yFf

  28. [28]

    Ve RA : Vector-based random matrix adaptation

    Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Ve RA : Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A

  29. [29]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  30. [30]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045--3059, 2021. URL https://aclanthology.org/2021.emnlp-main.243

  31. [31]

    The winograd schema challenge

    Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Preceddings of the 13th International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf

  32. [32]

    Prefix- T uning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix- T uning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582--4597, 2021. URL https://aclanthology.org/2021.acl-long.353

  33. [33]

    Relo RA : High-rank training through low-rank updates

    Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. Relo RA : High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=DLJznSp6X3

  34. [34]

    P- T uning: Prompt tuning can be comparable to fine-tuning across scales and tasks

    Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P- T uning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 61--68, 2022. URL https://aclanthology.org/2022.acl-short.8

  35. [35]

    GPT understands, too

    Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 5: 0 208--215, 2024. URL https://www.sciencedirect.com/science/article/pii/S2666651023000141

  36. [36]

    PEFT : State-of-the-art parameter-efficient fine-tuning methods

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT : State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

  37. [37]

    On variants of the J ohnson-- L indenstrauss lemma

    Ji r \' Matou s ek. On variants of the J ohnson-- L indenstrauss lemma. Random Structures & Algorithms, 33 0 (2): 0 142--156, 2008. URL https://doi.org/10.1002/rsa.20218

  38. [38]

    Crosslingual generalization through multitask finetuning

    Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual generalization through multitask finetuning...

  39. [39]

    Prompting a pretrained transformer can be a universal approximator

    Aleksandar Petrov, Philip Torr, and Adel Bibi. Prompting a pretrained transformer can be a universal approximator. In Proceedings of the 41st International Conference on Machine Learning, 2024 a . URL https://proceedings.mlr.press/v235/petrov24a.html

  40. [40]

    When do prompting and prefix-tuning work? a theory of capabilities and limitations

    Aleksandar Petrov, Philip Torr, and Adel Bibi. When do prompting and prefix-tuning work? a theory of capabilities and limitations. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=JewzobRhay

  41. [41]

    W i C : The word-in-context dataset for evaluating context-sensitive meaning representations

    Mohammad Taher Pilehvar and Jose Camacho-Collados. W i C : The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , 2019. URL https://aclanthology.org/N19-1128

  42. [42]

    Exploring universal intrinsic task subspace via prompt tuning, 2022

    Yujia Qin, Xiaozhi Wang, Yusheng Su, Yankai Lin, Ning Ding, Jing Yi, Weize Chen, Zhiyuan Liu, Juanzi Li, Lei Hou, Peng Li, Maosong Sun, and Jie Zhou. Exploring universal intrinsic task subspace via prompt tuning, 2022. URL https://arxiv.org/abs/2110.07867

  43. [43]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, pages 1--67, 2020. URL https://jmlr.org/papers/v21/20-074.html

  44. [44]

    Residual P rompt T uning: improving prompt tuning with residual reparameterization

    Anastasiia Razdaibiedina, Yuning Mao, Madian Khabsa, Mike Lewis, Rui Hou, Jimmy Ba, and Amjad Almahairi. Residual P rompt T uning: improving prompt tuning with residual reparameterization. In Findings of the Association for Computational Linguistics: ACL, pages 6740--6757, 2023. URL https://aclanthology.org/2023.findings-acl.421

  45. [45]

    AdapterDrop : O n the efficiency of adapters in transformers

    Andreas R \"u ckl \'e , Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. AdapterDrop : O n the efficiency of adapters in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. URL https://aclanthology.org/2021.emnlp-main.626

  46. [46]

    Wino G rande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Wino G rande: An adversarial winograd schema challenge at scale. Communications of the ACM, page 99–106, 2021. URL https://doi.org/10.1145/3474381

  47. [47]

    De PT : Decomposed prompt tuning for parameter-efficient fine-tuning

    Zhengxiang Shi and Aldo Lipani. De PT : Decomposed prompt tuning for parameter-efficient fine-tuning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KjegfPGRde

  48. [48]

    Logan IV, Eric Wallace, and Sameer Singh

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. A uto P rompt: E liciting K nowledge from L anguage M odels with A utomatically G enerated P rompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4222--4235, 2020. URL https://aclanthology.org/2020.emnlp-main.346

  49. [49]

    Manning, Andrew Ng, and Christopher Potts

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642, 2013. URL https://aclanthology.org/D13-1170/

  50. [50]

    LST : Ladder side-tuning for parameter and memory efficient transfer learning

    Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. LST : Ladder side-tuning for parameter and memory efficient transfer learning. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=isPnnaTZaP5

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. URL https://arxiv.org/abs/2302.13971

  52. [52]

    N ews QA : A machine comprehension dataset

    Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. N ews QA : A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP , pages 191--200, 2017. URL https://aclanthology.org/W17-2623

  53. [53]

    SP o T : Better frozen model adaptation through soft prompt transfer

    Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou ' , and Daniel Cer. SP o T : Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 5039--5059, 2022. URL https://aclanthology.org/2022.acl-long.346

  54. [54]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353--355, 2018. URL https://aclanthology.org/W18-5446

  55. [55]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019. URL http://arxiv.org/abs/1905.00537

  56. [56]

    Multitask prompt tuning enables parameter-efficient transfer learning

    Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, and Yoon Kim. Multitask prompt tuning enables parameter-efficient transfer learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Nk2pDtuhTq

  57. [57]

    Learning to prompt for continual learning

    Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139--149, June 2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Learnin...

  58. [58]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 2019. URL https://aclanthology.org/Q19-1040

  59. [59]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR

  60. [60]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, pages 24824--24837, 2022b. URL https://openreview.net/pdf?id=_VjQlMeSB_J

  61. [61]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112--1122, 2018. URL https://aclanthology.org/N18-1101/

  62. [62]

    Mixture of LoRA experts

    Xun Wu, Shaohan Huang, and Furu Wei. Mixture of LoRA experts. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=uWvKBCYh4S

  63. [63]

    ReFT: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. URL https://openreview.net/forum?id=fykjplMc0V

  64. [64]

    Zero-shot continuous prompt transfer: Generalizing task semantics across language models

    Zijun Wu, Yongkang Wu, and Lili Mou. Zero-shot continuous prompt transfer: Generalizing task semantics across language models. In The Twelfth International Conference on Learning Representations, 2024c. URL https://openreview.net/forum?id=26XphugOcS

  65. [65]

    Decomposed prompt tuning via low-rank reparameterization

    Yao Xiao, Lu Xu, Jiaxi Li, Wei Lu, and Xiaoli Li. Decomposed prompt tuning via low-rank reparameterization. In Findings of the Association for Computational Linguistics: EMNLP, pages 13335--13347, 2023. URL https://aclanthology.org/2023.findings-emnlp.890

  66. [66]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, 2018. URL https://aclanthology.org/D18-1259/

  67. [67]

    LoFiT: Localized fine-tuning on LLM representations

    Fangcong Yin, Xi Ye, and Greg Durrett. LoFiT: Localized fine-tuning on LLM representations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=dfiXFbECSZ

  68. [68]

    Character-level convolutional networks for text classification

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 649--657, 2015. URL https://proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf

  69. [69]

    PAWS: Paraphrase adversaries from word scrambling

    Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1298--1308, 2019. URL https://aclanthology.org/N19-1131

  70. [70]

    Tuning layernorm in attention: Towards efficient multi-modal LLM finetuning

    Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. Tuning layernorm in attention: Towards efficient multi-modal LLM finetuning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=YR3ETaElNK