The Power of Scale for Parameter-Efficient Prompt Tuning
Pith reviewed 2026-05-11 16:27 UTC · model grok-4.3
The pith
As models grow to billions of parameters, learning a small set of soft prompts matches the performance of tuning all model weights while keeping the base model frozen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompt tuning closes the gap with model tuning at large scales: as T5 models exceed billions of parameters, learned soft prompts achieve performance comparable to tuning all model weights, while remaining far more parameter-efficient and enabling the same frozen model to serve multiple downstream tasks.
What carries the argument
Soft prompts: a small set of continuous, trainable vectors optimized via gradient descent to condition the input of a frozen language model.
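To make the mechanism concrete, here is a minimal sketch in plain Python. It is not the paper's T5 setup: it assumes a hypothetical toy "frozen model" (a random embedding table plus a linear readout over mean-pooled inputs, invented for illustration) and trains only the prepended prompt vectors by gradient descent on a logistic loss. The names make_frozen_model, init_prompt, predict, mean_loss, and tune_prompt are illustrative and do not come from the paper.

```python
import math
import random

def make_frozen_model(vocab_size, dim, seed=0):
    """Toy stand-in for a pre-trained model: embedding table + readout vector."""
    rng = random.Random(seed)
    embed = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    readout = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return embed, readout

def init_prompt(prompt_len, dim, seed=1):
    """Soft prompt: prompt_len trainable vectors of width dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(prompt_len)]

def predict(prompt, token_ids, embed, readout):
    """Prepend the prompt to the token embeddings, mean-pool, apply readout + sigmoid."""
    rows = prompt + [embed[t] for t in token_ids]
    pooled = [sum(col) / len(rows) for col in zip(*rows)]
    logit = sum(w * x for w, x in zip(readout, pooled))
    return 1.0 / (1.0 + math.exp(-logit))

def mean_loss(prompt, data, embed, readout):
    """Average binary cross-entropy over (token_ids, label) pairs."""
    total = 0.0
    for token_ids, label in data:
        p = predict(prompt, token_ids, embed, readout)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithm
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(data)

def tune_prompt(prompt, data, embed, readout, lr=0.5, steps=200):
    """Gradient descent on the prompt only; embed and readout are never updated."""
    prompt = [row[:] for row in prompt]  # work on a copy
    k = len(prompt)
    for _ in range(steps):
        # With mean pooling, every prompt row receives the same gradient:
        # dL/dprompt[i][j] = (p - label) * readout[j] / sequence_length.
        grad = [0.0] * len(readout)
        for token_ids, label in data:
            p = predict(prompt, token_ids, embed, readout)
            scale = (p - label) / (k + len(token_ids)) / len(data)
            for j, w in enumerate(readout):
                grad[j] += scale * w
        for row in prompt:
            for j in range(len(row)):
                row[j] -= lr * grad[j]
    return prompt

embed, readout = make_frozen_model(vocab_size=8, dim=8)
data = [([0, 1, 2], 1), ([3, 4], 0), ([1, 5, 6], 1), ([2, 7], 0)]  # made-up examples
start = init_prompt(prompt_len=4, dim=8)
tuned = tune_prompt(start, data, embed, readout)
print("loss before:", mean_loss(start, data, embed, readout))
print("loss after: ", mean_loss(tuned, data, embed, readout))
```

Only `tuned` would need to be stored per task; `embed` and `readout` are read but never written. In this toy the prompt can only shift the pooled representation, so it illustrates the training loop rather than the accuracy the paper reports.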
If this is right
- One frozen model can be reused for many tasks by storing only the small prompt parameters instead of separate full copies.
- Serving costs drop because the large model weights need to be loaded only once and shared across applications.
- Domain-transfer robustness improves relative to full model tuning.
- The approach simplifies prefix tuning while matching its results on the evaluated settings.
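The storage argument in the first two bullets reduces to simple arithmetic, sketched below. The 11B model size is the figure cited in the referee report; the prompt length of 100, the embedding width of 4096, and the task count of 10 are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def prompt_param_count(prompt_len, embed_dim):
    """A soft prompt is a prompt_len x embed_dim matrix of trainable values."""
    return prompt_len * embed_dim

def storage_summary(model_params, prompt_len, embed_dim, num_tasks):
    """Compare per-task full copies against one frozen model plus per-task prompts."""
    per_task = prompt_param_count(prompt_len, embed_dim)
    return {
        "per_task_prompt_params": per_task,
        "prompt_fraction_of_model": per_task / model_params,
        "model_tuning_total_params": model_params * num_tasks,
        "prompt_tuning_total_params": model_params + per_task * num_tasks,
    }

summary = storage_summary(
    model_params=11_000_000_000,  # 11B, as cited in the report
    prompt_len=100,               # assumed for illustration
    embed_dim=4096,               # assumed for illustration
    num_tasks=10,                 # assumed for illustration
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions each additional task adds a few hundred thousand stored values rather than another full copy of the model.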
Where Pith is reading between the lines
- Deployment pipelines for very large models can shift toward storing and swapping small prompts rather than full fine-tuned weights.
- If the trend continues, parameter-efficient adaptation may become the default route for applying foundation models to new tasks.
- The method invites direct comparisons on non-T5 architectures to test whether the scale advantage is architecture-specific.
Load-bearing premise
The scaling trend observed on T5 models and the tested tasks will hold for other model families, architectures, and task distributions.
What would settle it
Prompt tuning failing to match full model tuning performance on a new family of models larger than a few billion parameters using the same training procedure.
Original abstract
In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces prompt tuning, a parameter-efficient adaptation method that learns a small number of continuous 'soft prompt' embeddings while keeping the underlying language model (T5) frozen. It reports that on GLUE and SuperGLUE tasks, prompt tuning's performance gap to full model tuning shrinks with scale; at 11B parameters the two methods become competitive, and prompt tuning also outperforms GPT-3 few-shot learning while showing improved robustness under domain shift.
Significance. If the reported scaling trend holds, the result is significant: it demonstrates that a single frozen model can be reused across many tasks via tiny per-task prompt parameters, substantially lowering storage and serving costs for large LMs. The systematic size ablations on T5 (60M–11B) and direct comparisons to prefix tuning constitute a clear empirical contribution.
major comments (1)
- [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.
minor comments (3)
- [§3.1] The definition of the soft prompt as a sequence of length k is clear, but the initialization scheme (random vs. vocabulary tokens) and whether it is held constant across all model sizes should be stated explicitly in the main text rather than only in the appendix.
- [Figure 3] Axis labels and legend entries are too small for print; the scaling curves would be easier to read if the x-axis were log-scaled with explicit parameter counts annotated.
- [§5] The discussion of domain-transfer robustness would benefit from a brief statement of how the source and target domains were selected and whether the improvement is consistent across all transfer pairs or driven by a subset.
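The two initialization options named in the §3.1 comment can be sketched as follows. This is an illustrative sketch in plain Python, not the paper's code: the function names, the uniform range of [-0.5, 0.5], and the toy embedding table are all assumptions.

```python
import random

def init_random(prompt_len, dim, scale=0.5, seed=0):
    """Random initialization: each prompt value drawn uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(prompt_len)]

def init_from_vocab(prompt_len, embed, seed=0):
    """Vocabulary initialization: copy the embeddings of randomly sampled token ids."""
    rng = random.Random(seed)
    token_ids = [rng.randrange(len(embed)) for _ in range(prompt_len)]
    # Copy each row so later prompt updates cannot modify the frozen table.
    return [list(embed[t]) for t in token_ids], token_ids

# Hypothetical 6-token vocabulary with 4-dimensional embeddings.
toy_embed = [[float(10 * t + j) for j in range(4)] for t in range(6)]

random_prompt = init_random(prompt_len=3, dim=4)
vocab_prompt, source_ids = init_from_vocab(prompt_len=3, embed=toy_embed)
print("random init:", random_prompt)
print("vocab init from token ids", source_ids, ":", vocab_prompt)
```

Either function returns a prompt of the same shape, so a study could hold the choice fixed across model sizes by passing the same function and seed to every run.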
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
-
Referee: [§4.2 and Table 1] The central claim that prompt tuning 'matches' model tuning at 11B parameters rests on point estimates; no standard deviations across random seeds or statistical significance tests are reported, which weakens the assertion that the gap has closed rather than narrowed within noise.
Authors: We agree that reporting variability across random seeds and including statistical significance tests would make the central claim more robust. Our original experiments used single runs for the 11B models owing to the substantial computational cost of training and evaluating models at this scale. In the revised manuscript, we will rerun the 11B-scale experiments with multiple random seeds, report mean performance and standard deviations in Table 1 and §4.2, and add a brief discussion of statistical significance for the key comparisons. This will allow readers to assess whether the performance gap has closed within the observed variance.
Revision: yes
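A minimal sketch of the kind of reporting discussed in this exchange, assuming per-seed scores are available as plain lists. The score values are placeholders invented for illustration, not results from the paper, and Welch's t statistic is one possible choice of test rather than one named by the referee or the authors.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances.

    Requires at least two scores per sample and nonzero variance in at least one.
    """
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    std_err = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / std_err

# Placeholder per-seed scores (NOT from the paper).
prompt_tuning_scores = [0.80, 0.82, 0.81]
model_tuning_scores = [0.81, 0.83, 0.80]

for name, scores in [("prompt tuning", prompt_tuning_scores),
                     ("model tuning", model_tuning_scores)]:
    mean, sd = summarize(scores)
    print(f"{name}: mean={mean:.3f} sd={sd:.3f} (n={len(scores)})")
print("Welch t:", welch_t(prompt_tuning_scores, model_tuning_scores))
```

A small absolute t value relative to the seed-to-seed spread would be consistent with "closed within noise"; with only a handful of seeds, such a statistic is a rough guide rather than a firm conclusion.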
Circularity Check
No significant circularity: empirical scaling observations only
Full rationale
The paper reports direct experimental comparisons of prompt tuning versus model tuning across T5 model sizes (60M to 11B parameters) on standard NLP tasks. No equations, fitted parameters, or predictions are defined in terms of the target metrics; performance gaps are measured on held-out test sets using fixed training protocols. The central claim is an observed trend, not a derivation. Self-citations are absent from load-bearing steps, and the prefix-tuning comparison cites external work (Li & Liang 2021) without circular reduction. The result is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- soft prompt length
axioms (1)
- domain assumption: A frozen pre-trained language model encodes sufficient general knowledge that task-specific behavior can be elicited by conditioning on a small learned input prefix.
invented entities (1)
-
soft prompt
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method closes the gap and matches the strong performance of model tuning
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
prompt tuning becomes more competitive with scale
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 41 Pith papers
-
Finetuned Language Models Are Zero-Shot Learners
Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation
CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
Combining pre-trained models via localized model averaging
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
Query-efficient model evaluation using cached responses
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
-
Are Large Language Models Economically Viable for Industry Deployment?
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
-
ConforNets: Latents-Based Conformational Control in OpenFold3
ConforNets use channel-wise affine transforms on pre-Pairformer pair latents in OpenFold3 to achieve state-of-the-art unsupervised generation of alternate protein states and supervised conformational transfer across families.
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models
RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical predic...
-
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
-
Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation
LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.
-
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
CoLA introduces a dual-path low-rank adaptation method that adds cross-modal learning to LoRA, delivering small gains over standard LoRA on visual grounding and audio-visual benchmarks while preserving parameter efficiency.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
PaLM-E: An Embodied Multimodal Language Model
PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
-
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning
HEDP uses energy regularization inspired by Helmholtz free energy plus hybrid energy-distance weighting in prompts to improve domain selection and achieve a 2.57% accuracy gain on benchmarks like CORe50 while mitigati...
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
-
Deep Reprogramming Distillation for Medical Foundation Models
DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...
-
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
-
SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.
-
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.
-
RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments
RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.
-
HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation
HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.
-
SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification
SCHK-HTC uses sibling contrastive learning plus hierarchical prompt tuning to improve discrimination between confusable sibling classes in few-shot hierarchical text classification.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.
Reference graph
Works this paper leans on
-
[1]
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6--4. Venice
-
[2]
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC
-
[3]
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. http://github.com/google/jax JAX: composable transformations of Python+NumPy programs
-
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...
-
[5]
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...
-
[6]
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177--190. Springer
-
[7]
Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23
-
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...
-
[9]
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005)
-
[10]
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. https://doi.org/10.18653/v1/N19-1246 DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...
-
[11]
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP
-
[12]
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1--9. Association for Computational Linguistics
-
[14]
L. K. Hansen and P. Salamon. 1990. https://doi.org/10.1109/34.58871 Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993--1001
-
[15]
Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. http://github.com/google/flax Flax: A neural network library and ecosystem for JAX
-
[16]
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. http://proceedings.mlr.press/v97/houlsby19a.html Parameter-efficient transfer learning for NLP . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine L...
-
[17]
Jeremy Howard and Sebastian Ruder. 2018. https://doi.org/10.18653/v1/P18-1031 Universal language model fine-tuning for text classification . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia. Association for Computational Linguistics
-
[18]
Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs First Quora dataset release: Question pairs
-
[19]
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. https://doi.org/10.1162/tacl_a_00324 How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423--438
-
[20]
A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. https://doi.org/10.1109/CVPR.2017.571 Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376--5384
-
[21]
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL)
-
[22]
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. https://doi.org/10.18653/v1/P19-1478 A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837--4842, Florence, Italy. Association for Computational L...
-
[24]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark. Association for Computational Linguistics
-
[25]
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc
-
[26]
Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning
-
[27]
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/K17-1034 Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333--342, Vancouver, Canada. Association for Computational Linguistics
-
[28]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. https://doi.org/10.18653/v1/2020.acl-main.703 BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Associat...
- [30]
- [31]
-
[32]
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, page 807–814, Madison, WI, USA. Omnipress
-
[33]
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. https://doi.org/10.18653/v1/N18-1202 Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Lon...
-
[34]
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.617 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654--7673, Online. Association for Comput...
-
[35]
Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. http://arxiv.org/abs/1808.09121 WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121
-
[37]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training
-
[38]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI Blog
-
[39]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67
-
[41]
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. https://proceedings.neurips.cc/paper/2017/file/e7b24b112a44fdd9ee93bdf998c6ca0e-Paper.pdf Learning multiple visual domains with residual adapters . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc
-
[42]
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series
-
[43]
Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. https://doi.org/10.18653/v1/P18-1156 DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683--1693, Melbourne, ...
-
[44]
Timo Schick and Hinrich Schütze. 2021. https://aclanthology.org/2021.eacl-main.20 Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255--269, Online. Association for Computational...
-
[45]
Noam Shazeer. 2020. http://arxiv.org/abs/2002.05202 GLU variants improve transformer . CoRR, abs/2002.05202
-
[46]
Noam Shazeer and Mitchell Stern. 2018. http://proceedings.mlr.press/v80/shazeer18a.html Adafactor: Adaptive learning rates with sublinear memory cost . In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596--4604. PMLR
-
[47]
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.346 AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222--4235, On...
-
[48]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998--6008
-
[49]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing System...
-
[50]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR
-
[52]
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. http://arxiv.org/abs/1810.12885 ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885
-
[53]
Language Models are Few-Shot Learners
Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...
-
[54]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Journal of Machine Learning Research.
-
[55]
Prefix-tuning: Optimizing continuous prompts for generation
Li, Xiang Lisa and Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353
-
[56]
WARP: Word-level Adversarial ReProgramming
Hambardzumyan, Karen and Khachatrian, Hrant and May, Jonathan. WARP: Word-level Adversarial ReProgramming. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.381
-
[57]
Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. CoRR. 2020.
-
[58]
Iyer, Shankar and Dandekar, Nikhil and Csernai, Kornel , title =
-
[59]
Automatically constructing a corpus of sentential paraphrases. Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
-
[60]
Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , note=
-
[61]
Adam Fisch and Alon Talmor and Robin Jia and Minjoon Seo and Eunsol Choi and Danqi Chen , booktitle=
-
[62]
Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. CoRR. 2020.
-
[63]
Transforming task representations to perform novel tasks. Proceedings of the National Academy of Sciences. 2020.
-
[64]
Attention is All you Need , url =
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,. Attention is All you Need , url =. Advances in Neural Information Processing Systems , editor =
-
[65]
Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel , booktitle =
-
[66]
Daniel Khashabi and Snigdha Chaturvedi and Michael Roth and Shyam Upadhyay and Dan Roth , title =. Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL) , year =
-
[67]
Proceedings of Sinn und Bedeutung 23 , author=
The. Proceedings of Sinn und Bedeutung 23 , author=
-
[68]
Dagan, Ido and Glickman, Oren and Magnini, Bernardo , booktitle=. The. 2005 , organization=
-
[69]
Bar-Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan , booktitle=. The second. 2006 , organization=
- [70]
- [71]
-
[72]
Choice of plausible alternatives: An evaluation of commonsense causal reasoning. 2011 AAAI Spring Symposium Series.
-
[73]
Levesque, Hector and Davis, Ernest and Morgenstern, Leora , booktitle=. The
- [74]
-
[75]
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. CoRR. 2018.
-
[76]
Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
-
[77]
SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation
SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. arXiv preprint arXiv:1708.00055.
-
[78]
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018
-
[79]
SQuAD: 100,000+ questions for machine comprehension of text
Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1264
-
[80]
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. CoRR. 2021.
-
[81]
Language Models are Unsupervised Multitask Learners
-
[82]
2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=
A. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing , title=. 2013 , volume=
-
[83]
Kudo, Taku and Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. doi:10.18653/v1/D18-2012
-
[84]
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. CoRR. 2018.
-
[85]
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , title=
A. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , title=. 2017 , volume=
-
[86]
James Bradbury and Roy Frostig and Peter Hawkins and Matthew James Johnson and Chris Leary and Dougal Maclaurin and George Necula and Adam Paszke and Jake Vander