pith. machine review for the scientific record.

arxiv: 2305.11206 · v1 · submitted 2023-05-18 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

LIMA: Less Is More for Alignment

Authors on Pith no claims yet

Pith reviewed 2026-05-17 11:29 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords large language models · instruction tuning · alignment · pretraining · fine-tuning · supervised learning

The pith

Large language models acquire nearly all knowledge during pretraining and need only limited curated data for alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors fine-tune a 65 billion parameter LLaMa model on just 1,000 high-quality prompt and response pairs using only supervised learning. This approach skips the usual reinforcement learning from human feedback step entirely. The resulting LIMA model handles complex instructions like trip planning and historical speculation, and it generalizes to tasks absent from its training data. Human evaluators prefer or rate LIMA responses as equivalent to GPT-4 outputs in 43% of comparisons, with even higher rates against other aligned models. These findings point to pretraining as the main stage where knowledge is built, while alignment needs far less data than commonly assumed.

Core claim

LIMA, a 65B parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompts and responses without any reinforcement learning or human preference modeling, demonstrates strong performance on complex queries and generalizes well to unseen tasks, with responses equivalent or preferred to GPT-4 in 43% of human evaluations.
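The "standard supervised loss" in this claim is ordinary next-token cross-entropy computed only over response tokens, with the prompt positions masked out of the loss. A minimal sketch, assuming per-token log-probabilities have already been obtained from the model; the function and variable names are illustrative, not taken from the paper's code:

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: model log-probability assigned to each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens (no loss on the prompt).
    """
    masked_nll = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked_nll) / len(masked_nll)

# Toy sequence: two prompt tokens (masked out) and three response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)  # mean NLL over the three response tokens
```

The point of the masking is that the 1,000 curated examples supervise only the model's outputs; the prompts themselves contribute no gradient.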

What carries the argument

The set of 1,000 carefully curated instruction-response pairs for supervised fine-tuning, which teach output formats and enable generalization.

Load-bearing premise

The 1,000 examples capture the essential behaviors needed for broad generalization without introducing bias in the human preference judgments.

What would settle it

A controlled experiment in which 1,000 randomly selected examples produced markedly worse generalization and lower human preference scores than the curated set would falsify the claim that small data volume alone suffices, showing instead that curation carries the result.

read the original abstract

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LIMA, a 65B-parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompt-response pairs, without reinforcement learning or human preference modeling. It reports that the model learns to follow complex response formats and generalizes to unseen tasks, achieving human preference rates of 43% equivalent-or-better versus GPT-4, 58% versus Bard, and 65% versus DaVinci003. The central claim is that almost all knowledge in large language models is acquired during pretraining and that limited instruction-tuning data suffices for high-quality output.

Significance. If the results hold after addressing the noted concerns, the work would be significant for LLM alignment research. It provides empirical support for the hypothesis that pretraining encodes the majority of capabilities and that small, high-quality supervised datasets can elicit strong instruction-following behavior, potentially simplifying training pipelines and reducing reliance on large-scale RLHF. The direct comparisons to strong baselines in a controlled human study strengthen the contribution.

major comments (2)
  1. [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).
  2. [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.
minor comments (2)
  1. [Data Curation] Clarify the exact criteria used for prompt selection and response writing in the data section to allow replication.
  2. [Experiments] Add a table or figure summarizing the distribution of task types in the 1,000-example set versus the evaluation set.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our LIMA manuscript. We respond point-by-point to the major concerns below, clarifying our claims and indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).

    Authors: The manuscript explicitly describes the 1,000 examples as 'carefully curated' in the abstract and methods, with responses written to demonstrate formats and generalization. Our claim is that limited high-quality instruction data suffices for alignment because pretraining encodes most capabilities; we do not claim that any arbitrary small set would work equally well. The curation intentionally covers diverse tasks including planning and history to probe generalization. We did not run ablations on randomly sampled sets of equal size, as such sets would likely lack the targeted coverage needed to elicit the observed behaviors. In revision we will expand the data section with additional details on curation criteria and example selection to make this distinction clearer. revision: partial

  2. Referee: [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.

    Authors: The abstract is space-constrained; the full manuscript details the human study in Section 4, including randomized presentation order to control for bias. We will revise the paper to include explicit rater instructions, inter-annotator agreement statistics, and statistical significance tests for the reported preference rates, either in the main text or an expanded appendix. revision: yes
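The significance testing promised here can be done with an exact binomial test on the head-to-head preference counts. A sketch in pure Python; the 43-of-100 figures below are illustrative stand-ins, not the study's actual rater counts:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    under Binomial(n, p) that are no more likely than the observed count k."""
    pmf = [comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(n + 1)]
    # The (1 + 1e-12) factor guards against float ties at the symmetric point.
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12)))

# Illustrative: 43 "LIMA equivalent-or-better" outcomes out of 100 comparisons,
# tested against a 50% no-preference null.
p_value = binom_two_sided_p(43, 100)
```

With these made-up counts the deviation from 50% would not be significant at the 0.05 level, which is exactly why the real per-comparison counts matter for the claim.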

Circularity Check

0 steps flagged

Empirical fine-tuning study with no circular derivation or self-referential results

full rationale

The paper reports a standard supervised fine-tuning experiment on a 65B LLaMa model using 1,000 curated examples, followed by human preference evaluation against baselines. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters, self-definitions, or self-citations. The central suggestion that most knowledge comes from pretraining is an interpretive inference drawn from the observed generalization in the human study, not a tautology or load-bearing claim that collapses into the training data selection itself. The work is self-contained as an external benchmark comparison and does not invoke uniqueness theorems or ansatzes from prior author work to force its conclusions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the base LLaMa 65B model already encodes the required knowledge and that the curated 1,000 examples are representative; no new entities are postulated.

free parameters (1)
  • Selection and size of the 1,000-example set
    The number and content of examples are chosen by the authors to demonstrate sufficiency; the curation process is not derived from first principles.
axioms (1)
  • domain assumption The pretrained LLaMa 65B checkpoint already contains the factual and reasoning knowledge needed for the tested tasks.
    The paper builds directly on the prior LLaMa release without re-deriving its capabilities.

pith-pipeline@v0.9.0 · 5576 in / 1423 out tokens · 27908 ms · 2026-05-17T11:29:22.336718+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  3. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    cs.CL 2023-10 conditional novelty 7.0

    Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

  4. Objaverse-XL: A Universe of 10M+ 3D Objects

    cs.CV 2023-07 accept novelty 7.0

    Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.

  5. Let the Target Select for Itself: Data Selection via Target-Aligned Paths

    cs.LG 2026-05 unverdicted novelty 6.0

    Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.

  6. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...

  7. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

  8. A Layer-wise Analysis of Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    Middle layers (20-80%) remain stable during SFT while final layers are sensitive, enabling Mid-Block Efficient Tuning that outperforms LoRA by up to 10.2% on GSM8K with reduced parameter count.

  9. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  10. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  11. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    cs.CL 2023-10 conditional novelty 6.0

    FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...

  12. Aligning Large Multimodal Models with Factually Augmented RLHF

    cs.CV 2023-09 conditional novelty 6.0

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  13. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  14. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

    cs.AI 2026-05 unverdicted novelty 5.0

    Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

  15. The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

  16. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  17. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  18. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 18 Pith papers · 7 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022

  2. [8]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022

  3. [10]

    StackLLaMA: An RL fine-tuned LLaMA model for Stack Exchange question and answering

    Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023. doi:10.57967/hf/0513

  4. [11]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% ChatGPT quality, March 2023

  5. [12]

    Stanford Alpaca: An Instruction-following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. GitHub repository, 2023

  6. [13]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report, 2023

  7. [14]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  8. [15]

    Large Language Models are Zero-Shot Reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022

  9. [16]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022

  10. [17]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

  11. [22]

    Finetuned Language Models are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Zhao, et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

  12. [23]

    The Pushshift Reddit Dataset

    Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, 2020

  13. [24]

    Self-Instruct: Aligning Language Model with Self Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, et al. Self-Instruct: Aligning language model with self generated instructions, 2022

  14. [25]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, et al. A general language assistant as a laboratory for alignment, 2021

  15. [26]

    Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022

  16. [27]

    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, et al. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023

  17. [28]

    GPTScore: Evaluate as You Desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire, 2023

  18. [29]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a

  19. [30]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b

  20. [31]

    The pushshift reddit dataset

    Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, pages 830--839, 2020

  21. [32]

    Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023

    Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023. URL https://huggingface.co/blog/stackllama

  22. [33]

    Vicuna: An open-source chatbot impressing gpt-4 with 90% ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  23. [34]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  24. [35]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  25. [36]

    The turking test: Can language models understand instructions?

    Avia Efrat and Omer Levy. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982, 2020

  26. [37]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

  27. [38]

    Unnatural instructions: Tuning language models with (almost) no human labor, 2022

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022

  28. [39]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019

  29. [40]

    A few more examples may be worth billions of parameters

    Yuval Kirstain, Patrick Lewis, Sebastian Riedel, and Omer Levy. A few more examples may be worth billions of parameters. arXiv preprint arXiv:2110.04374, 2021

  30. [41]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022

  31. [42]

    Openassistant conversations -- democratizing large language model alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations -- democratizing large language model alignment, 2023

  32. [43]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  33. [44]

    Natural instructions: Benchmarking generalization to new tasks from natural language instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, pages 839--849, 2021

  34. [45]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  35. [46]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  36. [47]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022

  37. [48]

    Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023

  38. [49]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  39. [50]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [51]

    Self-instruct: Aligning language model with self generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022 a

  41. [52]

    Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022 b

  42. [53]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022 a

  43. [54]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022 b