LIMA: Less Is More for Alignment
Pith reviewed 2026-05-17 11:29 UTC · model grok-4.3
The pith
Large language models acquire nearly all knowledge during pretraining and need only limited curated data for alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LIMA, a 65B parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling, demonstrates strong performance on complex queries and generalizes well to unseen tasks, with responses judged equivalent to or strictly preferred over GPT-4's in 43% of human evaluations.
What carries the argument
The set of 1,000 carefully curated instruction-response pairs for supervised fine-tuning, which teach output formats and enable generalization.
Load-bearing premise
The 1,000 examples capture the essential behaviors needed for broad generalization without introducing bias in the human preference judgments.
What would settle it
A controlled experiment showing that a randomly selected set of 1,000 examples produces markedly worse generalization and lower human preference scores than the curated set would falsify the claim.
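Concretely, once head-to-head preference counts exist for a curated-set model and a random-set model against the same fixed baseline, the comparison reduces to a test for a difference in preference rates. The sketch below (Python, standard library only) illustrates that final readout; the counts and the 300-comparison sample size are hypothetical placeholders, not results from the paper or from any actual ablation.

# Sketch of the decisive ablation readout: size-matched curated vs. random
# fine-tuning sets, judged head-to-head against the same fixed baseline.
# All counts below are hypothetical placeholders, not reported results.
from math import erf, sqrt

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for a difference between two preference rates."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Hypothetical outcome: curated-set model judged equivalent-or-better in
# 129/300 comparisons, random-set model in 90/300, against the same baseline.
rate_c, rate_r, z, p = two_proportion_z(129, 300, 90, 300)
print(f"curated={rate_c:.0%}  random={rate_r:.0%}  z={z:.2f}  p={p:.4f}")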
Original abstract
Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
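For readers unfamiliar with the phrase, the "standard supervised loss" here is ordinary next-token cross-entropy over the fine-tuning examples. The sketch below shows one common way to set that up, supervising only the response tokens; the checkpoint name, response-only masking convention, sequence cutoff, and learning rate are illustrative assumptions, not LIMA's published recipe.

# Minimal sketch of supervised fine-tuning on prompt-response pairs.
# The checkpoint, 2048-token cutoff, masking convention, and learning rate
# are placeholders for illustration, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-65b"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

def encode(prompt, response, max_len=2048):
    """Tokenize prompt + response; supervise only the response tokens."""
    p_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    r_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = (p_ids + r_ids)[:max_len]
    labels = ([-100] * len(p_ids) + r_ids)[:max_len]  # -100 is ignored by the loss
    return torch.tensor([input_ids]), torch.tensor([labels])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters

def train_step(prompt, response):
    # The causal-LM head shifts labels internally and applies token-level
    # cross-entropy, i.e. ordinary supervised next-token prediction.
    input_ids, labels = encode(prompt, response)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Looping this step over the curated pairs is, in outline, the entire alignment stage the abstract describes; there is no reward model or preference-optimization phase.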
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LIMA, a 65B-parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompt-response pairs, without reinforcement learning or human preference modeling. It reports that the model learns to follow complex response formats and generalizes to unseen tasks, achieving human preference rates of 43% equivalent-or-better versus GPT-4, 58% versus Bard, and 65% versus DaVinci003. The central claim is that almost all knowledge in large language models is acquired during pretraining and that limited instruction-tuning data suffices for high-quality output.
Significance. If the results hold after addressing the noted concerns, the work would be significant for LLM alignment research. It provides empirical support for the hypothesis that pretraining encodes the majority of capabilities and that small, high-quality supervised datasets can elicit strong instruction-following behavior, potentially simplifying training pipelines and reducing reliance on large-scale RLHF. The direct comparisons to strong baselines in a controlled human study strengthen the contribution.
major comments (2)
- [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).
- [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.
minor comments (2)
- [Data Curation] Clarify the exact criteria used for prompt selection and response writing in the data section to allow replication.
- [Experiments] Add a table or figure summarizing the distribution of task types in the 1,000-example set versus the evaluation set.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our LIMA manuscript. We respond point-by-point to the major concerns below, clarifying our claims and indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).
Authors: The manuscript explicitly describes the 1,000 examples as 'carefully curated' in the abstract and methods, with responses written to demonstrate formats and generalization. Our claim is that limited high-quality instruction data suffices for alignment because pretraining encodes most capabilities; we do not claim that any arbitrary small set would work equally well. The curation intentionally covers diverse tasks including planning and history to probe generalization. We did not run ablations on randomly sampled sets of equal size, as such sets would likely lack the targeted coverage needed to elicit the observed behaviors. In revision we will expand the data section with additional details on curation criteria and example selection to make this distinction clearer. revision: partial
-
Referee: [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.
Authors: The abstract is space-constrained; the full manuscript details the human study in Section 4, including randomized presentation order to control for bias. We will revise the paper to include explicit rater instructions, inter-annotator agreement statistics, and statistical significance tests for the reported preference rates, either in the main text or an expanded appendix. revision: yes
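The statistics promised above are simple to compute once per-comparison judgments are available. A minimal sketch follows: an exact two-sided binomial test of the equivalent-or-better rate against 50%, and Cohen's kappa for agreement between two annotators. All inputs shown are hypothetical placeholders, since the raw judgments are not reproduced here.

# Sketch of the statistics the revision commits to: an exact two-sided binomial
# test of the equivalent-or-better rate against 50%, and Cohen's kappa between
# two annotators. All inputs here are hypothetical placeholders.
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value (sum over outcomes no more likely than k)."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    ref = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= ref * (1 + 1e-9))

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical: 129 of 300 head-to-head comparisons rated equivalent-or-better.
print("binomial p vs. 50% =", round(binom_two_sided(129, 300), 4))
ann1 = ["win", "tie", "lose", "win", "tie", "lose", "win", "win"]
ann2 = ["win", "tie", "lose", "tie", "tie", "lose", "win", "lose"]
print("Cohen's kappa =", round(cohens_kappa(ann1, ann2), 3))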
Circularity Check
Empirical fine-tuning study with no circular derivation or self-referential results
Full rationale
The paper reports a standard supervised fine-tuning experiment on a 65B LLaMa model using 1,000 curated examples, followed by human preference evaluation against baselines. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters, self-definitions, or self-citations. The central suggestion that most knowledge comes from pretraining is an interpretive inference drawn from the observed generalization in the human study, not a tautology that collapses into the training data selection itself. The work is self-contained as an external benchmark comparison and does not invoke uniqueness theorems or ansatzes from the authors' prior work to force its conclusions.
Axiom & Free-Parameter Ledger
free parameters (1)
- Selection and size of the 1,000-example set
axioms (1)
- Domain assumption: the pretrained LLaMa 65B checkpoint already contains the factual and reasoning knowledge needed for the tested tasks.
Forward citations
Cited by 18 Pith papers
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
-
Objaverse-XL: A Universe of 10M+ 3D Objects
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
-
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.
-
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...
-
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
-
A Layer-wise Analysis of Supervised Fine-Tuning
Middle layers (20-80%) remain stable during SFT while final layers are sensitive, enabling Mid-Block Efficient Tuning that outperforms LoRA by up to 10.2% on GSM8K with reduced parameter count.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
-
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Reference graph
Works this paper leans on
-
[1]
Training language models to follow instructions with human feedback
Ouyang et al. Advances in Neural Information Processing Systems, 2022
-
[8]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Sanh et al. The Tenth International Conference on Learning Representations, 2022
-
[10]
StackLLaMA: An RL fine-tuned LLaMA model for Stack Exchange question and answering
Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert, 2023. doi:10.57967/hf/0513
-
[11]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing, 2023
-
[12]
Stanford Alpaca: An Instruction-following LLaMA Model
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023
- [13]
-
[14]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. Advances in Neural Information Processing Systems, 2022
-
[15]
Large Language Models are Zero-Shot Reasoners
Kojima et al. ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022
-
[16]
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks
Wang et al. EMNLP, 2022
-
[17]
The Curious Case of Neural Text Degeneration
Holtzman et al. International Conference on Learning Representations, 2019
-
[22]
Finetuned Language Models are Zero-Shot Learners
Wei et al. International Conference on Learning Representations, 2022
-
[23]
The Pushshift Reddit Dataset
Baumgartner et al. Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020
-
[24]
Self-Instruct: Aligning Language Model with Self Generated Instructions
Wang et al., 2022
-
[25]
A General Language Assistant as a Laboratory for Alignment
Askell et al., 2021
-
[26]
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Honovich et al., 2022
-
[27]
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Sun et al., 2023
- [28]
-
[29]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a
-
[30]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b
-
[31]
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, pages 830--839, 2020
-
[32]
Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023
Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023. URL https://huggingface.co/blog/stackllama
-
[33]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
-
[34]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
-
[35]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
-
[36]
Avia Efrat and Omer Levy. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982, 2020
-
[37]
The curious case of neural text degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019
-
[38]
Unnatural instructions: Tuning language models with (almost) no human labor, 2022
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022
-
[39]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019
-
[40]
A few more examples may be worth billions of parameters
Yuval Kirstain, Patrick Lewis, Sebastian Riedel, and Omer Levy. A few more examples may be worth billions of parameters. arXiv preprint arXiv:2110.04374, 2021
-
[41]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022
-
[42]
Openassistant conversations -- democratizing large language model alignment
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. OpenAssistant conversations -- democratizing large language model alignment
-
[43]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
-
[44]
Natural instructions: Benchmarking generalization to new tasks from natural language instructions
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, pages 839--849, 2021
- [45]
-
[46]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022
-
[47]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022
-
[48]
Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023
- [49]
-
[50]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[51]
Self-Instruct: Aligning Language Model with Self Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022 a
-
[52]
Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022 b
-
[53]
Finetuned language models are zero-shot learners
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022 a
-
[54]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022 b
discussion (0)