LIMA: Less Is More for Alignment
Pith reviewed 2026-05-17 11:29 UTC · model grok-4.3
The pith
Large language models acquire nearly all knowledge during pretraining and need only limited curated data for alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LIMA, a 65B parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling, demonstrates strong performance on complex queries and generalizes well to unseen tasks, with responses judged equivalent to or strictly preferred over GPT-4's in 43% of human evaluations.
What carries the argument
The set of 1,000 carefully curated instruction-response pairs for supervised fine-tuning, which teach output formats and enable generalization.
Load-bearing premise
The 1,000 examples capture the essential behaviors needed for broad generalization without introducing bias in the human preference judgments.
What would settle it
A controlled experiment showing that a randomly selected set of 1,000 examples produces markedly worse generalization and lower human preference scores than the curated set would falsify the claim.
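Concretely, once head-to-head preference counts exist for a curated-set model and a random-set model against the same fixed baseline, the comparison reduces to a test for a difference in preference rates. The sketch below (Python, standard library only) illustrates that final readout; the counts and the 300-comparison sample size are hypothetical placeholders, not results from the paper or from any actual ablation.

# Sketch of the decisive ablation readout: size-matched curated vs. random
# fine-tuning sets, judged head-to-head against the same fixed baseline.
# All counts below are hypothetical placeholders, not reported results.
from math import erf, sqrt

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for a difference between two preference rates."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Hypothetical outcome: curated-set model judged equivalent-or-better in
# 129/300 comparisons, random-set model in 90/300, against the same baseline.
rate_c, rate_r, z, p = two_proportion_z(129, 300, 90, 300)
print(f"curated={rate_c:.0%}  random={rate_r:.0%}  z={z:.2f}  p={p:.4f}")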
Original abstract
Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
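For readers unfamiliar with the phrase, the "standard supervised loss" here is ordinary next-token cross-entropy over the fine-tuning examples. The sketch below shows one common way to set that up, supervising only the response tokens; the checkpoint name, response-only masking convention, sequence cutoff, and learning rate are illustrative assumptions, not LIMA's published recipe.

# Minimal sketch of supervised fine-tuning on prompt-response pairs.
# The checkpoint, 2048-token cutoff, masking convention, and learning rate
# are placeholders for illustration, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-65b"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

def encode(prompt, response, max_len=2048):
    """Tokenize prompt + response; supervise only the response tokens."""
    p_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    r_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = (p_ids + r_ids)[:max_len]
    labels = ([-100] * len(p_ids) + r_ids)[:max_len]  # -100 is ignored by the loss
    return torch.tensor([input_ids]), torch.tensor([labels])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters

def train_step(prompt, response):
    # The causal-LM head shifts labels internally and applies token-level
    # cross-entropy, i.e. ordinary supervised next-token prediction.
    input_ids, labels = encode(prompt, response)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Looping this step over the curated pairs is, in outline, the entire alignment stage the abstract describes; there is no reward model or preference-optimization phase.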
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LIMA, a 65B-parameter LLaMa model fine-tuned with standard supervised loss on only 1,000 carefully curated prompt-response pairs, without reinforcement learning or human preference modeling. It reports that the model learns to follow complex response formats and generalizes to unseen tasks, achieving human preference rates of 43% equivalent-or-better versus GPT-4, 58% versus Bard, and 65% versus DaVinci003. The central claim is that almost all knowledge in large language models is acquired during pretraining and that limited instruction-tuning data suffices for high-quality output.
Significance. If the results hold after addressing the noted concerns, the work would be significant for LLM alignment research. It provides empirical support for the hypothesis that pretraining encodes the majority of capabilities and that small, high-quality supervised datasets can elicit strong instruction-following behavior, potentially simplifying training pipelines and reducing reliance on large-scale RLHF. The direct comparisons to strong baselines in a controlled human study strengthen the contribution.
major comments (2)
- [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).
- [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.
minor comments (2)
- [Data Curation] Clarify the exact criteria used for prompt selection and response writing in the data section to allow replication.
- [Experiments] Add a table or figure summarizing the distribution of task types in the 1,000-example set versus the evaluation set.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our LIMA manuscript. We respond point-by-point to the major concerns below, clarifying our claims and indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Abstract] Abstract and data description: The central claim that 'only limited instruction tuning data is necessary' rests on the 1,000 examples being a generic small set rather than an expert-curated collection. The abstract notes that responses were written 'to demonstrate specific formats and generalization,' but no ablations are reported comparing performance on randomly sampled or minimally filtered sets of equal size. This leaves open whether results derive from data volume or from implicit behavioral guidance injected during curation (e.g., coverage of planning and history tasks).
Authors: The manuscript explicitly describes the 1,000 examples as 'carefully curated' in the abstract and methods, with responses written to demonstrate formats and generalization. Our claim is that limited high-quality instruction data suffices for alignment because pretraining encodes most capabilities; we do not claim that any arbitrary small set would work equally well. The curation intentionally covers diverse tasks including planning and history to probe generalization. We did not run ablations on randomly sampled sets of equal size, as such sets would likely lack the targeted coverage needed to elicit the observed behaviors. In revision we will expand the data section with additional details on curation criteria and example selection to make this distinction clearer. revision: partial
-
Referee: [Human Study] Human evaluation: The 43% preference figure versus GPT-4 is load-bearing for the empirical claim, yet the abstract and summary provide insufficient detail on rater instructions, inter-annotator agreement, statistical significance testing, and controls for order or presentation bias. This weakens confidence that the human-study outcomes robustly support the generalization argument.
Authors: The abstract is space-constrained; the full manuscript details the human study in Section 4, including randomized presentation order to control for bias. We will revise the paper to include explicit rater instructions, inter-annotator agreement statistics, and statistical significance tests for the reported preference rates, either in the main text or an expanded appendix. revision: yes
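The statistics promised above are simple to compute once per-comparison judgments are available. A minimal sketch follows: an exact two-sided binomial test of the equivalent-or-better rate against 50%, and Cohen's kappa for agreement between two annotators. All inputs shown are hypothetical placeholders, since the raw judgments are not reproduced here.

# Sketch of the statistics the revision commits to: an exact two-sided binomial
# test of the equivalent-or-better rate against 50%, and Cohen's kappa between
# two annotators. All inputs here are hypothetical placeholders.
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value (sum over outcomes no more likely than k)."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    ref = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= ref * (1 + 1e-9))

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical: 129 of 300 head-to-head comparisons rated equivalent-or-better.
print("binomial p vs. 50% =", round(binom_two_sided(129, 300), 4))
ann1 = ["win", "tie", "lose", "win", "tie", "lose", "win", "win"]
ann2 = ["win", "tie", "lose", "tie", "tie", "lose", "win", "lose"]
print("Cohen's kappa =", round(cohens_kappa(ann1, ann2), 3))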
Circularity Check
Empirical fine-tuning study with no circular derivation or self-referential results
Full rationale
The paper reports a standard supervised fine-tuning experiment on a 65B LLaMa model using 1,000 curated examples, followed by human preference evaluation against baselines. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters, self-definitions, or self-citations. The central suggestion that most knowledge comes from pretraining is an interpretive inference drawn from the observed generalization in the human study, not a tautology that collapses into the training data selection itself. The work is self-contained as an external benchmark comparison and does not invoke uniqueness theorems or ansatzes from the authors' prior work to force its conclusions.
Axiom & Free-Parameter Ledger
free parameters (1)
- Selection and size of the 1,000-example set
axioms (1)
- Domain assumption: the pretrained LLaMa 65B checkpoint already contains the factual and reasoning knowledge needed for the tested tasks.
Forward citations
Cited by 18 Pith papers
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
-
Objaverse-XL: A Universe of 10M+ 3D Objects
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
-
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.
-
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...
-
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
-
A Layer-wise Analysis of Supervised Fine-Tuning
Middle layers (20-80%) remain stable during SFT while final layers are sensitive, enabling Mid-Block Efficient Tuning that outperforms LoRA by up to 10.2% on GSM8K with reduced parameter count.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
-
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Reference graph
Works this paper leans on
-
[1]
Training language models to follow instructions with human feedback
Ouyang et al. Advances in Neural Information Processing Systems, 2022
-
[8]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Sanh et al. The Tenth International Conference on Learning Representations, 2022
-
[10]
StackLLaMA: An RL fine-tuned LLaMA model for Stack Exchange question and answering
Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert, 2023. doi:10.57967/hf/0513
-
[11]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing, 2023
-
[12]
Stanford Alpaca: An Instruction-following LLaMA Model
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023
- [13]
-
[14]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. Advances in Neural Information Processing Systems, 2022
-
[15]
Large Language Models are Zero-Shot Reasoners
Kojima et al. ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022
-
[16]
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks
Wang et al. EMNLP, 2022
-
[17]
The Curious Case of Neural Text Degeneration
Holtzman et al. International Conference on Learning Representations, 2019
-
[22]
Finetuned Language Models are Zero-Shot Learners
Wei et al. International Conference on Learning Representations, 2022
-
[23]
The Pushshift Reddit Dataset
Baumgartner et al. Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020
-
[24]
Self-Instruct: Aligning Language Model with Self Generated Instructions
Wang et al., 2022
-
[25]
A General Language Assistant as a Laboratory for Alignment
Askell et al., 2021
-
[26]
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Honovich et al., 2022
-
[27]
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Sun et al., 2023
- [28]
-
[29]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a
-
[30]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b
-
[31]
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, pages 830--839, 2020
-
[32]
Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023
Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023. URL https://huggingface.co/blog/stackllama
-
[33]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
-
[34]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
-
[35]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
-
[36]
Avia Efrat and Omer Levy. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982, 2020
-
[37]
The curious case of neural text degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019
-
[38]
Unnatural instructions: Tuning language models with (almost) no human labor, 2022
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022
-
[39]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019
-
[40]
A few more examples may be worth billions of parameters
Yuval Kirstain, Patrick Lewis, Sebastian Riedel, and Omer Levy. A few more examples may be worth billions of parameters. arXiv preprint arXiv:2110.04374, 2021
-
[41]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022
-
[42]
Openassistant conversations -- democratizing large language model alignment
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. OpenAssistant conversations -- democratizing large language model alignment
-
[43]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
-
[44]
Natural instructions: Benchmarking generalization to new tasks from natural language instructions
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, pages 839--849, 2021
- [45]
-
[46]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022
-
[47]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022
-
[48]
Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023
- [49]
-
[50]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[51]
Self-Instruct: Aligning Language Model with Self Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022 a
-
[52]
Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022 b
-
[53]
Finetuned language models are zero-shot learners
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022 a
-
[54]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022 b
discussion (0)