Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Pith reviewed 2026-05-21 20:44 UTC · model grok-4.3
The pith
Smaller models trained on large language model rationales outperform much larger models with less data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By extracting rationales generated by a large language model and adding them as extra supervision signals in a multi-task framework, smaller student models can be trained to outperform the original large model on downstream tasks while requiring substantially fewer labeled or unlabeled training examples than either standard fine-tuning or conventional distillation.
What carries the argument
Distilling step-by-step, the process of using large language model rationales as additional supervision targets alongside task labels inside a single multi-task training objective for the smaller model.
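As a minimal sketch of what "rationales as additional supervision targets" could look like at the data level, each annotated input can be turned into two text-to-text examples, one per task. The `TrainingExample` class, the `build_multitask_examples` helper, and the literal `[label]` and `[rationale]` prefixes are illustrative assumptions here; they are not confirmed against the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    source: str  # text fed to the student model
    target: str  # text the student is trained to produce


def build_multitask_examples(question: str, label: str, rationale: str) -> List[TrainingExample]:
    """Turn one annotated input into two training examples, one per task.

    The "[label]" and "[rationale]" prefixes are hypothetical task markers
    that tell the student which output is expected for the same input.
    """
    return [
        TrainingExample(source=f"[label] {question}", target=label),
        TrainingExample(source=f"[rationale] {question}", target=rationale),
    ]
```

Under this framing the rationale is only a training target: at inference time the student would be queried with the label prefix alone, so no rationale needs to be generated to obtain a prediction.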
If this is right
- Smaller models reach higher accuracy than few-shot prompted large models on the tested NLP tasks.
- Both finetuning and distillation baselines require more training examples to reach comparable performance.
- The same small model size can match or beat a much larger model when rationales are included in training.
- Reductions in both model parameters and data volume occur simultaneously without loss of accuracy.
Where Pith is reading between the lines
- The approach may extend to domains where step-by-step explanations can be generated, such as code or math problems.
- Focus could shift from collecting more human labels toward improving the quality of machine-generated rationales.
- Resource-limited settings could adopt smaller models more readily if the method generalizes beyond the four benchmarks.
Load-bearing premise
The rationales generated by the large language model must be accurate and consistent enough to supply useful guidance to the smaller model rather than adding noise or systematic mistakes.
What would settle it
Direct comparison on the reported benchmark showing whether the 770M T5 model trained with the step-by-step method on 80 percent of the data exceeds the few-shot accuracy of the 540B PaLM model; failure to exceed would falsify the central performance claim.
Original abstract
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Distilling step-by-step, a method that uses rationales from large language models as additional supervision in a multi-task framework to train smaller models. It reports that this yields better performance than standard fine-tuning or distillation with less data, and that a 770M T5 model outperforms the few-shot prompted 540B PaLM model on one benchmark using only 80% of the available data, whereas standard fine-tuning of the same T5 model struggles to match PaLM even with 100%.
Significance. If validated, the results would be significant for making high-performing NLP models more accessible with reduced computational and data resources. The public code release aids in reproducibility.
major comments (2)
- [Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.
- [Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.
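The control conditions requested in the first major comment could be prepared along the following lines. This is a sketch under stated assumptions: `make_ablation_rationales` is a hypothetical helper, and drawing random tokens from the rationale vocabulary with matched length is one possible reading of "random text", not a procedure described by the paper.

```python
import random
from typing import Dict, List


def make_ablation_rationales(rationales: List[str], seed: int = 0) -> Dict[str, List[str]]:
    """Build three rationale conditions for a multi-task ablation.

    "llm"    keeps the original rationales,
    "random" replaces each with random tokens of the same length,
    "empty"  replaces each with an empty string.
    """
    rng = random.Random(seed)
    # Vocabulary of whitespace tokens seen in any rationale (assumption:
    # whitespace tokenization is good enough for a control condition).
    vocab = sorted({tok for r in rationales for tok in r.split()})
    random_text = [
        " ".join(rng.choice(vocab) for _ in r.split()) for r in rationales
    ]
    return {
        "llm": list(rationales),
        "random": random_text,
        "empty": ["" for _ in rationales],
    }
```

Training the same student once per condition, with everything else fixed, would separate the effect of the multi-task format from the effect of rationale content.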
minor comments (2)
- [Abstract] The specific names of the four NLP benchmarks are not listed, which would help readers quickly contextualize the claims.
- [Section 4.1] The description of data usage percentages could clarify whether the 80% subset is randomly sampled or selected based on some criterion.
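For the Section 4.1 comment, a seeded uniform subsample is one way the 80% subset could be drawn and documented. The helper name `sample_fraction` and the choice of uniform sampling without replacement are assumptions for illustration; the page does not say how the paper selects its subsets.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_fraction(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Return a uniformly random subset containing `fraction` of the examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = round(len(examples) * fraction)
    # A dedicated Random instance keeps the subset reproducible for a given seed.
    return random.Random(seed).sample(list(examples), k)
```

Reporting the seed alongside the fraction would let readers reconstruct exactly which examples were used.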
Simulated Author's Rebuttal
We appreciate the referee's detailed and constructive feedback on our manuscript. We have addressed each of the major comments below and made revisions to the paper accordingly to improve its rigor and clarity.
- Referee: [Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.
  Authors: We agree that an ablation study replacing the LLM rationales with random text or empty strings would help isolate the contribution of the rationale content versus the multi-task training format. To address this concern, we have performed this additional experiment. When using random text or empty strings as targets for the rationale generation task, the performance of the smaller model drops significantly compared to using the actual LLM-generated rationales, approaching the levels seen in standard fine-tuning. These results confirm that the semantic content of the rationales is key to the observed gains. We will include this ablation analysis in the revised Section 3 and provide the corresponding results in a new table.
  revision: yes
- Referee: [Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.
  Authors: We acknowledge the value of statistical measures such as p-values or confidence intervals for strengthening the claims, particularly for the data-efficiency results in Table 2. However, repeating the full set of experiments multiple times is computationally prohibitive given the scale of the models involved. Following practices in similar large-scale NLP papers, we report results from single runs but have ensured consistency across four different benchmarks. In the revised manuscript, we have added a discussion of this limitation in the experimental setup section and included variance estimates from multiple seeds for the smaller-scale experiments where feasible. We believe the trends observed across benchmarks provide sufficient support for our conclusions.
  revision: partial
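Where retraining is too expensive, a paired bootstrap over test examples gives an interval for the accuracy gap between two systems from a single run each. The sketch below is one possible approach, not something the paper or rebuttal reports; `paired_bootstrap_ci` and its parameters are hypothetical, and the interval reflects test-set sampling variance only, not variance across training seeds.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    `correct_a[i]` and `correct_b[i]` are 1 if the system answered test
    example i correctly and 0 otherwise, so the two lists are paired.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that excludes zero would support the claim that one system is more accurate on that test set; it would still not address how much the result moves between training runs.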
Circularity Check
Empirical training comparisons contain no circular derivation
The paper reports measured accuracy improvements from multi-task fine-tuning on LLM-generated rationales versus standard fine-tuning or few-shot prompting. All headline numbers (770M T5 outperforming 540B PaLM on 80% data) are direct experimental outcomes on fixed benchmarks, not quantities obtained by solving the paper's own equations or by renaming fitted parameters. No self-citation chain is invoked to justify uniqueness or to close a derivation loop; the method is presented as an empirical recipe whose value is assessed by external test-set performance.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-task loss weighting coefficient
axioms (1)
- domain assumption: LLM-generated rationales provide useful additional supervision that improves generalization of the student model
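The single free parameter listed above can be made concrete with a small sketch, assuming the multi-task objective is a weighted sum of a label loss and a rationale loss, each a mean negative log-likelihood over target tokens. The function names and the `rationale_weight` argument are illustrative; the page does not give the exact form of the objective or the value used.

```python
import math
from typing import Sequence


def mean_nll(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens.

    Each entry is the probability the model assigned to the correct token.
    """
    if len(target_token_probs) == 0:
        raise ValueError("need at least one target token")
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def multitask_loss(
    label_token_probs: Sequence[float],
    rationale_token_probs: Sequence[float],
    rationale_weight: float = 1.0,
) -> float:
    """Label loss plus a weighted rationale loss (assumed form of the objective)."""
    label_loss = mean_nll(label_token_probs)
    rationale_loss = mean_nll(rationale_token_probs)
    return label_loss + rationale_weight * rationale_loss
```

Setting `rationale_weight` to zero recovers a label-only objective, which is why the coefficient matters when attributing gains to the rationales.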
Forward citations
Cited by 22 Pith papers
- Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
  Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Internalized Reasoning for Long-Context Visual Document Understanding
  A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- EmbGen: Teaching with Reassembled Corpora
  EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on...
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
- Generating Leakage-Free Benchmarks for Robust RAG Evaluation
  SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.
- A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
  VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
- Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
  Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
- Deep sequence models tend to memorize geometrically; it is unclear why
  Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
- Fine-Tuning Code Language Models to Detect Cross-Language Bugs
  Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
- The False Promise of Imitating Proprietary LLMs
  Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
- Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
  QLoRA fine-tuning on ~1700 examples internalizes tool knowledge in Gemma-4B and Qwen3-4B, enabling description-free inference that cuts input length by 82.6% and raises planning scores above an informed baseline.
- ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
  ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
- Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
  Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
- Online In-Context Distillation for Low-Resource Vision Language Models
  Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
- MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
  MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
- Energy-Aware Routing to Large Reasoning Models
  In the critical regime for energy provisioning to large reasoning models, performance is volatility-limited, motivating variance-aware routing policies based on training and inference compute scaling laws.
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
  Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
- A Survey on Efficient Inference for Large Language Models
  The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
  This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.