arxiv: 2305.02301 · v2 · pith:FB5JQX5Unew · submitted 2023-05-03 · 💻 cs.CL · cs.AI · cs.LG

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Pith reviewed 2026-05-21 20:44 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords knowledge distillation · large language models · rationale extraction · multi-task learning · model compression · few-shot prompting · natural language processing

The pith

Smaller models trained on large language model rationales outperform much larger models with less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training approach in which step-by-step rationales generated by a large language model supervise a smaller model through a multi-task objective, so the compact model learns both the correct answer and the reasoning behind it. The headline result is that a 770 million parameter T5 model exceeds the few-shot accuracy of a 540 billion parameter PaLM model on one benchmark while using only 80 percent of the available training data. Standard fine-tuning of the same small model fails to match the large model even when given the full dataset. The method addresses the practical problem of deploying large models by showing that model size and data volume can both be reduced without sacrificing performance on the tested benchmarks.

Core claim

By extracting rationales generated by a large language model and adding them as extra supervision signals in a multi-task framework, smaller student models can be trained to outperform the original large model on downstream tasks while requiring substantially fewer labeled or unlabeled training examples than either standard fine-tuning or conventional distillation.
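The extraction step can be pictured with a minimal sketch, assuming the teacher LLM returns a chain-of-thought completion that ends with a sentence such as "So the answer is ...". The marker phrase, the `split_rationale_and_label` helper, and the `Extracted` type are illustrative assumptions and are not taken from the paper or its released code.

```python
import re
from typing import NamedTuple, Optional

# Assumed output convention (hypothetical): free-text reasoning followed by
# a final "So the answer is <label>." sentence.
ANSWER_PATTERN = re.compile(
    r"\s*So the answer is\s*(.+?)\.?\s*$", re.IGNORECASE | re.DOTALL
)


class Extracted(NamedTuple):
    rationale: str
    label: Optional[str]


def split_rationale_and_label(completion: str) -> Extracted:
    """Split a teacher completion into its rationale and its final label.

    If the answer marker is missing, the whole completion is kept as the
    rationale and the label is None, so the caller can decide what to do.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return Extracted(completion.strip(), None)
    rationale = completion[: match.start()].strip()
    return Extracted(rationale, match.group(1).strip())
```

Returning `None` for an unparseable completion, instead of guessing a label, keeps malformed teacher outputs visible, which matters because the method depends on rationale quality.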

What carries the argument

Distilling step-by-step, the process of using large language model rationales as additional supervision targets alongside task labels inside a single multi-task training objective for the smaller model.
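A minimal sketch of such a multi-task objective follows, assuming the two tasks are distinguished by a text prefix on the input and that the total loss is the label loss plus a weighted rationale loss. The prefix strings, the `rationale_weight` parameter, and all function names are assumptions for illustration; the review text does not specify how the weighting is set.

```python
import math
from typing import List, Sequence, Tuple


def make_multitask_examples(
    question: str, label: str, rationale: str
) -> List[Tuple[str, str]]:
    """Turn one annotated item into two (input, target) training pairs.

    The "[label]" and "[rationale]" prefixes are assumed task markers that
    tell the student which output is expected for the same input.
    """
    return [
        (f"[label] {question}", label),
        (f"[rationale] {question}", rationale),
    ]


def sequence_nll(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens under the model."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def multitask_loss(
    label_token_probs: Sequence[float],
    rationale_token_probs: Sequence[float],
    rationale_weight: float = 1.0,
) -> float:
    """Label loss plus weighted rationale loss (the assumed combined objective)."""
    label_loss = sequence_nll(label_token_probs)
    rationale_loss = sequence_nll(rationale_token_probs)
    return label_loss + rationale_weight * rationale_loss
```

Because the rationale is a separate training target and is not appended to the input, a student trained this way would only need to run the label task at inference time.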

If this is right

  • Smaller models reach higher accuracy than few-shot prompted large models on the tested NLP tasks.
  • Both finetuning and distillation baselines require more training examples to reach comparable performance.
  • The same small model size can match or beat a much larger model when rationales are included in training.
  • Reductions in both model parameters and data volume occur simultaneously without loss of accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to domains where step-by-step explanations can be generated, such as code or math problems.
  • Focus could shift from collecting more human labels toward improving the quality of machine-generated rationales.
  • Resource-limited settings could adopt smaller models more readily if the method generalizes beyond the four benchmarks.

Load-bearing premise

The rationales generated by the large language model must be accurate and consistent enough to supply useful guidance to the smaller model rather than adding noise or systematic mistakes.

What would settle it

Direct comparison on the reported benchmark showing whether the 770M T5 model trained with the step-by-step method on 80 percent of the data exceeds the few-shot accuracy of the 540B PaLM model; failure to exceed would falsify the central performance claim.

Original abstract

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Distilling step-by-step, a method that uses rationales from large language models as additional supervision in a multi-task framework to train smaller models. It reports that the method outperforms standard fine-tuning and distillation with less data, and that a 770M T5 model outperforms a few-shot prompted 540B PaLM model on one benchmark using only 80% of the data, whereas standard fine-tuning of the same model cannot match it even with 100%.

Significance. If validated, the results would be significant for making high-performing NLP models more accessible with reduced computational and data resources. The public code release aids in reproducibility.

major comments (2)
  1. [Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.
  2. [Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.
minor comments (2)
  1. [Abstract] The specific names of the four NLP benchmarks are not listed, which would help readers quickly contextualize the claims.
  2. [Section 4.1] The description of data usage percentages could clarify whether the 80% subset is randomly sampled or selected based on some criterion.
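The ablation requested in major comment 1 and the sampling question in minor comment 2 can both be made concrete with a short sketch. It assumes each training example is a dictionary with `input`, `label`, and `rationale` keys, and it uses rationales shuffled across examples as a stand-in for "random text"; the field names, function names, and the shuffling choice are all hypothetical.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # assumed keys: "input", "label", "rationale"


def ablate_rationales(
    examples: List[Example], mode: str, seed: int = 0
) -> List[Example]:
    """Return a copy of the data with degraded rationale targets.

    mode="empty":    every rationale becomes the empty string.
    mode="shuffled": rationales are permuted across examples, so each one
                     stays fluent but is generally unrelated to its input.
    Inputs and labels are left untouched in both modes.
    """
    if mode == "empty":
        new_rationales = ["" for _ in examples]
    elif mode == "shuffled":
        new_rationales = [ex["rationale"] for ex in examples]
        random.Random(seed).shuffle(new_rationales)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [{**ex, "rationale": r} for ex, r in zip(examples, new_rationales)]


def random_subset(
    examples: List[Example], fraction: float, seed: int = 0
) -> List[Example]:
    """Uniformly sample a fixed fraction of the data without replacement."""
    k = round(len(examples) * fraction)
    return random.Random(seed).sample(examples, k)
```

Training the same student on the original, `empty`, and `shuffled` variants of one `random_subset(data, 0.8)` draw would separate the effect of rationale content from the effect of the multi-task format.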

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We have addressed each of the major comments below and made revisions to the paper accordingly to improve its rigor and clarity.

Point-by-point responses
  1. Referee: [Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.

    Authors: We agree that an ablation study replacing the LLM rationales with random text or empty strings would help isolate the contribution of the rationale content versus the multi-task training format. To address this concern, we have performed this additional experiment. When using random text or empty strings as targets for the rationale generation task, the performance of the smaller model drops significantly compared to using the actual LLM-generated rationales, approaching the levels seen in standard fine-tuning. These results confirm that the semantic content of the rationales is key to the observed gains. We will include this ablation analysis in the revised Section 3 and provide the corresponding results in a new table. revision: yes

  2. Referee: [Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.

    Authors: We acknowledge the value of statistical measures such as p-values or confidence intervals for strengthening the claims, particularly for the data-efficiency results in Table 2. However, repeating the full set of experiments multiple times is computationally prohibitive given the scale of the models involved. Following practices in similar large-scale NLP papers, we report results from single runs but have ensured consistency across four different benchmarks. In the revised manuscript, we have added a discussion of this limitation in the experimental setup section and included variance estimates from multiple seeds for the smaller-scale experiments where feasible. We believe the trends observed across benchmarks provide sufficient support for our conclusions. revision: partial
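One inexpensive way to attach an interval to a single-run comparison is a paired bootstrap over test examples, sketched below. It is offered as an illustration of the referee's request and is not something the paper or the rebuttal reports; the function name and its parameters are hypothetical, and the inputs are assumed to be per-example 0/1 correctness flags for two models on the same test set.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    Both sequences must refer to the same test examples in the same order,
    so every resample keeps the two models' outcomes paired.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test-example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

This interval reflects test-set sampling variability only. It needs no retraining, but it does not capture variance across training seeds, which would still require repeated runs.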

Circularity Check

0 steps flagged

Empirical training comparisons contain no circular derivation

Full rationale

The paper reports measured accuracy improvements from multi-task fine-tuning on LLM-generated rationales versus standard fine-tuning or few-shot prompting. All headline numbers (770M T5 outperforming 540B PaLM on 80% data) are direct experimental outcomes on fixed benchmarks, not quantities obtained by solving the paper's own equations or by renaming fitted parameters. No self-citation chain is invoked to justify uniqueness or to close a derivation loop; the method is presented as an empirical recipe whose value is assessed by external test-set performance.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on standard supervised learning assumptions plus the untested premise that LLM-generated rationales are high-quality and transferable supervision signals. No new physical or mathematical entities are introduced.

free parameters (1)
  • multi-task loss weighting coefficient
    The relative weight between the answer prediction loss and the rationale prediction loss must be chosen; the abstract does not specify how it is set.
axioms (1)
  • domain assumption: LLM-generated rationales provide useful additional supervision that improves generalization of the student model
    Invoked when the method claims performance gains from the rationale-augmented multi-task objective.

pith-pipeline@v0.9.0 · 5825 in / 1232 out tokens · 25144 ms · 2026-05-21T20:44:25.048971+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  2. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  3. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  4. EmbGen: Teaching with Reassembled Corpora

    cs.CL 2026-05 unverdicted novelty 6.0

    EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on...

  5. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  6. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

  7. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  8. Generating Leakage-Free Benchmarks for Robust RAG Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.

  9. A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

    cs.CL 2026-05 unverdicted novelty 6.0

    VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

  10. Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

    cs.CL 2025-12 unverdicted novelty 6.0

    Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.

  11. Deep sequence models tend to memorize geometrically; it is unclear why

    cs.LG 2025-10 unverdicted novelty 6.0

    Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.

  12. Fine-Tuning Code Language Models to Detect Cross-Language Bugs

    cs.SE 2025-07 conditional novelty 6.0

    Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.

  13. The False Promise of Imitating Proprietary LLMs

    cs.CL 2023-05 conditional novelty 6.0

    Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.

  14. Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 5.0

    QLoRA fine-tuning on ~1700 examples internalizes tool knowledge in Gemma-4B and Qwen3-4B, enabling description-free inference that cuts input length by 82.6% and raises planning scores above an informed baseline.

  15. ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

  16. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  17. Online In-Context Distillation for Low-Resource Vision Language Models

    cs.CV 2025-10 unverdicted novelty 5.0

    Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.

  18. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  19. Energy-Aware Routing to Large Reasoning Models

    cs.AI 2025-12 unverdicted novelty 4.0

    In the critical regime for energy provisioning to large reasoning models, performance is volatility-limited, motivating variance-aware routing policies based on training and inference compute scaling laws.

  20. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  21. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  22. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    cs.HC 2024-01 unverdicted novelty 3.0

    This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 20 Pith papers · 20 internal anchors

  1. [5]

    Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , booktitle=

  2. [6]

    Ethical and social risks of harm from Language Models

    Ethical and social risks of harm from language models , author=. arXiv preprint arXiv:2112.04359 , year=

  3. [7]

    M easuring Association Between Labels and Free-Text Rationales

    Wiegreffe, Sarah and Marasovi \'c , Ana and Smith, Noah A. M easuring Association Between Labels and Free-Text Rationales. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021

  4. [9]

    Using `` Annotator Rationales '' to Improve Machine Learning for Text Categorization

    Zaidan, Omar and Eisner, Jason and Piatko, Christine. Using `` Annotator Rationales '' to Improve Machine Learning for Text Categorization. Human Language Technologies 2007: The Conference of the North A merican Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. 2007

  5. [13]

    Proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

    Model reconstruction from model explanations , author=. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

  6. [14]

    doi: 10.18653/v1/2020.acl-main.703

    Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguisti...

  7. [16]

    arXiv preprint arXiv:2004.03097 , year=

    Towards non-task-specific distillation of BERT via sentence representation approximation , author=. arXiv preprint arXiv:2004.03097 , year=

  8. [17]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  9. [18]

    International Conference on Machine Learning , pages=

    Knowledge transfer with jacobian matching , author=. International Conference on Machine Learning , pages=. 2018 , organization=

  10. [20]

    Advances in neural information processing systems , volume=

    Big self-supervised models are strong semi-supervised learners , author=. Advances in neural information processing systems , volume=

  11. [22]

    Improving language models by retrieving from trillions of tokens

    Improving language models by retrieving from trillions of tokens , author=. arXiv preprint arXiv:2112.04426 , year=

  12. [24]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Knowledge distillation: A good teacher is patient and consistent , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  13. [28]

    International Conference on Machine Learning , pages=

    Born again neural networks , author=. International Conference on Machine Learning , pages=. 2018 , organization=

  14. [30]

    Transactions of the Association for Computational Linguistics , volume=

    Evaluating Explanations: How much do explanations from the teacher aid students? , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  15. [35]

    Adversarial NLI : A New Benchmark for Natural Language Understanding

    Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe. Adversarial NLI : A New Benchmark for Natural Language Understanding. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  16. [38]

    Advances in Neural Information Processing Systems , editor=

    Weighted Distillation with Unlabeled Examples , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  17. [41]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

  18. [47]

    International Conference on Learning Representations , year=

    Finetuned Language Models are Zero-Shot Learners , author=. International Conference on Learning Representations , year=

  19. [48]

    International Conference on Machine Learning , pages=

    Calibrate before use: Improving few-shot performance of language models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  20. [53]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  21. [59]

    Advances in Neural Information Processing Systems , volume=

    e-snli: Natural language inference with natural language explanations , author=. Advances in Neural Information Processing Systems , volume=

  22. [60]

    European Conference on Computer Vision , pages=

    Side-tuning: a baseline for network adaptation via additive side networks , author=. European Conference on Computer Vision , pages=. 2020 , organization=

  23. [61]

    Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

    Model compression , author=. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

  24. [62]

    Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, and Mirella Lapata. 2022. Qameleon: Multilingual qa with only 5 examples. arXiv preprint arXiv:2211.08264

  25. [63]

    Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher R \'e . 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441

  26. [64]

    Lucas Beyer, Xiaohua Zhai, Am \'e lie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925--10934

  27. [65]

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. https://arxiv.org/abs/2204.06745 GPT-NeoX-20B : An open-source autoregressive language model ....

  28. [66]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  29. [67]

    Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535--541

  30. [68]

    Oana-Maria Camburu, Tim Rockt \"a schel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31

  31. [69]

    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243--22255

  32. [70]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  33. [71]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  34. [72]

    Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, and David Mimno. 2022. Honest students from untrusted teachers: Learning an interpretable question-answering pipeline from a pretrained language model. arXiv preprint arXiv:2210.02498

  35. [73]

    Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726

  36. [74]

    Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415

  37. [75]

    Peter Hase and Mohit Bansal. 2021. When can models learn from explanations? a formal framework for understanding the roles of explanation data. arXiv preprint arXiv:2102.02201

  38. [76]

    Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7)

  39. [77]

    Namgyu Ho, Laura Schmid, and Se-Young Yun. 2022. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071

  40. [78]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556

  41. [79]

    Jeremy Howard and Sebastian Ruder. 2018. https://doi.org/10.18653/v1/P18-1031 Universal language model fine-tuning for text classification . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia. Association for Computational Linguistics

  42. [80]

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610

  43. [81]

    Fotis Iliopoulos, Vasilis Kontonis, Cenk Baykal, Gaurav Menghani, Khoa Trinh, and Erik Vee. 2022. https://openreview.net/forum?id=M34VHvEU4NZ Weighted distillation with unlabeled examples . In Advances in Neural Information Processing Systems

  44. [82]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

  45. [83]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691

  46. [84]

    Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. 2023. Symbolic chain-of-thought distillation: Small models can also" think" step-by-step. arXiv preprint arXiv:2306.14050

  47. [85]

    Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, and Lawrence Carin. 2020. Mixkd: Towards efficient distillation of large-scale language models. arXiv preprint arXiv:2011.00593

  48. [86]

    Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410

  49. [87]

    Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975--984

  50. [88]

    Smitha Milli, Ludwig Schmidt, Anca D Dragan, and Moritz Hardt. 2019. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 1--9

  51. [89]

    Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. Wt5?! training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546

  52. [90]

    Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI : A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics

  53. [91]

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114

  54. [92]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

  55. [93]

    Danish Pruthi, Rachit Bansal, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C Lipton, Graham Neubig, and William W Cohen. 2022. Evaluating explanations: How much do explanations from the teacher aid students? Transactions of the Association for Computational Linguistics, 10:359--375

  56. [94]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67

  57. [95]

    Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. https://doi.org/10.18653/v1/P19-1487 Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932--4942, Florence, Italy. Association for Computational Linguistics

  58. [96]

    Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717

  59. [97]

    Ryan Smith, Jason A Fries, Braden Hancock, and Stephen H Bach. 2022a. Language models in the loop: Incorporating prompting into weak supervision. arXiv preprint arXiv:2205.02318

  60. [98]

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022b. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990

  61. [99]

    Suraj Srinivas and François Fleuret. 2018. Knowledge transfer with jacobian matching. In International Conference on Machine Learning, pages 4723--4731. PMLR

  62. [100]

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://doi.org/10.18653/v1/N19-1421 CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...

  63. [101]

    Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136

  64. [102]

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239

  65. [103]

    Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can't plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498

  66. [104]

    Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, and Xiang Ren. 2022a. Pinto: Faithful language reasoning using prompt-generated rationales. arXiv preprint arXiv:2211.01562

  67. [105]

    Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? gpt-3 can help. arXiv preprint arXiv:2108.13487

  68. [106]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

  69. [107]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  70. [108]

    Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. 2021. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178

  71. [109]

    Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2021. https://aclanthology.org/2021.emnlp-main.804 Measuring association between labels and free-text rationales. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266--10284, Online and Punta Cana, Dominican Republic. Association for Computational Li...

  72. [110]

    Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. https://aclanthology.org/N07-1033 Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260--...

  73. [111]

    Eric Zelikman, Yuhuai Wu, and Noah D Goodman. 2022. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465

  74. [112]

    Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. 2020. Side-tuning: a baseline for network adaptation via additive side networks. In European Conference on Computer Vision, pages 698--714. Springer

  75. [113]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  76. [114]

    Ye Zhang, Iain Marshall, and Byron C. Wallace. 2016. https://doi.org/10.18653/v1/D16-1076 Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795--804, Austin, Texas. Association for Computational Linguistics

  77. [115]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E Gonzalez, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023