Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Alexander Ratner; Cheng-Yu Hsieh; Chen-Yu Lee; Chih-Kuan Yeh; Chun-Liang Li; Hootan Nakhost; Ranjay Krishna; Tomas Pfister; Yasuhisa Fujii

arxiv: 2305.02301 · v2 · pith:FB5JQX5Unew · submitted 2023-05-03 · 💻 cs.CL · cs.AI · cs.LG

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Pith reviewed 2026-05-21 20:44 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: knowledge distillation, large language models, rationale extraction, multi-task learning, model compression, few-shot prompting, natural language processing

The pith

Smaller models trained on large language model rationales outperform much larger models with less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training approach that uses step-by-step reasoning explanations from large language models to supervise smaller models through a multi-task objective. This lets the compact models learn both the correct answers and the reasoning behind them. As a result, a 770-million-parameter T5 model exceeds the few-shot accuracy of a 540-billion-parameter PaLM model on one of the benchmarks while using only 80 percent of the available training data. Standard fine-tuning of the same small model fails to match the large model even when given the full dataset. The method addresses the practical problem of deploying large models by showing that both model size and data volume can be reduced without sacrificing performance.

Core claim

By extracting rationales generated by a large language model and adding them as extra supervision signals in a multi-task framework, smaller student models can be trained to outperform the original large model on downstream tasks while requiring substantially fewer labeled or unlabeled training examples than either standard fine-tuning or conventional distillation.

What carries the argument

Distilling step-by-step, the process of using large language model rationales as additional supervision targets alongside task labels inside a single multi-task training objective for the smaller model.
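
To make the multi-task setup concrete, here is a minimal sketch of how one annotated input could be turned into two training examples, one per task. The prefix strings, the `Seq2SeqExample` type, and the `build_multitask_examples` helper are hypothetical names chosen for illustration; the summary above does not say how the two tasks are marked, and the whale example is made up.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical task markers (an assumption, not taken from the summary above).
LABEL_PREFIX = "[label]"
RATIONALE_PREFIX = "[rationale]"


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the student model
    target: str  # text the student is trained to generate


def build_multitask_examples(question: str, label: str, rationale: str) -> List[Seq2SeqExample]:
    """Turn one annotated input into two training examples, one per task."""
    return [
        Seq2SeqExample(f"{LABEL_PREFIX} {question}", label),
        Seq2SeqExample(f"{RATIONALE_PREFIX} {question}", rationale),
    ]


# Toy input, invented for illustration only.
examples = build_multitask_examples(
    question="Is a whale a fish?",
    label="no",
    rationale="Whales are mammals that breathe air, so they are not fish.",
)
for ex in examples:
    print(ex.source, "->", ex.target)
```

Under this framing the rationale is only a training target: at inference time the student would be queried with the label task alone, so no rationale has to be generated when serving predictions.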

If this is right

Smaller models reach higher accuracy than few-shot prompted large models on the tested NLP tasks.
Both finetuning and distillation baselines require more training examples to reach comparable performance.
The same small model size can match or beat a much larger model when rationales are included in training.
Reductions in both model parameters and data volume occur simultaneously without loss of accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to domains where step-by-step explanations can be generated, such as code or math problems.
Focus could shift from collecting more human labels toward improving the quality of machine-generated rationales.
Resource-limited settings could adopt smaller models more readily if the method generalizes beyond the four benchmarks.

Load-bearing premise

The rationales generated by the large language model must be accurate and consistent enough to supply useful guidance to the smaller model rather than adding noise or systematic mistakes.

What would settle it

Direct comparison on the reported benchmark showing whether the 770M T5 model trained with the step-by-step method on 80 percent of the data exceeds the few-shot accuracy of the 540B PaLM model; failure to exceed would falsify the central performance claim.

Original abstract

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a 770M T5 trained on labels plus LLM rationales can beat few-shot 540B PaLM with less data than standard fine-tuning needs, but the gains rest on unverified rationale quality.

The main thing to know is that this multi-task distillation approach—training the student on both the final label and the teacher's step-by-step rationale—produces measurable gains over plain fine-tuning and standard distillation on four NLP benchmarks. Their headline comparison is that the 770M T5 with 80% data beats the 540B PaLM few-shot baseline, while the same T5 trained normally on 100% data does not reach it. They release code, which makes the setup easy to inspect or replicate.

Referee Report

2 major / 2 minor

Summary. The paper introduces Distilling step-by-step, a method that uses rationales from large language models as additional supervision in a multi-task framework to train smaller models. It reports that this yields better performance than standard fine-tuning or distillation with less data, and that a 770M T5 model outperforms a few-shot prompted 540B PaLM model on one benchmark using only 80% of the data, while standard fine-tuning of the same T5 model does not match PaLM even with 100%.

Significance. If validated, the results would be significant for making high-performing NLP models more accessible with reduced computational and data resources. The public code release aids in reproducibility.

major comments (2)

[Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.
[Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.
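
One inexpensive way to attach an interval to a headline accuracy gap, without retraining anything, is a paired bootstrap over test examples. The sketch below is an illustration rather than anything the paper reports: `bootstrap_accuracy_gap` is a hypothetical helper, the correctness vectors are toy values, and the interval only reflects test-set sampling variability, not variance across training seeds.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_gap(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired bootstrap over test examples.

    correct_a[i] and correct_b[i] are 1 if system A / system B answered
    test example i correctly, else 0. Returns (observed gap, CI low, CI high)
    for accuracy(A) - accuracy(B).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / n
    gaps = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping the A/B pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    low = gaps[int((alpha / 2) * n_resamples)]
    high = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, low, high


# Toy per-example correctness vectors (made up, not results from the paper).
student = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
teacher = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
gap, low, high = bootstrap_accuracy_gap(student, teacher)
print(f"gap={gap:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the resulting interval excludes zero, the gap is unlikely to be an artifact of which test examples happened to be drawn; an interval that straddles zero would support the referee's concern.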

minor comments (2)

[Abstract] The specific names of the four NLP benchmarks are not listed, which would help readers quickly contextualize the claims.
[Section 4.1] The description of data usage percentages could clarify whether the 80% subset is randomly sampled or selected based on some criterion.
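
For reference, a uniformly random, reproducible subset of the kind this comment asks about could be drawn as follows. This is a sketch under the assumption that the subset is sampled at random with a fixed seed; `sample_fraction` is a hypothetical helper, and whether the paper samples this way is exactly what the comment leaves open.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_fraction(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Return a uniformly random subset holding `fraction` of the examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = round(len(examples) * fraction)
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return rng.sample(list(examples), k)


train = [f"example-{i}" for i in range(10)]
subset = sample_fraction(train, 0.8, seed=42)
print(len(subset))
```

Reporting the seed alongside the percentage would let readers regenerate the exact subset.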

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We have addressed each of the major comments below and made revisions to the paper accordingly to improve its rigor and clarity.

Referee: [Section 3] The multi-task loss combines label prediction and rationale generation; however, no ablation is presented that replaces the LLM rationales with random text or empty strings to isolate whether the performance gains stem from the semantic content of the rationales or merely from the multi-task format. This directly addresses the weakest assumption regarding rationale quality.

Authors: We agree that an ablation study replacing the LLM rationales with random text or empty strings would help isolate the contribution of the rationale content versus the multi-task training format. To address this concern, we have performed this additional experiment. When using random text or empty strings as targets for the rationale generation task, the performance of the smaller model drops significantly compared to using the actual LLM-generated rationales, approaching the levels seen in standard fine-tuning. These results confirm that the semantic content of the rationales is key to the observed gains. We will include this ablation analysis in the revised Section 3 and provide the corresponding results in a new table.
revision: yes

Referee: [Table 2] The headline result comparing the 770M T5 to the 540B PaLM lacks reported p-values or confidence intervals from repeated experiments, undermining confidence in the data-efficiency claim.

Authors: We acknowledge the value of statistical measures such as p-values or confidence intervals for strengthening the claims, particularly for the data-efficiency results in Table 2. However, repeating the full set of experiments multiple times is computationally prohibitive given the scale of the models involved. Following practices in similar large-scale NLP papers, we report results from single runs but have ensured consistency across four different benchmarks. In the revised manuscript, we have added a discussion of this limitation in the experimental setup section and included variance estimates from multiple seeds for the smaller-scale experiments where feasible. We believe the trends observed across benchmarks provide sufficient support for our conclusions.
revision: partial
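
The rationale ablation discussed in the first exchange could be set up along these lines. This is a hedged sketch, not the authors' procedure: `make_ablation_rationales` is a hypothetical helper, and word shuffling stands in for "random text" because it preserves each rationale's length and vocabulary while destroying its meaning.

```python
import random
from typing import Dict, List


def make_ablation_rationales(rationales: List[str], seed: int = 0) -> Dict[str, List[str]]:
    """Build control rationale targets that keep the multi-task format.

    - "original": the LLM rationales, unchanged
    - "shuffled_words": each rationale's own words in random order
    - "empty": empty strings
    """
    rng = random.Random(seed)
    shuffled = []
    for text in rationales:
        words = text.split()
        rng.shuffle(words)  # in-place shuffle of this rationale's words
        shuffled.append(" ".join(words))
    return {
        "original": list(rationales),
        "shuffled_words": shuffled,
        "empty": ["" for _ in rationales],
    }


# Toy rationales, invented for illustration only.
variants = make_ablation_rationales([
    "Whales are mammals that breathe air, so they are not fish.",
    "Two plus three equals five, so the answer is five.",
])
for name, targets in variants.items():
    print(name, targets)
```

Training the same student once per variant, with everything else held fixed, would separate the effect of rationale content from the effect of merely adding a second generation task.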

Circularity Check

0 steps flagged

Empirical training comparisons contain no circular derivation

The paper reports measured accuracy improvements from multi-task fine-tuning on LLM-generated rationales versus standard fine-tuning or few-shot prompting. All headline numbers (770M T5 outperforming 540B PaLM on 80% data) are direct experimental outcomes on fixed benchmarks, not quantities obtained by solving the paper's own equations or by renaming fitted parameters. No self-citation chain is invoked to justify uniqueness or to close a derivation loop; the method is presented as an empirical recipe whose value is assessed by external test-set performance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard supervised learning assumptions plus the untested premise that LLM-generated rationales are high-quality and transferable supervision signals. No new physical or mathematical entities are introduced.

free parameters (1)

multi-task loss weighting coefficient
The relative weight between the answer prediction loss and the rationale prediction loss must be chosen; the abstract does not specify how it is set.
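
As a rough illustration of where this coefficient enters, the sketch below combines the two per-task losses as a weighted sum. The names `sequence_nll`, `multitask_loss`, and `rationale_weight` are hypothetical, the token probabilities are toy values, and the default weight of 1.0 is an arbitrary placeholder rather than a setting taken from the paper.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of one target sequence.

    token_probs[t] is the probability the model assigned to the
    correct target token at position t.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multitask_loss(
    label_token_probs: List[float],
    rationale_token_probs: List[float],
    rationale_weight: float = 1.0,  # placeholder value, an assumption
) -> float:
    """Weighted sum: L = L_label + rationale_weight * L_rationale."""
    label_loss = sequence_nll(label_token_probs)
    rationale_loss = sequence_nll(rationale_token_probs)
    return label_loss + rationale_weight * rationale_loss


# Toy probabilities, invented for illustration only.
loss = multitask_loss([0.9], [0.6, 0.7, 0.8], rationale_weight=0.5)
print(round(loss, 4))
```

Setting `rationale_weight` to zero recovers plain label-only fine-tuning, which is why the choice of this value matters when comparing against the fine-tuning baseline.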

axioms (1)

domain assumption: LLM-generated rationales provide useful additional supervision that improves generalization of the student model
Invoked when the method claims performance gains from the rationale-augmented multi-task objective.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Internalized Reasoning for Long-Context Visual Document Understanding
cs.CV 2026-03 unverdicted novelty 7.0

A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
EmbGen: Teaching with Reassembled Corpora
cs.CL 2026-05 unverdicted novelty 6.0

EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on...
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 conditional novelty 6.0

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
Generating Leakage-Free Benchmarks for Robust RAG Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
cs.CL 2026-05 unverdicted novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
cs.CL 2025-12 unverdicted novelty 6.0

Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Deep sequence models tend to memorize geometrically; it is unclear why
cs.LG 2025-10 unverdicted novelty 6.0

Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
Fine-Tuning Code Language Models to Detect Cross-Language Bugs
cs.SE 2025-07 conditional novelty 6.0

Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
cs.CL 2026-05 unverdicted novelty 5.0

QLoRA fine-tuning on ~1700 examples internalizes tool knowledge in Gemma-4B and Qwen3-4B, enabling description-free inference that cuts input length by 82.6% and raises planning scores above an informed baseline.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
cs.CL 2026-04 conditional novelty 5.0

Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
Online In-Context Distillation for Low-Resource Vision Language Models
cs.CV 2025-10 unverdicted novelty 5.0

Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
cs.CY 2026-04 unverdicted novelty 4.0

MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
Energy-Aware Routing to Large Reasoning Models
cs.AI 2025-12 unverdicted novelty 4.0

In the critical regime for energy provisioning to large reasoning models, performance is volatility-limited, motivating variance-aware routing policies based on training and inference compute scaling laws.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
cs.HC 2024-01 unverdicted novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 20 Pith papers · 20 internal anchors

[5]

Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , booktitle=

work page
[6]

Ethical and social risks of harm from Language Models

Ethical and social risks of harm from language models , author=. arXiv preprint arXiv:2112.04359 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

M easuring Association Between Labels and Free-Text Rationales

Wiegreffe, Sarah and Marasovi \'c , Ana and Smith, Noah A. M easuring Association Between Labels and Free-Text Rationales. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021

work page 2021
[9]

Using `` Annotator Rationales '' to Improve Machine Learning for Text Categorization

Zaidan, Omar and Eisner, Jason and Piatko, Christine. Using `` Annotator Rationales '' to Improve Machine Learning for Text Categorization. Human Language Technologies 2007: The Conference of the North A merican Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. 2007

work page 2007
[13]

Proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

Model reconstruction from model explanations , author=. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

work page
[14]

doi: 10.18653/v1/2020.acl-main.703

Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguisti...

work page doi:10.18653/v1/2020.acl-main.703 2020
[16]

arXiv preprint arXiv:2004.03097 , year=

Towards non-task-specific distillation of BERT via sentence representation approximation , author=. arXiv preprint arXiv:2004.03097 , year=

work page arXiv 2004
[17]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

work page
[18]

International Conference on Machine Learning , pages=

Knowledge transfer with jacobian matching , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[20]

Advances in neural information processing systems , volume=

Big self-supervised models are strong semi-supervised learners , author=. Advances in neural information processing systems , volume=

work page
[22]

Improving language models by retrieving from trillions of tokens

Improving language models by retrieving from trillions of tokens , author=. arXiv preprint arXiv:2112.04426 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Knowledge distillation: A good teacher is patient and consistent , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[28]

International Conference on Machine Learning , pages=

Born again neural networks , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[30]

Transactions of the Association for Computational Linguistics , volume=

Evaluating Explanations: How much do explanations from the teacher aid students? , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

work page 2022
[35]

Adversarial NLI : A New Benchmark for Natural Language Understanding

Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe. Adversarial NLI : A New Benchmark for Natural Language Understanding. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

work page 2020
[38]

Advances in Neural Information Processing Systems , editor=

Weighted Distillation with Unlabeled Examples , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[41]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

work page
[47]

International Conference on Learning Representations , year=

Finetuned Language Models are Zero-Shot Learners , author=. International Conference on Learning Representations , year=

work page
[48]

International Conference on Machine Learning , pages=

Calibrate before use: Improving few-shot performance of language models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[53]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[59]

Advances in Neural Information Processing Systems , volume=

e-snli: Natural language inference with natural language explanations , author=. Advances in Neural Information Processing Systems , volume=

work page
[60]

European Conference on Computer Vision , pages=

Side-tuning: a baseline for network adaptation via additive side networks , author=. European Conference on Computer Vision , pages=. 2020 , organization=

work page 2020
[61]

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

Model compression , author=. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

work page
[62]

Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, and Mirella Lapata. 2022. Qameleon: Multilingual qa with only 5 examples. arXiv preprint arXiv:2211.08264

work page arXiv 2022
[63]

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher R \'e . 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441

work page arXiv 2022
[64]

Lucas Beyer, Xiaohua Zhai, Am \'e lie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925--10934

work page 2022
[65]

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. https://arxiv.org/abs/2204.06745 GPT-NeoX-20B : An open-source autoregressive language model ....

work page internal anchor Pith review Pith/arXiv arXiv 2022
[66]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

work page 2020
[67]

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535--541

work page 2006
[68]

Oana-Maria Camburu, Tim Rockt \"a schel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31

work page 2018
[69]

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243--22255

work page 2020
[70]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

work page internal anchor Pith review Pith/arXiv arXiv 2022
[71]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[72]

Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, and David Mimno. 2022. Honest students from untrusted teachers: Learning an interpretable question-answering pipeline from a pretrained language model. arXiv preprint arXiv:2210.02498

work page arXiv 2022
[73]

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726

work page arXiv 2023
[74]

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415

work page internal anchor Pith review Pith/arXiv arXiv 2019
[75]

Peter Hase and Mohit Bansal. 2021. When can models learn from explanations? a formal framework for understanding the roles of explanation data. arXiv preprint arXiv:2102.02201

work page arXiv 2021
[76]

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[77]

Namgyu Ho, Laura Schmid, and Se-Young Yun. 2022. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071

work page arXiv 2022
[78]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[79]

Jeremy Howard and Sebastian Ruder. 2018. https://doi.org/10.18653/v1/P18-1031 Universal language model fine-tuning for text classification . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia. Association for Computational Linguistics

work page doi:10.18653/v1/p18-1031 2018
[80]

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610

work page internal anchor Pith review Pith/arXiv arXiv 2022
[81]

Fotis Iliopoulos, Vasilis Kontonis, Cenk Baykal, Gaurav Menghani, Khoa Trinh, and Erik Vee. 2022. https://openreview.net/forum?id=M34VHvEU4NZ Weighted distillation with unlabeled examples . In Advances in Neural Information Processing Systems

work page 2022
[82]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

work page internal anchor Pith review Pith/arXiv arXiv 2022
[83]

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691

work page internal anchor Pith review Pith/arXiv arXiv 2021
[84]

Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. 2023. Symbolic chain-of-thought distillation: Small models can also" think" step-by-step. arXiv preprint arXiv:2306.14050

work page arXiv 2023
[85]

Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, and Lawrence Carin. 2020. Mixkd: Towards efficient distillation of large-scale language models. arXiv preprint arXiv:2011.00593

work page arXiv 2020
[86]

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410

work page arXiv 2022
[87]

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975--984

work page 2020
[88]

Smitha Milli, Ludwig Schmidt, Anca D Dragan, and Moritz Hardt. 2019. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 1--9

[89]

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. Wt5?! training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546

[90]

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics

[91]

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114

[92]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ... https://doi.org/10.18653/v1/2021.naacl-main.168

[93]

Danish Pruthi, Rachit Bansal, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C Lipton, Graham Neubig, and William W Cohen. 2022. Evaluating explanations: How much do explanations from the teacher aid students? Transactions of the Association for Computational Linguistics, 10:359--375

[94]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67. http://jmlr.org/papers/v21/20-074.html

[95]

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932--4942, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1487

[96]

Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717

[97]

Ryan Smith, Jason A Fries, Braden Hancock, and Stephen H Bach. 2022a. Language models in the loop: Incorporating prompting into weak supervision. arXiv preprint arXiv:2205.02318

[98]

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022b. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990

[99]

Suraj Srinivas and François Fleuret. 2018. Knowledge transfer with jacobian matching. In International Conference on Machine Learning, pages 4723--4731. PMLR

[100]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ... https://doi.org/10.18653/v1/N19-1421

[101]

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136

[102]

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239

[103]

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can't plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498

[104]

Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, and Xiang Ren. 2022a. Pinto: Faithful language reasoning using prompt-generated rationales. arXiv preprint arXiv:2211.01562

[105]

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? gpt-3 can help. arXiv preprint arXiv:2108.13487

[106]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

[107]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

[108]

Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. 2021. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178

[109]

Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2021. Measuring association between labels and free-text rationales. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266--10284, Online and Punta Cana, Dominican Republic. Association for Computational Li... https://aclanthology.org/2021.emnlp-main.804

[110]

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260--... https://aclanthology.org/N07-1033

[111]

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. 2022. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465

[112]

Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. 2020. Side-tuning: a baseline for network adaptation via additive side networks. In European Conference on Computer Vision, pages 698--714. Springer

[113]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

[114]

Ye Zhang, Iain Marshall, and Byron C. Wallace. 2016. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795--804, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1076

[115]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E Gonzalez, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023

