Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726
4 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (3)
Citation polarities: background (3)
citing papers explorer
- Validity-Calibrated Reasoning Distillation
  Validity-calibrated reasoning distillation improves the transfer of reasoning skills by modulating updates according to the relative local validity of next steps, rather than enforcing imitation of full trajectories.
- Measuring Faithfulness in Chain-of-Thought Reasoning
  Chain-of-thought reasoning in LLMs is often unfaithful: how much a model relies on its stated reasoning varies by task and decreases as models grow larger.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
  Bootstrapping math questions by rewriting them creates MetaMathQA; LLaMA-2 fine-tuned on it reaches 66.4% on GSM8K at 7B and 82.3% at 70B, beating prior same-size models by large margins.
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
  Applying sparse rewards to capable teachers for exploration, then densely distilling into students, outperforms applying sparse-reward methods such as GRPO directly to the deployment model.