Critique-Guided Distillation for Robust Reasoning via Refinement
Pith reviewed 2026-05-22 14:14 UTC · model grok-4.3
The pith
Critique-Guided Distillation trains models to refine flawed answers using teacher critiques only at training time, yielding stronger mathematical reasoning without inference overhead or capability loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Critique-Guided Distillation decouples critique consumption from critique generation by training the student to refine flawed responses conditioned on teacher critiques during fine-tuning. Critiques function as a training-time-only supervision signal that encourages internalization of error-aware reasoning. Across five model families this produces 7 percent average gains over Critique Fine-Tuning and standard distillation on mathematical reasoning benchmarks, with peaks of +15.0 percent on AMC23 and +12.2 percent on MATH-500, plus higher Pass@1 on AIME24 and AIME25 and no loss in general instruction following.
What carries the argument
Critique-Guided Distillation, the training procedure in which a student model refines flawed responses when conditioned on teacher critiques that are supplied only during fine-tuning and withheld at inference.
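A minimal sketch of how one CGD training example might be assembled, under stated assumptions. The prompt template, the names `build_cgd_example`, `build_inference_prompt`, and `masked_nll`, and the toy numbers are all hypothetical illustrations, not the paper's implementation. The sketch shows the two ideas in the definition above: the critique appears only in the training input, and the loss is scored only on the refined answer.

```python
import math
from dataclasses import dataclass


@dataclass
class CGDExample:
    prompt: str   # conditioning text: problem x, flawed answer y', critique c
    target: str   # refined answer y_hat that the student learns to produce


def build_cgd_example(problem, flawed_answer, critique, refined_answer):
    """Assemble one training pair ((x, y', c), y_hat).

    The template is a hypothetical format chosen for illustration only.
    """
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Draft answer:\n{flawed_answer}\n\n"
        f"Critique:\n{critique}\n\n"
        "Refined answer:\n"
    )
    return CGDExample(prompt=prompt, target=refined_answer)


def build_inference_prompt(problem):
    """At inference the student sees only the problem: no draft, no critique."""
    return f"Problem:\n{problem}\n\nAnswer:\n"


def masked_nll(token_logprobs, is_target):
    """Mean negative log-likelihood over target tokens only.

    token_logprobs: log p(token | prefix) for every token in prompt + target.
    is_target: True for tokens of y_hat, False for prompt tokens.
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    if not picked:
        raise ValueError("no target tokens to score")
    return -sum(picked) / len(picked)


example = build_cgd_example(
    problem="What is 12 * 12?",
    flawed_answer="12 * 12 = 124",
    critique="The product is off: 12 * 12 = 144, not 124.",
    refined_answer="12 * 12 = 144",
)

# Toy log-probabilities: two prompt tokens (ignored) and two target tokens.
loss = masked_nll(
    token_logprobs=[math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)],
    is_target=[False, False, True, True],
)
```

In a real pipeline the log-probabilities would come from the student model's forward pass; masking the prompt tokens is one common way to realize a conditional objective of this kind.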
If this is right
- Models achieve 7 percent average gains on mathematical reasoning benchmarks over both standard distillation and Critique Fine-Tuning.
- Larger lifts appear on hard competition sets, reaching +15.0 percent on AMC23 and +12.2 percent on MATH-500.
- Pass@1 rises and Pass@k at low k improves on AIME24 and AIME25 problems.
- General instruction-following stays intact, avoiding the 21.3 percent drop observed with Critique Fine-Tuning on IFEval.
- No extra computation or architectural changes are required at inference time.
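For readers unfamiliar with the Pass@k metric named in the list above, the following sketch shows one widely used way to estimate it from n sampled solutions of which c are correct. Whether the paper uses this particular estimator is not stated on this page, so treat it as an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to c / n, the plain fraction of correct samples, which is why Pass@1 is often read as per-sample reasoning quality.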
Where Pith is reading between the lines
- The same separation of training-time critique use from inference-time generation could transfer to code generation or multi-step scientific reasoning tasks.
- The method's success hinges on critique quality, implying that stronger teacher models or self-critique loops could amplify the gains.
- After initial CGD training, models might be further improved by occasionally re-introducing light critique signals without full retraining.
Load-bearing premise
The teacher critiques must stay specific and relevant enough throughout training to create genuine internalization of error-aware reasoning rather than superficial pattern matching.
What would settle it
Replacing the teacher critiques with random or generic text during training and measuring whether the performance advantage over standard distillation disappears would directly test whether the method depends on critique quality.
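The control described above could be set up roughly as follows. This is a sketch under assumptions: the dictionary keys, the function names `mismatched_critiques` and `generic_critiques`, the generic critique wording, and the sample records are all hypothetical and not taken from the paper.

```python
import random


def mismatched_critiques(examples, seed=0):
    """Control A: reuse real critiques but attach each to a different example.

    Indices are shuffled into one cycle and every example takes the critique
    of its successor, so no example keeps its own critique.
    """
    if len(examples) < 2:
        raise ValueError("need at least two examples to mismatch critiques")
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    out = [None] * len(examples)
    for pos, idx in enumerate(order):
        donor = order[(pos + 1) % len(order)]
        out[idx] = {**examples[idx], "critique": examples[donor]["critique"]}
    return out


def generic_critiques(examples, text="The answer contains an error. Please fix it."):
    """Control B: replace every critique with the same uninformative sentence."""
    return [{**ex, "critique": text} for ex in examples]


# Hypothetical records used only to exercise the two controls.
examples = [
    {"problem": "p1", "flawed": "f1", "critique": "sign error in step 2", "refined": "r1"},
    {"problem": "p2", "flawed": "f2", "critique": "forgot the base case", "refined": "r2"},
    {"problem": "p3", "flawed": "f3", "critique": "wrong unit conversion", "refined": "r3"},
]
control_a = mismatched_critiques(examples, seed=0)
control_b = generic_critiques(examples)
```

Training on each control with everything else held fixed, then comparing against standard distillation, would indicate how much of the advantage depends on critiques being specific to the flawed answer they accompany.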
Original abstract
Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from generation: the student model is fine-tuned to refine flawed responses conditioned on teacher critiques (used only at training time), with the goal of internalizing error-aware reasoning for mathematical tasks. It claims consistent outperformance over Critique Fine-Tuning (CFT) and standard distillation across five model families on benchmarks including AMC23 (+15.0%), MATH-500 (+12.2%), AIME24/25 (higher Pass@1 and low-k performance), yielding ~7% average gains, while avoiding the capability degradation seen in CFT (e.g., -21.3% on IFEval). Controlled ablations are said to link gains directly to critique specificity and relevance.
Significance. If the results hold under proper controls, CGD offers a practical, inference-efficient intermediate paradigm for improving reasoning robustness without output-format drift or general capability loss. The empirical scale across model families and the ablation evidence tying gains to feedback quality are strengths; however, the significance is tempered by the need to confirm that training-time error patterns transfer to the student's own inference-time generations.
major comments (2)
- [§3] §3 (Method description): The source of flawed responses used in training (e.g., whether generated by the student model itself via rollouts, a fixed weaker model, or synthetic corruption) is not specified. This is load-bearing for the internalization claim, because a distribution mismatch between training flaws and the student's own inference-time errors could reduce the method to supervised correction of foreign error types rather than acquisition of self-aware reasoning. An ablation comparing self-generated vs. external flawed responses is needed to address the distribution-shift concern.
- [§4] §4 (Experiments and results): The reported gains (7% average, +15.0% on AMC23, +12.2% on MATH-500) and ablations lack details on dataset splits, number of runs, statistical significance, or exact hyperparameter controls. Without these, the central empirical claim of consistent outperformance and the attribution to critique specificity remain only partially verifiable.
minor comments (2)
- [Abstract] Abstract: The phrasing that gains are 'directly driven by the specificity and relevance' of feedback should include a forward reference to the specific ablation table or figure for precision.
- [§4] Notation: Ensure consistent use of 'Pass@1' and 'Pass@k' across text and tables; minor inconsistencies in benchmark naming (e.g., AIME24 vs. AIME25) could be clarified.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions we will make.
-
Referee: [§3] §3 (Method description): The source of flawed responses used in training (e.g., whether generated by the student model itself via rollouts, a fixed weaker model, or synthetic corruption) is not specified. This is load-bearing for the internalization claim, because a distribution mismatch between training flaws and the student's own inference-time errors could reduce the method to supervised correction of foreign error types rather than acquisition of self-aware reasoning. An ablation comparing self-generated vs. external flawed responses is needed to address the distribution-shift concern.
Authors: We appreciate the referee highlighting this important aspect of the method. We agree that explicitly stating the source of flawed responses is necessary for properly interpreting the internalization claim. This detail was omitted from the original §3 description. The flawed responses were obtained by prompting a fixed weaker model (a smaller model from the same family as the student) to generate solutions on the training problems and retaining those with incorrect answers. This choice allowed us to curate a diverse set of realistic errors efficiently. We acknowledge that an ablation contrasting self-generated rollouts with external flaws would further strengthen the distribution-shift analysis. However, conducting such an ablation would require substantial additional compute for student rollouts across the full training set, which was not feasible in the present study. We maintain that the core benefit arises from training the student to actively interpret and apply the critique signal to correct errors, which encourages transferable error-aware reasoning even when initial flaws originate externally. We will revise §3 to specify the source of flawed responses and add a brief discussion of this point, including it as a limitation for future investigation. This constitutes a partial revision. revision: partial
-
Referee: [§4] §4 (Experiments and results): The reported gains (7% average, +15.0% on AMC23, +12.2% on MATH-500) and ablations lack details on dataset splits, number of runs, statistical significance, or exact hyperparameter controls. Without these, the central empirical claim of consistent outperformance and the attribution to critique specificity remain only partially verifiable.
Authors: We agree that additional experimental details are required to make the results fully verifiable and reproducible. We will update §4 (and the associated appendix) to include: the precise dataset splits employed for each benchmark (training, validation, and test portions); results averaged over three independent runs with different random seeds, accompanied by standard deviations; statistical significance assessments (e.g., via paired t-tests) for the primary comparisons against baselines; and a complete enumeration of hyperparameters, including learning rates, batch sizes, training epochs, temperature settings for critique generation, and the exact teacher model configurations. These additions will directly support the claims of consistent outperformance and the role of critique specificity. This will be a full revision of the experimental reporting. revision: yes
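The reporting promised in the response above (mean and standard deviation over three seeds, plus a paired t-test against a baseline) can be sketched with the standard library alone. The function names are illustrative, and the scores below are made-up placeholders chosen for easy arithmetic; they are not results from the paper.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom, pairing runs by seed."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder accuracies for three seeds (NOT from the paper).
method = [52.0, 54.0, 53.0]
baseline = [50.0, 51.0, 52.0]

method_mean, method_std = summarize(method)
t_stat, dof = paired_t_statistic(method, baseline)
```

With only three seeds the test has two degrees of freedom and therefore little statistical power, so reporting the per-seed scores alongside the summary would make the comparison easier to judge.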
Circularity Check
No significant circularity; empirical results independent of definitional inputs
The paper defines Critique-Guided Distillation operationally as a training procedure that uses teacher critiques only at fine-tuning time to refine flawed responses, then validates the resulting performance gains through direct experiments on mathematical reasoning benchmarks (AMC23, MATH-500, AIME24/25) across five model families plus controlled ablations on critique specificity. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the claimed improvements to the method's own inputs by construction. The central claims rest on external benchmark comparisons and ablation controls rather than any self-referential derivation, satisfying the criteria for a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Teacher-provided critiques supply actionable signals for internalizing error-aware reasoning when used only as training supervision.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The student is fine-tuned on the augmented dataset $((x, y', c), \hat{y})$ using a standard language modeling objective. ... $\mathcal{L}(\theta) = \mathbb{E}_{(x, y', c, \hat{y})}\left[-\log S_\theta(\hat{y} \mid x, y', c)\right]$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.