arxiv: 2505.11628 · v4 · pith:ILAOHRUFnew · submitted 2025-05-16 · 💻 cs.CL · cs.LG

Critique-Guided Distillation for Robust Reasoning via Refinement

Pith reviewed 2026-05-22 14:14 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords critique-guided distillation · mathematical reasoning · language model fine-tuning · distillation · reasoning robustness · critique-based training · instruction following

The pith

Critique-Guided Distillation trains models to refine flawed answers using teacher critiques only at training time, yielding stronger mathematical reasoning without inference overhead or capability loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard fine-tuning makes models copy correct outputs without learning how to reason through errors, while training them to produce critiques directly often harms their general abilities. Critique-Guided Distillation instead has the student improve incorrect responses when supplied with teacher critiques, using that feedback solely as a temporary training signal. The critiques disappear at inference, so the model must have internalized better error detection and correction. Experiments across five model families demonstrate consistent gains on math benchmarks and preservation of instruction-following skills where competing approaches degrade.

Core claim

Critique-Guided Distillation decouples critique consumption from critique generation by training the student to refine flawed responses conditioned on teacher critiques during fine-tuning. Critiques function as a training-time-only supervision signal that encourages internalization of error-aware reasoning. Across five model families this produces 7 percent average gains over Critique Fine-Tuning and standard distillation on mathematical reasoning benchmarks, with peaks of +15.0 percent on AMC23 and +12.2 percent on MATH-500, plus higher Pass@1 on AIME24 and AIME25 and no loss in general instruction following.

What carries the argument

Critique-Guided Distillation, the training procedure in which a student model refines flawed responses when conditioned on teacher critiques that are supplied only during fine-tuning and withheld at inference.
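The asymmetry between training and inference inputs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `CGDExample` type, the two builder functions, and the prompt template wording are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CGDExample:
    """One training record (hypothetical schema, not taken from the paper)."""
    prompt: str
    initial_answer: str
    critique: str
    refined_answer: str


def build_training_pair(ex: CGDExample) -> tuple[str, str]:
    """Training: the input carries prompt, initial answer, and critique;
    the target is the teacher's refined answer."""
    source = (
        f"Question:\n{ex.prompt}\n\n"
        f"Initial answer:\n{ex.initial_answer}\n\n"
        f"Critique:\n{ex.critique}\n\n"
        "Refined answer:\n"
    )
    return source, ex.refined_answer


def build_inference_input(prompt: str) -> str:
    """Inference: only the prompt is supplied; no initial answer, no critique."""
    return f"Question:\n{prompt}\n\nRefined answer:\n"
```

In a setup like this, the loss would typically be computed only on the target tokens, so the student learns to produce the refined answer rather than to reproduce the critique; whether the paper masks the input this way is an assumption.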

If this is right

  • Models achieve 7 percent average gains on mathematical reasoning benchmarks over both standard distillation and Critique Fine-Tuning.
  • Larger lifts appear on hard competition sets, reaching +15.0 percent on AMC23 and +12.2 percent on MATH-500.
  • Pass@1 rises and low-k performance improves on AIME24 and AIME25 problems.
  • General instruction-following stays intact, avoiding the 21.3 percent drop observed with Critique Fine-Tuning on IFEval.
  • No extra computation or architectural changes are required at inference time.
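For readers unfamiliar with the Pass@k metric mentioned above, here is a small sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The summary does not say which estimator the paper uses, so treating it as this one is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n drawn samples were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the fraction of correct samples, which is why Pass@1 is often read as per-sample reasoning quality.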

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of training-time critique use from inference-time generation could transfer to code generation or multi-step scientific reasoning tasks.
  • The method's success hinges on critique quality, implying that stronger teacher models or self-critique loops could amplify the gains.
  • After initial CGD training, models might be further improved by occasionally re-introducing light critique signals without full retraining.

Load-bearing premise

The teacher critiques must stay specific and relevant enough throughout training to create genuine internalization of error-aware reasoning rather than superficial pattern matching.

What would settle it

Replacing the teacher critiques with random or generic text during training and measuring whether the performance advantage over standard distillation disappears would directly test whether the method depends on critique quality.
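One way to build the control datasets for such a test is sketched below, assuming each training record is a dict with a `"critique"` key (a hypothetical schema). The first helper pairs every example with a critique written for a different example, which is one reading of "random" text; the second substitutes a single generic sentence. Both function names and the placeholder sentence are illustrative.

```python
import random


def mismatch_critiques(examples: list[dict], seed: int = 0) -> list[dict]:
    """Return copies of the examples in which each critique comes from a
    different example, so the feedback is fluent but irrelevant."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to mismatch critiques")
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    out = [dict(ex) for ex in examples]  # shallow copies; inputs stay untouched
    for pos, idx in enumerate(order):
        # Walk the shuffled order as a cycle: the donor is always another index.
        donor = order[(pos + 1) % len(order)]
        out[idx]["critique"] = examples[donor]["critique"]
    return out


def generic_critiques(
    examples: list[dict],
    text: str = "The answer contains an error. Please fix it.",
) -> list[dict]:
    """Return copies of the examples with every critique replaced by one generic sentence."""
    return [{**ex, "critique": text} for ex in examples]
```

Training on each variant with everything else held fixed, then comparing against standard distillation, would isolate the contribution of critique content from the contribution of the refinement format.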

Figures

Figures reproduced from arXiv: 2505.11628 by Berkcan Kapusuzoglu, Chia-Hsuan Lee, Sambit Sahu, Supriyo Chakraborty, Zain Sarwar.

Figure 1: Comparing supervised fine-tuning (SFT), Critique Fine-Tuning (CFT), and Critique-Guided Distillation (CGD). Unlike CFT, which trains the student to generate critiques, CGD conditions training on both the initial answer and critique but at test time generates the final answer directly in a single pass. By conditioning answer generation on the critique, CGD avoids format drift (the model continues to generate…
Figure 2: Performance comparison of CGD, using a LLaMA3.3-70B Instruct teacher model to generate critiques and refined answers, with 100K samples from WebInstruct (Yue et al., 2024). The LLaMA3.1-8B Instruct student model is trained using the input prompt, initial answer, and the critique as input, and the refined answer as the target. Baselines include Distilled SFT, which uses only the input prompt as input to…
Figure 3: Overview of Critique-Guided Distillation (CGD). During training, the student produces an initial response, the teacher supplies a critique and refined answer, and the student is fine-tuned to map from (prompt, initial answer, critique) → refined answer. At inference, however, only the prompt is provided, and the student directly outputs the refined answer in…
Figure 4: Performance comparison of CGD with and without the critique as input during training, evaluated on eight benchmarks. The critique provides a crucial learning signal, leading to consistent accuracy improvements across both the LLaMA3.1-8B Instruct (a) and S1.1-3B (b) student models.
Figure 5: Performance on an average of math benchmarks for models trained on different mixtures of correct/incorrect student answers. A balanced 50/50 mixture yields the most robust model.
Figure 6: Accuracy over training epochs for CGD on six math-focused benchmarks. While trends are modest, performance remains stable throughout, indicating resistance to overfitting and catastrophic forgetting.
Figure 7: Accuracy vs. learning rate for (a) CGD (our method) and (b) the CFT baseline across six benchmarks.
Figure 8: Training loss comparison between CGD and CFT. The x-axis indicates normalized training progress, and the y-axis shows loss.
Figure 9: Average attention flow of the CGD model. All layers shown begin with a "planning" step, focusing on the Critique (48.1%) and Student Answer (36.0%). The final layer (bottom right) then pivots sharply to an "execution" phase, focusing on the Problem (> 90%), while the first layer (top left) continues to process the Critique. Shaded regions represent the 95% confidence interval over 50 samples.
Figure 10
Figure 11: Aggregated heatmap of attention by generation phase for representative layers. The bright cells for the Critique in the first column for Layers 16 and 31 (48.1% and 45.5%) confirm that the initial planning phase is critique-driven, acting on the signal internalized by the early layers. The sustained brightness for the Critique in the Layer 0 heatmap illustrates its role in early-stage processing.
Original abstract

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (−21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from generation: the student model is fine-tuned to refine flawed responses conditioned on teacher critiques (used only at training time), with the goal of internalizing error-aware reasoning for mathematical tasks. It claims consistent outperformance over Critique Fine-Tuning (CFT) and standard distillation across five model families on benchmarks including AMC23 (+15.0%), MATH-500 (+12.2%), AIME24/25 (higher Pass@1 and low-k performance), yielding ~7% average gains, while avoiding the capability degradation seen in CFT (e.g., -21.3% on IFEval). Controlled ablations are said to link gains directly to critique specificity and relevance.

Significance. If the results hold under proper controls, CGD offers a practical, inference-efficient intermediate paradigm for improving reasoning robustness without output-format drift or general capability loss. The empirical scale across model families and the ablation evidence tying gains to feedback quality are strengths; however, the significance is tempered by the need to confirm that training-time error patterns transfer to the student's own inference-time generations.

major comments (2)
  1. [§3] §3 (Method description): The source of flawed responses used in training (e.g., whether generated by the student model itself via rollouts, a fixed weaker model, or synthetic corruption) is not specified. This is load-bearing for the internalization claim, because a distribution mismatch between training flaws and the student's own inference-time errors could reduce the method to supervised correction of foreign error types rather than acquisition of self-aware reasoning. An ablation comparing self-generated vs. external flawed responses is needed to address the distribution-shift concern.
  2. [§4] §4 (Experiments and results): The reported gains (7% average, +15.0% on AMC23, +12.2% on MATH-500) and ablations lack details on dataset splits, number of runs, statistical significance, or exact hyperparameter controls. Without these, the central empirical claim of consistent outperformance and the attribution to critique specificity remain only partially verifiable.
minor comments (2)
  1. [Abstract] Abstract: The phrasing that gains are 'directly driven by the specificity and relevance' of feedback should include a forward reference to the specific ablation table or figure for precision.
  2. [§4] Notation: Ensure consistent use of 'Pass@1' and 'Pass@k' across text and tables; minor inconsistencies in benchmark naming (e.g., AIME24 vs. AIME25) could be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions we will make.

Point-by-point responses
  1. Referee: [§3] §3 (Method description): The source of flawed responses used in training (e.g., whether generated by the student model itself via rollouts, a fixed weaker model, or synthetic corruption) is not specified. This is load-bearing for the internalization claim, because a distribution mismatch between training flaws and the student's own inference-time errors could reduce the method to supervised correction of foreign error types rather than acquisition of self-aware reasoning. An ablation comparing self-generated vs. external flawed responses is needed to address the distribution-shift concern.

    Authors: We appreciate the referee highlighting this important aspect of the method. We agree that explicitly stating the source of flawed responses is necessary for properly interpreting the internalization claim. This detail was omitted from the original §3 description. The flawed responses were obtained by prompting a fixed weaker model (a smaller model from the same family as the student) to generate solutions on the training problems and retaining those with incorrect answers. This choice allowed us to curate a diverse set of realistic errors efficiently. We acknowledge that an ablation contrasting self-generated rollouts with external flaws would further strengthen the distribution-shift analysis. However, conducting such an ablation would require substantial additional compute for student rollouts across the full training set, which was not feasible in the present study. We maintain that the core benefit arises from training the student to actively interpret and apply the critique signal to correct errors, which encourages transferable error-aware reasoning even when initial flaws originate externally. We will revise §3 to specify the source of flawed responses and add a brief discussion of this point, including it as a limitation for future investigation. This constitutes a partial revision. revision: partial

  2. Referee: [§4] §4 (Experiments and results): The reported gains (7% average, +15.0% on AMC23, +12.2% on MATH-500) and ablations lack details on dataset splits, number of runs, statistical significance, or exact hyperparameter controls. Without these, the central empirical claim of consistent outperformance and the attribution to critique specificity remain only partially verifiable.

    Authors: We agree that additional experimental details are required to make the results fully verifiable and reproducible. We will update §4 (and the associated appendix) to include: the precise dataset splits employed for each benchmark (training, validation, and test portions); results averaged over three independent runs with different random seeds, accompanied by standard deviations; statistical significance assessments (e.g., via paired t-tests) for the primary comparisons against baselines; and a complete enumeration of hyperparameters, including learning rates, batch sizes, training epochs, temperature settings for critique generation, and the exact teacher model configurations. These additions will directly support the claims of consistent outperformance and the role of critique specificity. This will be a full revision of the experimental reporting. revision: yes
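The reporting protocol described in the second response (mean and standard deviation over seeds, plus a paired t-test) can be sketched with the standard library alone. This is an illustration under assumptions: pairing scores by seed is one possible design that the rebuttal does not specify, and the function names are hypothetical. The sketch returns the t statistic and degrees of freedom; turning these into a p-value requires a Student's t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of one method's scores across seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(
    method_scores: list[float], baseline_scores: list[float]
) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs
    (assumed here to be matched by random seed)."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if len(method_scores) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    n = len(diffs)
    return mean(diffs) / (spread / sqrt(n)), n - 1
```

With only three runs the test has two degrees of freedom, so even large accuracy gaps need very consistent differences across seeds to reach conventional significance thresholds.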

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of definitional inputs

full rationale

The paper defines Critique-Guided Distillation operationally as a training procedure that uses teacher critiques only at fine-tuning time to refine flawed responses, then validates the resulting performance gains through direct experiments on mathematical reasoning benchmarks (AMC23, MATH-500, AIME24/25) across five model families plus controlled ablations on critique specificity. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the claimed improvements to the method's own inputs by construction. The central claims rest on external benchmark comparisons and ablation controls rather than any self-referential derivation, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard assumptions from the LLM fine-tuning literature. It introduces no free parameters or invented entities and relies on a single domain assumption, listed below.

axioms (1)
  • domain assumption: Teacher-provided critiques supply actionable signals for internalizing error-aware reasoning when used only as training supervision.
    Invoked as the mechanism driving gains in the abstract description of CGD.

pith-pipeline@v0.9.0 · 5810 in / 1338 out tokens · 40367 ms · 2026-05-22T14:14:27.840505+00:00 · methodology
