Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

Kai Han; Wenye Lin

arxiv: 2603.01683 · v2 · pith:5V5C5Y2Znew · submitted 2026-03-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 12:37 UTC · model grok-4.3

keywords: post-training, knowledge retention, on-policy distillation, catastrophic forgetting, reasoning, LLMs, math reasoning, distillation

The pith

KL-constrained rewards let LLMs absorb new reasoning skills from minimally edited data without losing prior knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that recent doubts about KL-divergence were misplaced and that the KL-constrained reward formulation is actually central to stopping catastrophic forgetting when post-training large language models on reasoning tasks. It introduces Surgical Post-Training (SPOT), a framework that first uses an oracle to make minimal targeted corrections to erroneous reasoning steps in the model's own outputs, then trains with a reward-based binary cross-entropy loss under the KL constraint. With only 4k such rectified math pairs the method raises Qwen3-8B accuracy by 6.2 percent on average across both in-domain and out-of-domain benchmarks while finishing training in 16 minutes on eight H800 GPUs. A reader should care because the approach also supplies a stronger starting checkpoint for later reinforcement learning stages.

Core claim

The authors establish that the KL-constrained reward formulation plays a critical role in retaining knowledge during post-training for reasoning. This insight leads to the SPOT framework, which consists of an Oracle-driven data rectification pipeline that surgically corrects erroneous reasoning steps via minimal edits to generate proximal on-policy data, paired with a reward-based binary cross-entropy objective that simultaneously improves reasoning performance and mitigates forgetting.

What carries the argument

The proximal on-policy distillation framework, built on an Oracle rectification pipeline that applies minimal edits to model-generated reasoning steps and a KL-constrained reward-based binary cross-entropy objective.
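
As a rough illustration of the objective named above, the sketch below scores a response with a DPO-style implicit reward and applies a binary cross-entropy loss to it. The paper's exact loss is not reproduced on this page, so the reward definition `beta * (logp_policy - logp_ref)`, the default `beta = 0.1`, and every function name here are assumptions made for illustration only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Both arguments are summed log-probabilities of one full response.
    """
    return beta * (logp_policy - logp_ref)


def reward_bce_loss(reward: float, label: int) -> float:
    """Binary cross-entropy on sigmoid(reward).

    label 1 = rectified (correct) response, label 0 = original erroneous response.
    -log(sigmoid(r)) equals softplus(-r); -log(1 - sigmoid(r)) equals softplus(r).
    """
    return softplus(-reward) if label == 1 else softplus(reward)


def pull_weight(reward: float, label: int) -> float:
    """Magnitude of d(loss)/d(reward): 1 - sigmoid(r) for label 1, sigmoid(r) for label 0."""
    p = sigmoid(reward)
    return 1.0 - p if label == 1 else p
```

Under these assumptions, the weight on a positive sample is `1 - sigmoid(reward)`, so the update shrinks as the policy's reward for that sample grows. That is one plausible reading of the tethering effect mentioned in the Figure 1 caption, not a confirmed description of the authors' implementation.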

If this is right

SPOT with 4k rectified math pairs raises average accuracy by 6.2 percent on both in-domain and out-of-domain tasks.
The resulting checkpoint serves as a stronger initialization that elevates the performance ceiling of subsequent reinforcement learning.
The full training run completes in 16 minutes on 8x H800 GPUs.
Knowledge retention during reasoning post-training is achieved through the combination of proximal on-policy data and the KL-constrained reward objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rectification-plus-KL approach might transfer to other skill-injection settings such as code generation or scientific reasoning.
If the minimal-edit property holds at scale, the method could reduce the data volume needed for safe specialization of frontier models.
The result invites re-examination of why KL terms appeared ineffective in prior distillation studies that used different data-generation pipelines.

Load-bearing premise

The oracle rectification step produces corrected data that stays close to the original model outputs and contains no new systematic errors or biases.

What would settle it

The claim would be undermined if models trained with the rectified 4k pairs showed no accuracy gain, or even a drop, on out-of-domain tasks relative to the base model, or if inspection revealed that the minimal edits introduce new consistent mistakes absent from the original generations.
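
The first half of that test can be sketched as a simple comparison of per-benchmark accuracies before and after training. The function names, the use of a mean over out-of-domain tasks, and the zero threshold are all assumptions chosen for illustration; the page does not say how such a check would be scored.

```python
from typing import Dict, Iterable


def accuracy_deltas(base: Dict[str, float], trained: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark accuracy change (trained minus base) on shared benchmarks."""
    return {name: trained[name] - base[name] for name in base if name in trained}


def ood_gain_is_absent(
    base: Dict[str, float],
    trained: Dict[str, float],
    ood_tasks: Iterable[str],
) -> bool:
    """True when the mean out-of-domain change is zero or negative (assumed criterion)."""
    deltas = accuracy_deltas(base, trained)
    ood = [deltas[task] for task in ood_tasks if task in deltas]
    if not ood:
        raise ValueError("no out-of-domain benchmarks to compare")
    return sum(ood) / len(ood) <= 0.0
```

Any accuracies passed in would come from the reader's own evaluation runs; none are supplied here.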

Figures

Figures reproduced from arXiv: 2603.01683 by Kai Han, Wenye Lin.

**Figure 1.** Illustration of SPOT. Left: Our framework utilizes an Oracle to apply surgical rectifications to erroneous reasoning steps, generating positive samples that remain proximal to the model’s original distribution. Top-Right: Unlike the relative ranking used in DPO, we leverage an explicit classification loss that proves more effective for reasoning tasks. Bottom-Right: The tethering effect inherent in our rew…

**Figure 2.** IFEval accuracy results (avg@5). SFT+ forgets instruction-following ability, while reward-based methods do not.

**Figure 3.** Training loss curve. Reward-SFT and DPO converge rapidly to near-zero as the policy satisfies the relative margin constraint. SFT+ remains high as it attempts to maximize absolute likelihood, driving continuous parameter updates.

**Figure 4.** Evolution of implicit rewards during training. Left: reward scores for chosen responses; Right: reward scores for rejected responses. The reward shows how much the model’s preference for the selected response has increased compared to its initial stage.

**Figure 5.** Distributions of change ratio. A higher change ratio indicates that reasoning failures occur earlier in the reasoning chain. The distribution shape is determined by the model ability and the data difficulty together.

**Figure 6.** Random Rectification Example 1. Left: answer from Qwen3-8B; Right: rectification by Gemini 2.5 Pro.

**Figure 7.** Random Rectification Example 2. Left: answer from Qwen3-8B; Right: rectification by Gemini 2.5 Pro.

**Figure 8.** Random Rectification Example 3. Left: answer from Qwen3-8B; Right: rectification by Gemini 2.5 Pro.

**Figure 9.** Random Rectification Example 4. Left: answer from Qwen3-8B; Right: rectification by Gemini 2.5 Pro.

Original abstract

Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL-divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation actually plays a critical role in retaining knowledge during post-training. This motivates our Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks, requiring merely 16-minute model training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly elevating the performance ceiling. Code: https://github.com/Visual-AI/SPoT
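
For readers unfamiliar with the term, the following is the standard KL-constrained reward derivation from the RLHF and DPO literature. It is given as background under the assumption that the paper's "KL-constrained reward formulation" refers to this form; the symbols $\beta$, $\pi_{\mathrm{ref}}$, and $Z(x)$ are the conventional ones and do not appear on this page.

```latex
% KL-constrained reward maximization for a policy \pi_\theta and reference \pi_{ref}
\max_{\pi_\theta} \;
  \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)} \bigl[ r(x, y) \bigr]
  \;-\; \beta \, \mathrm{KL}\!\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% Its closed-form optimum, with partition function Z(x)
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
  \exp\!\Bigl( \tfrac{1}{\beta} \, r(x, y) \Bigr)

% Solving for the reward
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;+\; \beta \log Z(x)

% Implicit reward obtained by substituting the trainable policy for \pi^{*}
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

Because $r_\theta$ is measured relative to $\pi_{\mathrm{ref}}$, any loss built on it is tied to the reference model. How the paper treats the $\beta \log Z(x)$ term in a pointwise (non-pairwise) loss is not shown here.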

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Surgical Post-Training (SPOT), a proximal on-policy distillation method for injecting reasoning knowledge into LLMs while mitigating catastrophic forgetting. It consists of an Oracle-based data rectification pipeline that performs minimal edits to erroneous reasoning steps to produce proximal on-policy data, combined with a reward-based binary cross-entropy objective that incorporates a KL constraint. The authors claim both analytical and empirical support for the KL term's role in knowledge retention, reporting that SPOT with 4k rectified math pairs improves Qwen3-8B accuracy by 6.2% on average across in- and out-of-domain tasks, requires only 16 minutes of training on 8x H800 GPUs, and yields a superior initialization for subsequent RL.

Significance. If the rectification pipeline reliably produces data that remains close to the base model's output distribution without introducing systematic biases, the result would offer a practical, data-efficient approach to reasoning post-training with explicit retention guarantees. The reported gains from a small curated set, the public code release, and the downstream RL benefit are concrete strengths that could influence efficient fine-tuning pipelines. However, the central claims depend on unverified properties of the Oracle edits, so the significance is currently provisional.

major comments (2)

The abstract and introduction assert that the Oracle pipeline 'surgically correct[s] erroneous steps via minimal edits' to generate proximal on-policy data, yet no quantitative controls (edit distance, token-level divergence, or error-introduction rate) are reported for the 4k math pairs. This property is load-bearing for the claim that observed retention stems from the KL-constrained objective rather than higher-quality supervision; without such metrics the empirical 6.2% gain cannot be unambiguously attributed to the proposed formulation.
The analytical demonstration that the KL-constrained reward aids retention is described as internal to the paper's formulation. If the derivation relies on assumptions introduced to match observed forgetting behavior (as suggested by the circularity concern), it risks being self-referential; a concrete external test or comparison against standard KL-regularized objectives on held-out retention metrics would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: The abstract and introduction assert that the Oracle pipeline 'surgically correct[s] erroneous steps via minimal edits' to generate proximal on-policy data, yet no quantitative controls (edit distance, token-level divergence, or error-introduction rate) are reported for the 4k math pairs. This property is load-bearing for the claim that observed retention stems from the KL-constrained objective rather than higher-quality supervision; without such metrics the empirical 6.2% gain cannot be unambiguously attributed to the proposed formulation.

Authors: We agree that the absence of quantitative metrics on the Oracle rectification pipeline leaves an important gap in supporting the claim of proximity. In the revised manuscript we will add a dedicated subsection (or appendix) reporting average Levenshtein edit distance, token-level KL divergence between the base model outputs and the rectified trajectories, and the observed rate of newly introduced errors across the 4k math pairs. These statistics will be computed on the exact data used for the reported experiments and will be used to quantify how close the rectified data remains to the original on-policy distribution.
revision: yes

Referee: The analytical demonstration that the KL-constrained reward aids retention is described as internal to the paper's formulation. If the derivation relies on assumptions introduced to match observed forgetting behavior (as suggested by the circularity concern), it risks being self-referential; a concrete external test or comparison against standard KL-regularized objectives on held-out retention metrics would strengthen the claim.

Authors: The derivation in Section 4 follows directly from the properties of the binary cross-entropy reward combined with the KL penalty and does not rely on post-hoc fitting to observed forgetting curves. Nevertheless, we acknowledge the value of an external empirical check. In the revision we will include a new set of experiments that compare SPOT against a standard KL-regularized supervised fine-tuning baseline on held-out retention tasks (e.g., previously learned non-math capabilities). Retention will be measured by accuracy on those tasks before and after post-training, providing a direct side-by-side evaluation that goes beyond the analytical argument.
revision: yes
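
The Levenshtein statistic promised in the first response could be computed along the following lines. This is a minimal sketch that assumes whitespace tokenization and normalization by the longer sequence; the authors' tokenizer and normalization are not stated, and the function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two token sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(original: str, rectified: str) -> float:
    """Token-level edit distance divided by the longer length (0 = identical)."""
    a, b = original.split(), rectified.split()  # assumed whitespace tokenization
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Averaging `normalized_edit_distance` over the original and rectified pairs would give one proximity number. It measures surface similarity only and says nothing about whether an edit introduced a new error.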

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external benchmarks

The paper's central claims rest on an analytical demonstration of the KL-constrained reward's role in retention plus empirical gains from the rectification pipeline and binary cross-entropy objective. No equations, fitted parameters, or self-citations are exhibited that reduce the reported 6.2% accuracy improvement or the 'proximal on-policy' property to inputs by construction. The rectification pipeline is presented as an independent mechanism whose outputs are then used in training; the analytical part is described as internal but does not collapse into a tautology or renamed fit within the provided text. This is the normal case of a self-contained empirical framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that an external Oracle can produce high-quality proximal corrections and that the KL term in the reward is the decisive factor for retention; no explicit free parameters or new physical entities are introduced beyond standard LLM training hyperparameters.

axioms (1)

domain assumption: The KL-constrained reward formulation is critical for knowledge retention during post-training.
Stated as shown analytically in the abstract; details of the derivation are not visible from the provided text.

invented entities (1)

Oracle for surgical step correction (no independent evidence)
purpose: Generate proximal on-policy data by minimal edits to erroneous reasoning steps
New component introduced in the data rectification pipeline; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5739 in / 1414 out tokens · 47743 ms · 2026-05-21T12:37:56.031130+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the KL-constrained reward formulation actually plays a critical role in retaining knowledge... Elastic Tether... λ=1−σ(r_θ)"

IndisputableMonolith/Foundation/Atomicity.lean · atomic_tick · unclear
Paper passage: "data rectification pipeline... surgically correct erroneous steps via minimal edits... RLCS filtering"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
