pith. machine review for the scientific record.

arxiv: 2605.08741 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-distillation · reasoning harness · large language models · on-policy training · complex reasoning · standalone performance · mathematical reasoning · knowledge internalization

The pith

Harness-augmented self-distillation internalizes complex reasoning skills into LLMs for strong standalone performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Inference-time harnesses improve large language model reasoning on complex tasks but do not change the model's intrinsic abilities. The paper introduces on-policy harness self-distillation, in which the harness-augmented model generates teaching signals to train its own base version. This transfers the harness's extra supervisory information directly into the model's parameters. The outcome is improved independent performance on math reasoning and classification benchmarks, with no further gain from reattaching the harness later. Readers care because the work shows how temporary external aids can become permanent internal capabilities.

Core claim

On-Policy Harness Self-Distillation (OPHSD) employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond the training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated with a draft-verify harness for text classification and a plan-solve harness for mathematical reasoning, OPHSD consistently outperforms strong baselines. Analysis indicates that reattaching the harness during inference yields no additional benefit and can even degrade performance, suggesting that complex harnesses need not be permanent fixtures; they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model.
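The Figure 1 caption below describes a teacher that conditions on a privileged harness input z(x) and a student updated by reverse KL. As a hedged formalization under those assumptions (our notation, not an equation quoted from the paper):

```latex
% Plausible reading of the OPHSD objective; not the paper's exact formula.
% \pi_\theta(\cdot \mid x)        : the plain student (no harness)
% \pi_\theta(\cdot \mid x, z(x))  : the same weights queried with the harness's privileged input z(x)
% y_{1:T}                         : an on-policy rollout; \mathrm{sg} is a stop-gradient on the teacher view
\mathcal{L}_{\mathrm{OPHSD}}(\theta) =
  \mathbb{E}_{x \sim \mathcal{D},\; y_{1:T} \sim \pi_\theta(\cdot \mid x)}
  \left[ \sum_{t=1}^{T}
    D_{\mathrm{KL}}\!\left(
      \pi_\theta(\cdot \mid x, y_{<t})
      \,\middle\|\,
      \mathrm{sg}\!\left[ \pi_\theta(\cdot \mid x, z(x), y_{<t}) \right]
    \right)
  \right]
```

For simplicity the rollout here is drawn from the plain student; the figure caption suggests the rollout itself may interleave harness interaction, which this sketch does not capture.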

What carries the argument

On-Policy Harness Self-Distillation (OPHSD), the training loop that uses the current model plus its task-specific harness to produce distillation targets for an unaugmented copy of the same model.
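As a concrete, hedged illustration of that loop (a sketch only: the Hugging Face-style API calls, the harness_context_fn helper, and the hyperparameters are our assumptions, not the paper's released code):

```python
# Hedged sketch of one OPHSD-style training step, assuming a Hugging Face-style causal LM.
import torch
import torch.nn.functional as F

def ophsd_step(model, tokenizer, optimizer, question, harness_context_fn):
    device = next(model.parameters()).device

    # 1) On-policy rollout: the plain student answers without any harness.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        rollout = model.generate(prompt_ids, max_new_tokens=256, do_sample=True)
    response_ids = rollout[:, prompt_ids.shape[1]:]

    # 2) Teacher view: the same weights, but the prompt is augmented with the harness's
    #    privileged context z(x) (e.g. a plan, or exemplars retrieved from a memory bank).
    z = harness_context_fn(question)  # hypothetical helper returning a string
    teacher_prompt_ids = tokenizer(z + question, return_tensors="pt").input_ids.to(device)

    # 3) Score the same student-sampled tokens under both views.
    student_in = torch.cat([prompt_ids, response_ids], dim=1)
    teacher_in = torch.cat([teacher_prompt_ids, response_ids], dim=1)
    T = response_ids.shape[1]
    student_logits = model(student_in).logits[:, -T - 1:-1, :]  # positions predicting response tokens
    with torch.no_grad():  # the teacher is a frozen view of the same model
        teacher_logits = model(teacher_in).logits[:, -T - 1:-1, :]

    # 4) Reverse KL over the response tokens: KL(student || teacher).
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design choice this framing implies is that teacher and student share the same parameters; the teacher differs only in what it is allowed to see at each step.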

If this is right

  • The distilled model reaches higher accuracy on complex reasoning tasks when evaluated without any external harness.
  • Reattaching the harness at inference time provides no benefit and can reduce performance.
  • Harnesses function as temporary training scaffolds whose benefits remain in the base model after training ends.
  • The method applies across different harness designs and reasoning domains such as mathematical problem solving and text classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Inference-time compute can decrease because multi-step external workflows are no longer required after training.
  • The same self-distillation pattern could transfer other external tools or structured prompting strategies into model weights.
  • Iterative rounds of increasingly capable harnesses might bootstrap performance on tasks that currently resist internalization.

Load-bearing premise

The extra supervisory signals produced by the harness-augmented teacher can be internalized by the student model to produce robust standalone performance without the harness at inference.

What would settle it

A controlled experiment in which an OPHSD-trained model shows no accuracy gain over a standard baseline when both are tested without the harness, or in which reattaching the harness to the OPHSD model produces a clear performance increase.

Figures

Figures reproduced from arXiv: 2605.08741 by Lu Ma, Wentao Zhang, Zhengyang Zhao.

Figure 1
Figure 1: Overview of the OPHSD framework. Left: the student policy rolls out and interacts with the harness to generate trajectories, while the teacher generates supervisory signals based on the harness trajectories and updates the student by reverse KL. Top right: draft-verify harness for text classification, using an online memory bank as the privileged input z(x). Bottom right: plan-solve harness for ma… (A hedged sketch of the memory-bank privileged input appears after the figure list.)
Figure 2
Figure 2: Evaluation performance on online text classification across the training stage.
Figure 3
Figure 3: Evaluation of pass@8 performance across the training stage on four math benchmarks (%).
Figure 5
Figure 5: Average output length on math benchmarks. OPHSD ensures stability while reducing redundant reasoning. Online text classification: retrieval-conditioned reasoning is written into the weights. We use GPT-4o as an LLM-as-judge to detect whether the student's chain of thought generated without external retrieval spontaneously incorporates a case-citation reasoning step (detailed in Appendix A.4).
Figure 6
Figure 6: Performance gain across difficulty tiers. Samples are grouped into Hard, Medium, and …
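The Figure 1 caption above describes the draft-verify harness as drawing its privileged input z(x) from an online memory bank. A minimal sketch of such a component, assuming cosine-similarity retrieval over a generic embedding function and an exemplar format of our own choosing:

```python
# Hedged sketch of an online memory bank supplying the privileged input z(x).
# The embedding function, k, and prompt format are illustrative assumptions.
import numpy as np

class OnlineMemoryBank:
    def __init__(self, embed_fn, k=5):
        self.embed_fn = embed_fn              # e.g. any sentence-embedding model
        self.k = k
        self.embeddings = []                  # list of 1-D numpy vectors
        self.records = []                     # (text, label) pairs seen so far

    def add(self, text, label):
        self.embeddings.append(self.embed_fn(text))
        self.records.append((text, label))

    def privileged_context(self, query):
        """Return z(x): the k most similar past cases, formatted for the teacher prompt."""
        if not self.records:
            return ""
        q = self.embed_fn(query)
        sims = [
            float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
            for e in self.embeddings
        ]
        top = np.argsort(sims)[-self.k:][::-1]
        lines = [f"Past case: {self.records[i][0]} -> label: {self.records[i][1]}" for i in top]
        return "Similar past cases:\n" + "\n".join(lines) + "\n\n"
```

Only the teacher prompt would be built from privileged_context; the student is scored on the bare input, which is what makes the retrieved evidence a distillable extra signal rather than an inference-time dependency.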
original abstract

Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce On-Policy Harness Self-Distillation (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated across draft-verify harness for text classification and plan-solve for mathematical reasoning tasks, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces On-Policy Harness Self-Distillation (OPHSD), a self-distillation technique that uses a harness-augmented version of the current model as a teacher to transfer capabilities from inference-time harnesses (draft-verify for text classification and plan-solve for mathematical reasoning) into the base model. The central claim is that this internalization leads to robust standalone performance on complex reasoning tasks, outperforming baselines such as OPSD by up to 10.83% on HMMT25, with the added observation that reattaching the harness at inference time provides no benefit and may degrade results.

Significance. Should the internalization effect be confirmed, the work offers a promising way to use harnesses as temporary training aids rather than permanent inference components, potentially simplifying deployment of reasoning systems while maintaining or improving performance. The public availability of code and data is a positive factor for verification and extension.

major comments (2)
  1. [Abstract] Abstract and analysis section: the central internalization claim is supported only indirectly via performance gains over OPSD and the result that reattaching the harness yields no benefit or degradation; this does not isolate whether harness-specific logic (e.g., draft-verify or plan-solve steps) was absorbed versus general gains from extra supervisory signals or training compute, and no ablations or student-output analysis are described.
  2. [Experiments] Experiments/results: the reported +10.83% gain over OPSD and related comparisons lack explicit error bars, run counts, or data-split details in the summary, which are required to assess whether the outperformance reliably supports the robust generalizability claim.
minor comments (1)
  1. [Abstract] Clarify the precise meaning of 'on-policy' in OPHSD, as the term is standard in RL but here appears to denote using the current model as its own teacher.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the internalization claim and the experimental reporting.

point-by-point responses
  1. Referee: [Abstract] Abstract and analysis section: the central internalization claim is supported only indirectly via performance gains over OPSD and the result that reattaching the harness yields no benefit or degradation; this does not isolate whether harness-specific logic (e.g., draft-verify or plan-solve steps) was absorbed versus general gains from extra supervisory signals or training compute, and no ablations or student-output analysis are described.

    Authors: We agree that the current support for internalization is indirect. The gains over OPSD isolate the value of harness-derived signals beyond standard distillation, while the reattachment result indicates the model no longer benefits from (and can be harmed by) the external workflow. To more directly demonstrate absorption of harness-specific logic rather than generic training effects, we will add ablations (e.g., equivalent extra compute without harness structure) and student-output analysis in the revised analysis section. revision: yes

  2. Referee: [Experiments] Experiments/results: the reported +10.83% gain over OPSD and related comparisons lack explicit error bars, run counts, or data-split details in the summary, which are required to assess whether the outperformance reliably supports the robust generalizability claim.

    Authors: We concur that error bars, run counts, and data-split details are required for assessing reliability. The revised manuscript will report all main results with standard deviations over multiple random seeds, explicitly state the number of runs, and detail the train/validation/test splits used for each task. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training procedure with independent benchmark evaluation

full rationale

The paper defines OPHSD as a self-distillation procedure that uses a harness-augmented teacher to provide extra supervisory signals, then reports empirical gains on standard tasks (draft-verify for classification, plan-solve for math) against baselines like OPSD. No mathematical derivations, equations, or predictions appear that reduce by construction to fitted parameters or self-definitions. No self-citations are load-bearing for uniqueness theorems or ansatzes. The internalization interpretation rests on performance comparisons and the observation that reattaching the harness yields no benefit, which are falsifiable experimental outcomes rather than tautological reductions. The method and claims are self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; standard LLM training assumptions are implicit but unspecified.

axioms (1)
  • domain assumption Standard assumptions in LLM fine-tuning and knowledge distillation hold for the proposed self-distillation process.
    Implicit in the description of using the harness-augmented teacher to train the student.

pith-pipeline@v0.9.0 · 5514 in / 1088 out tokens · 48808 ms · 2026-05-12T03:46:16.184494+00:00 · methodology

