arxiv: 2605.12741 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 1 theorem link

· Lean Theorem

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

Yuwei Zhang , Sha Li , Changlong Yu , Qin Lu , Shuowei Jin , Chengyu Dong , Haoran Liu , Ilgee Hong

show 4 more authors

Xintong Li Zhenyu Shi Bing Yin Jingbo Shang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-distillationreflectionfailure feedbacklarge language modelscontinual learningpost-trainingreinforcement learningGRPO

0 comments

The pith

Reflection-Enhanced Self-Distillation lets models learn from failure feedback by creating diagnostic reflections and a reusable global playbook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RESD to help large language models improve from environmental interactions even when successful outcomes are rare. Instead of relying on successful demonstrations like standard self-distillation, RESD has the model generate retrospective reflections on failed trajectories to diagnose errors and maintains a persistent global playbook of lessons. This enriched context allows the self-teacher to offer token-level supervision without any successful rollouts. Evaluations show it outperforms baselines and improves faster than GRPO with far fewer samples. This matters because it could make continual learning more practical in settings where good results are infrequent.

Core claim

RESD transforms raw failure feedback into an active source of corrective supervision by interpreting failed trajectories through retrospective reflections that diagnose local errors and by curating a persistent global playbook that preserves reusable lessons across training steps, thereby enabling actionable token-level supervision even in the absence of successful rollouts.

What carries the argument

Retrospective reflections that diagnose local errors in failed trajectories, combined with a curated persistent global playbook that stores reusable lessons for cross-step supervision.

If this is right

RESD substantially outperforms standard self-distillation baselines on continual learning tasks.
RESD achieves significantly faster early-stage improvement than GRPO while using only a single rollout per prompt instead of 8 times as many samples.
The enriched context from reflections and playbook allows learning without waiting for successful demonstrations.
Token-level supervision becomes available from failure data alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RESD could lower the number of interactions needed in reinforcement learning setups for LLMs.
If the playbook truly captures reusable lessons, it might transfer across different tasks or environments.
Removing the reflection step or the playbook would likely reduce the method to standard self-distillation performance.
Future work could test whether the reflections remain accurate as the model improves over many steps.

Load-bearing premise

The model-generated retrospective reflections correctly identify local errors and the global playbook stores reusable lessons without adding noise or causing errors to compound during training.

What would settle it

Compare performance when using randomly generated or incorrect reflections instead of model-generated ones, or train without the playbook component to see if the efficiency gains over GRPO disappear.

Figures

Figures reproduced from arXiv: 2605.12741 by Bing Yin, Changlong Yu, Chengyu Dong, Haoran Liu, Ilgee Hong, Jingbo Shang, Qin Lu, Sha Li, Shuowei Jin, Xintong Li, Yuwei Zhang, Zhenyu Shi.

**Figure 1.** Figure 1: RESD improves interaction efficiency during training. The x-axis is the number of samples. A fundamental challenge in the post-training of Large Language Models (LLMs) is enabling continuous improvement through environmental interactions. Traditionally, Reinforcement Learning (RL) algorithms, such as PPO or GRPO, have been the standard paradigm for aligning models with desired outcomes. However, a crit… view at source ↗

**Figure 2.** Figure 2: Overview of the RESD framework. The student generates a rollout and receives environment feedback. On failure, a local self-reflection diagnoses the error and a global playbook curation step distills reusable lessons into a persistent playbook. The enriched context (reflection, curated playbook, and any cached solutions) is fed to the teacher prompt, whose output distribution provides token-level supervisi… view at source ↗

**Figure 3.** Figure 3: Per-task accuracy on FINER under varying rollout batch sizes. SDPO (N=1) degrades without peer demonstrations, while adding reflection and playbook curation (SDPO+Ref, N=1) recovers and surpasses the N=8 baseline. The degradation observed in the N=1 setting suggests that environment feedback alone is not sufficient to support effective self-distillation. This is somewhat surprising, as the teacher is s… view at source ↗

**Figure 4.** Figure 4: Per-prompt mean accuracy distribution on FINER over training steps. All-wrong cases’ proportion decreases better for SDPO+Ref. Building on the above observation, we introduce Reflection-Enhanced Self-Distillation (RESD), a framework that operationalizes active feedback understanding within the self-distillation loop. RESD maintains two forms of persistent context: a playbook Pt, following the broader ide… view at source ↗

**Figure 5.** Figure 5: (a)(b) Token-level distillation loss for training step 20. Each token is shaded from white [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Validation performance across training steps for [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Validation accuracy over training steps for the runs reported in Table 2. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Per-step training latency for BOUNCINGSIM-EASY. While RESD significantly improves interaction efficiency by requiring only a single rollout per prompt, the generation of retrospective reflections and playbook curation introduces additional inference steps. To understand the latency implications of these components, we analyze the per-step training latency of RESD compared to SDPO and GRPO baselines. As ill… view at source ↗

read the original abstract

Enabling Large Language Models (LLMs) to continuously improve from environmental interactions is a central challenge in post-training. While on-policy self-distillation offers a promising paradigm, existing methods predominantly treat environmental feedback as a passive conditioning signal. Consequently, they heavily rely on successful demonstrations and struggle to learn in rare-success regimes. To bridge this gap, we introduce Reflection-Enhanced Self-Distillation (RESD), a framework that transforms raw failure feedback into an active source of corrective supervision. Instead of passively appending feedback, RESD interprets failed trajectories by generating retrospective reflections to diagnose local errors, and curates a persistent global playbook to preserve reusable lessons across training steps. The enriched context enables the self-teacher to provide actionable token-level supervision even in the absence of successful rollouts. Empirical evaluations on multiple continual learning tasks demonstrate that RESD substantially outperforms standard self-distillation baselines. Furthermore, RESD achieves significantly faster early-stage improvement than GRPO with $8\times$ samples using only a single rollout per prompt, highlighting its superior interaction efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RESD adds model-generated reflections on failures plus a persistent playbook to self-distillation so it can supervise from unsuccessful rollouts, but the accuracy of those reflections is not directly measured.

read the letter

The main takeaway is that this paper targets the practical problem of self-distillation when positive examples are scarce. Instead of waiting for successes, RESD has the model write retrospective reflections on failed trajectories to spot local errors, then folds those into a running global playbook that supplies token-level supervision across steps. The result is claimed to beat standard self-distillation baselines and to reach faster early gains than GRPO while using only one rollout per prompt instead of eight times as many samples. That efficiency angle is the part worth paying attention to for anyone running continual learning loops on LLMs. What is actually new is the concrete pairing of reflection generation with a persistent, curated playbook on top of the usual self-distillation setup; prior work cited in the abstract does not reduce to this exact mechanism for rare-success regimes. The approach is straightforward and directly attacks interaction cost, which is a real constraint in post-training. The soft spot is exactly where the stress-test note flags it: the reflections are produced by the same base model that is still failing, so there is no built-in guarantee they diagnose errors correctly rather than adding plausible noise. The paper does not appear to isolate reflection fidelity with separate checks or ablations, nor does it track whether the playbook drifts or compounds mistakes over many steps. Without those controls, it is hard to know how much of the reported improvement comes from the new components versus other training details. This is for readers already working on LLM post-training, on-policy distillation, or agent continual learning who want a concrete tweak to try. It is not a broad theoretical shift, but the problem it names is common enough that the method could be useful if the mechanism holds. I would send it to peer review because the core idea is testable and the efficiency claim is the sort of result referees can pressure-test with targeted experiments.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Reflection-Enhanced Self-Distillation (RESD), a framework that enables LLMs to improve from rare-success interactions by generating retrospective reflections on failed trajectories to diagnose local errors and curating a persistent global playbook to aggregate reusable lessons. This enriched context supplies token-level supervision for self-distillation even without successful rollouts. The paper claims that RESD substantially outperforms standard self-distillation baselines on multiple continual learning tasks and achieves significantly faster early-stage improvement than GRPO while using only a single rollout per prompt versus 8× samples.

Significance. If the empirical results hold after verification of the core assumptions, RESD would offer a practical advance for post-training LLMs in sparse-reward interactive settings by converting failure feedback into active corrective signals. The approach improves interaction efficiency and addresses a recognized limitation of passive conditioning in on-policy self-distillation. Strengths include the explicit handling of rare-success regimes and the focus on reusable lesson preservation across steps.

major comments (3)

[§4 (Experiments)] §4 (Experiments): The central claims of substantial outperformance and faster early-stage improvement lack reported metrics (e.g., exact accuracy deltas, dataset sizes, number of runs, and p-values). Without these and without ablations that isolate the contribution of reflection fidelity versus playbook curation, the empirical superiority cannot be verified as load-bearing for the method.
[§3.2 (Reflection and Playbook)] §3.2 (Reflection and Playbook): The method assumes model-generated reflections accurately diagnose errors and that the playbook curation avoids noise accumulation. No direct evaluation (human judgment, oracle comparison, or consistency checks over training steps) is provided to test this assumption, which is the least secure link in rare-success regimes and directly affects whether performance gains are attributable to the proposed components.
[§4.3 (GRPO Comparison)] §4.3 (GRPO Comparison): The efficiency claim (single rollout vs. GRPO with 8× samples) requires explicit confirmation that the rollout budget, prompt distribution, and evaluation protocol are matched; otherwise the interaction-efficiency advantage is not directly comparable and undermines the cross-method conclusion.

minor comments (2)

[Figure 1] Figure 1: The framework diagram would be clearer with explicit arrows and labels distinguishing the reflection-generation step from the playbook-update step.
[§3] Notation: The distinction between local reflection tokens and global playbook entries should be defined once in §3 and used consistently to avoid ambiguity in later equations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support and verifiability of our claims without altering the core contributions.

read point-by-point responses

Referee: [§4 (Experiments)] §4 (Experiments): The central claims of substantial outperformance and faster early-stage improvement lack reported metrics (e.g., exact accuracy deltas, dataset sizes, number of runs, and p-values). Without these and without ablations that isolate the contribution of reflection fidelity versus playbook curation, the empirical superiority cannot be verified as load-bearing for the method.

Authors: We agree that these details are necessary for full verification. In the revised manuscript, we will expand §4 to report exact accuracy deltas with standard deviations across tasks, specify dataset sizes (e.g., 5k prompts per continual learning task), confirm that all experiments used 3 independent random seeds, and include p-values from paired t-tests against baselines. We will also add targeted ablations in a new subsection that isolate reflection fidelity (by comparing model-generated reflections to oracle or random variants) versus playbook curation (by ablating the persistent global playbook while keeping reflections). These revisions will directly address the load-bearing nature of each component. revision: yes
Referee: [§3.2 (Reflection and Playbook)] §3.2 (Reflection and Playbook): The method assumes model-generated reflections accurately diagnose errors and that the playbook curation avoids noise accumulation. No direct evaluation (human judgment, oracle comparison, or consistency checks over training steps) is provided to test this assumption, which is the least secure link in rare-success regimes and directly affects whether performance gains are attributable to the proposed components.

Authors: This assumption is indeed central, and we acknowledge the value of direct evaluation. While indirect support comes from the performance gains in ablations and qualitative trajectory examples already in the appendix, we will add a new analysis in the revision: human judgment ratings on 200 sampled reflections for error-diagnosis accuracy (with inter-annotator agreement), oracle comparisons on a held-out set where possible, and consistency checks tracking playbook entry reuse and noise impact via ablation on downstream task performance over training steps. This will be presented in §3.2 and the appendix to substantiate the claims. revision: yes
Referee: [§4.3 (GRPO Comparison)] §4.3 (GRPO Comparison): The efficiency claim (single rollout vs. GRPO with 8× samples) requires explicit confirmation that the rollout budget, prompt distribution, and evaluation protocol are matched; otherwise the interaction-efficiency advantage is not directly comparable and undermines the cross-method conclusion.

Authors: We confirm that the experiments used identical prompt distributions and evaluation protocols as stated in §4.3. To make the interaction-efficiency comparison fully transparent, we will revise §4.3 to include an explicit table and paragraph detailing total rollout budgets (RESD: 1 rollout + reflection tokens per prompt; GRPO: 8 samples per prompt), normalized interaction counts, and per-interaction improvement curves. This clarification will show that the reported faster early-stage gains hold under matched conditions while preserving the distinction in supervision density. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

full rationale

The paper defines RESD via new procedural components (retrospective reflections on failed trajectories and curation of a persistent global playbook) that are introduced as explicit additions to standard self-distillation. These components are then evaluated empirically against external baselines (standard self-distillation and GRPO) rather than being defined in terms of the target performance metrics or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked that reduce the central claims to inputs by construction. The derivation remains self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Central claim rests on the unverified capacity of the base LLM to produce diagnostically useful reflections and on the assumption that a curated playbook remains stable and beneficial over multiple training steps; no free parameters or invented physical entities are stated.

axioms (2)

domain assumption LLMs can generate accurate retrospective reflections that diagnose local errors in failed trajectories
Invoked when the method transforms raw failure feedback into corrective supervision
domain assumption A persistent global playbook can preserve reusable lessons across training steps without introducing compounding noise
Required for the enriched context to remain beneficial rather than harmful

invented entities (1)

Reflection-Enhanced Self-Distillation (RESD) framework no independent evidence
purpose: To convert failure feedback into actionable token-level supervision
New procedural construct introduced by the paper

pith-pipeline@v0.9.0 · 5516 in / 1390 out tokens · 30605 ms · 2026-05-14T21:20:07.083787+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 13 internal anchors

[1]

Gkd: Generalized knowledge distillation for auto- regressive sequence models,

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. Gkd: Generalized knowledge distillation for auto-regressive sequence models.arXiv preprint arXiv:2306.13649, 12, 2023

work page arXiv 2023
[2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[3]

Retaining by doing: The role of on-policy data in mitigating forgetting, 2025

Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025

work page 2025
[4]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

work page internal anchor Pith review arXiv 2024
[5]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

work page 2026
[6]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?arXiv preprint arXiv:2603.24472, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Sequence-level knowledge distillation

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327, 2016

work page 2016
[9]

Distillm: Towards streamlined distillation for large language models.arXiv preprint arXiv:2402.03898, 2024

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models.arXiv preprint arXiv:2402.03898, 2024

work page arXiv 2024
[10]

Reinforcement fine-tuning naturally mitigates forgetting in continual post-training, 2026

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training, 2026

work page 2026
[11]

Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

work page arXiv 2026
[12]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation. 10

work page 2025
[13]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

work page 2023
[14]

arXiv preprint arXiv:2602.04942 , year =

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page arXiv 2026
[15]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Self-distillation enables continual learning, 2026

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning, 2026

work page 2026
[18]

Rl’s razor: Why online reinforcement learning forgets less, 2025

Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less, 2025

work page 2025
[19]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023
[20]

Rl grokking recipe: How does rl unlock and transfer new algorithms in llms?, sep 2025

Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, and Dawn Song. Rl grokking recipe: How does rl unlock and transfer new algorithms in llms?, sep 2025

work page 2025
[21]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Self-instruct: Aligning language models with self-generated instruc- tions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

work page 2023
[23]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

work page arXiv 2024
[25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Answer- ing questions by meta-reasoning over multiple chains of thought

Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answer- ing questions by meta-reasoning over multiple chains of thought. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5942–5966, 2023

work page 2023
[29]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Ka- manuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Accept if the tape contains BRBR

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Denny Zhou, Swaroop Mishra, Huaixiu S Zheng, et al. Self-discover: Large language models self-compose reasoning structures.Advances in Neural Information Processing Systems, 37:126032–126058, 2024. A Discussion on the Self-Distillation Objective As established in Section 2, our unifi...

work page 2024