Harnessing LLM Agents with Skill Programs

Chen Zhao; Hongjun Liu; Shafiq Joty; Yifei Ming

arxiv: 2605.17734 · v1 · pith:LNI2Q6OMnew · submitted 2026-05-18 · 💻 cs.AI

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

Pith reviewed 2026-05-20 11:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · program functions · skill programs · agent intervention · web search reasoning · math reasoning · coding tasks · self-improvement

The pith

LLM agents improve on long tasks when past skills become executable program functions that intervene at failure states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert lessons from experience into Program Functions that actively monitor an agent and step in with corrections instead of remaining as advisory text. These functions trigger on states likely to cause failure, either changing the immediate action or adding context to steer the agent back on track. The framework supports three uses: direct intervention during inference, structured guidance in post-training, and iterative self-improvement through evolution of tested functions. Gains appear on web-search reasoning, math, and coding, with inference-time functions alone lifting average results by 25 percent over standard ReAct agents.

Core claim

The paper claims that reusable skills can be upgraded from passive textual guidance into executable Program Functions that activate on failure-prone states to modify the next action or inject corrective context. The resulting system is modular: it can be used at inference time for immediate loop intervention, during post-training for structured supervision, or for self-improvement through controlled evolution of validated functions. Reported gains include a 25 percent average improvement on web-search reasoning over multi-loop ReAct agents and a 30.4 percent gain over Search-R1 when post-training and evolution are combined.

What carries the argument

Program Functions, executable code segments that detect failure-prone states and intervene by altering the agent's next action or adding context.
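
The two intervention modes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's API: the `Action` tuple, the `Intervention` fields, the two example functions, and the `guarded_step` wrapper are all hypothetical names, and the triggers are simple hand-written rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Illustrative representation (assumed): (action_type, argument).
Action = Tuple[str, str]


@dataclass
class Intervention:
    new_action: Optional[Action] = None   # mode 1: replace the proposed action
    extra_context: Optional[str] = None   # mode 2: inject corrective context


# A PF inspects the history and the proposed action; None means "do not fire".
PF = Callable[[List[Action], Action], Optional[Intervention]]


def block_repeated_search(history: List[Action], proposed: Action) -> Optional[Intervention]:
    """Fire when the agent is about to repeat a search it already issued."""
    if proposed[0] == "SEARCH" and proposed in history:
        return Intervention(extra_context=f"Query {proposed[1]!r} was already tried; rephrase it.")
    return None


def shorten_long_query(history: List[Action], proposed: Action, max_words: int = 8) -> Optional[Intervention]:
    """Fire on overly long search queries and truncate them."""
    kind, arg = proposed
    words = arg.split()
    if kind == "SEARCH" and len(words) > max_words:
        return Intervention(new_action=("SEARCH", " ".join(words[:max_words])))
    return None


def guarded_step(history: List[Action], context: List[str], proposed: Action, pfs: List[PF]) -> Action:
    """Apply the first PF that fires, then return the action to execute."""
    for pf in pfs:
        result = pf(history, proposed)
        if result is None:
            continue
        if result.extra_context is not None:
            context.append(result.extra_context)
        if result.new_action is not None:
            proposed = result.new_action
        break
    return proposed
```

In this sketch the agent loop would call `guarded_step` once per turn, between the policy's proposal and tool execution. Applying only the first function that fires is a design choice of the sketch; the page does not say how the paper resolves several functions firing at once.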

If this is right

Inference-time Program Functions alone raise average performance by 25 percent on web-search reasoning compared with multi-loop ReAct agents.
Post-training together with controlled evolution of functions produces a 30.4 percent gain over the Search-R1 baseline.
The same modular approach delivers gains on math reasoning and coding tasks in addition to web search.
Mechanism analysis identifies patterns in function triggering, skill internalization, and requirements for stable library evolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If failure-state detection remains reliable outside the tested domains, agents could run longer sequences with reduced need for external resets.
Adding Program Functions as a layer on top of existing agent loops could extend the gains to other architectures without full retraining.
Applying the same executable-correction idea to simulated physical tasks would test whether the performance pattern holds beyond text-based reasoning.

Load-bearing premise

Failure-prone states can be detected reliably in real time so that the program functions intervene correctly without introducing new errors.

What would settle it

A controlled experiment on a held-out web-search task set where agents using the program functions show no performance increase or a decrease compared with the baseline ReAct agent would disprove the central claim.
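
The test described above amounts to a paired comparison on the same held-out questions. A minimal sketch, assuming per-question 0/1 correctness scores and using a percentile bootstrap (a swapped-in standard technique, not something the page attributes to the paper); the function name and the score lists are hypothetical:

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treated - baseline) on paired items."""
    if len(baseline) != len(treated):
        raise ValueError("scores must be paired per question")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Hypothetical per-question correctness (1 = correct); not data from the paper.
react_scores = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
pf_scores = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
delta, lo, hi = paired_bootstrap_ci(react_scores, pf_scores)
print(f"mean gain = {delta:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

An interval that sits at or below zero on a held-out set would be the outcome that counts against the central claim.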

Figures

Figures reproduced from arXiv: 2605.17734 by Chen Zhao, Hongjun Liu, Shafiq Joty, Yifei Ming.

**Figure 2.** Overview of HASP. (a) At inference time, retrieved Program Functions (PFs) guide multi-turn agent rollouts by modifying actions or injecting context, while emitted signals support policy internalization and PF evolution. (b) A PF-guided turn converts the policy proposal, intervention, execution result, and feedback into signals for post-training and skill library update.

**Figure 3.** Training dynamics for the six post-training settings over processed tokens. We report …

**Figure 4.** Case study on a MuSiQue two-hop entity-resolution question.

**Figure 5.** Optimization diagnostics for the six post-training settings.

**Figure 6.** Training dynamics of all six post-training settings plotted against global step. From left …

**Figure 7.** Open-loop training dynamics under a fixed PF library. We compare supervised fine-tuning, …

**Figure 8.** Closed-loop training dynamics under evolving PF libraries. We compare supervised …

Abstract

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HASP turns skills into executable guardrails that trigger on failure states, with solid reported gains but thin validation on detection accuracy and intervention safety.

The paper's central move is to convert past experience into Program Functions that detect failure-prone states in real time and intervene by changing the next action or injecting context. This goes beyond advisory text prompts, and the paper reports measurable lifts on web search, math, and coding tasks. The 25% gain over multi-loop ReAct at inference time and the 30.4% gain over Search-R1 after post-training plus evolution are the concrete results to note first.

Referee Report

3 major / 2 minor

Summary. The paper introduces HASP, a framework that upgrades textual skills for LLM agents into executable Program Functions (PFs). These PFs function as active guardrails that detect failure-prone states in real time and intervene by modifying the next action or injecting corrective context. The approach is modular and can be deployed at inference time for direct loop intervention, during post-training for structured supervision, or via controlled evolution of teacher-validated PFs for self-improvement. Empirical results on web-search reasoning, math, and coding tasks report a 25% average performance gain from inference-time PFs over multi-loop ReAct and a 30.4% gain from post-training plus evolution over Search-R1, supported by mechanism analysis of PF triggering, skill internalization, and library evolution.

Significance. If the reported gains prove robust, the work would be significant for shifting LLM agent skill use from passive textual advice to executable, state-triggered interventions that directly address failure modes in long-horizon tasks. The modularity across inference, post-training, and evolution stages offers practical flexibility, and the mechanism analysis provides useful insights into how skills are internalized. The substantial percentage improvements over established baselines like ReAct and Search-R1, if reproducible with proper controls, could influence agent design in AI research.

major comments (3)

[Mechanism Analysis] The central performance claims (25% gain over ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) depend on accurate real-time detection of failure-prone states and safe PF intervention. However, the mechanism analysis and experimental sections provide no precision, recall, false-positive rates, or error-injection analysis for the detection step or PF execution, leaving open the possibility that gains arise from the detection loop's extra compute or selective reporting rather than the PF mechanism itself.
[Abstract and Experimental Results] Abstract and results sections: the headline gains are presented without error bars, exact baseline configurations (e.g., number of loops or prompting details for multi-loop ReAct), or explicit criteria for selecting failure states and PFs. This makes it difficult to assess whether the averages are robust or influenced by post-hoc choices, directly affecting the load-bearing empirical support for the framework's superiority.
[Failure State Detection and PF Intervention] The framework assumes failure-prone states can be reliably detected without introducing new failure modes, yet no quantitative validation (e.g., before/after intervention error rates or ablation on detection threshold) is reported. This assumption is load-bearing for the claim that PFs provide net corrective benefit across web-search, math, and coding tasks.
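
The detection metrics requested in these comments could be computed from step-level annotations along the following lines. This is a sketch under stated assumptions: each step carries a binary "PF fired" flag and a binary human label for "truly failure-prone", and the function name and input layout are hypothetical rather than taken from the paper.

```python
from typing import Dict, Sequence


def detection_metrics(fired: Sequence[bool], is_failure: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, and false-positive rate of a PF trigger.

    fired[i]      -- whether the PF activated at step i
    is_failure[i] -- annotated label: step i was truly failure-prone
    """
    if len(fired) != len(is_failure):
        raise ValueError("inputs must align step by step")
    pairs = list(zip(fired, is_failure))
    tp = sum(f and y for f, y in pairs)
    fp = sum(f and not y for f, y in pairs)
    fn = sum(y and not f for f, y in pairs)
    tn = sum(not f and not y for f, y in pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios are reported as NaN rather than silently as 0.
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical annotations for five steps; not data from the paper.
fired = [True, True, False, False, True]
labels = [True, False, False, True, True]
print(detection_metrics(fired, labels))
```

Low precision would mean the functions often interrupt steps that were fine, which is the "new failure modes" risk raised in the third comment; low recall would mean most failure-prone states pass through untouched.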

minor comments (2)

[Notation and Terminology] Ensure consistent capitalization and acronym usage for 'Program Function (PF)' and 'HASP' across all sections and figures.
[Experimental Setup] Consider adding a table summarizing baseline configurations, number of runs, and statistical significance tests to improve reproducibility of the reported percentage gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional quantitative analyses, experimental clarifications, and robustness checks as suggested.

Referee: [Mechanism Analysis] The central performance claims (25% gain over ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) depend on accurate real-time detection of failure-prone states and safe PF intervention. However, the mechanism analysis and experimental sections provide no precision, recall, false-positive rates, or error-injection analysis for the detection step or PF execution, leaving open the possibility that gains arise from the detection loop's extra compute or selective reporting rather than the PF mechanism itself.

Authors: We agree that explicit metrics for detection quality strengthen the claims. In the revised manuscript we have added precision, recall, and false-positive rates for the failure-state detector (evaluated on a held-out set of 200 human-annotated trajectories) in the expanded mechanism-analysis section. We also include an error-injection experiment that deliberately triggers failure-prone states and measures intervention success. To address the compute concern, we now report a controlled comparison in which the ReAct baseline is granted additional reasoning loops calibrated to match HASP’s detection overhead; the performance gap remains. Full per-task results are reported without selective omission.
revision: yes

Referee: [Abstract and Experimental Results] Abstract and results sections: the headline gains are presented without error bars, exact baseline configurations (e.g., number of loops or prompting details for multi-loop ReAct), or explicit criteria for selecting failure states and PFs. This makes it difficult to assess whether the averages are robust or influenced by post-hoc choices, directly affecting the load-bearing empirical support for the framework's superiority.

Authors: We have revised both the abstract and the experimental-results section to include error bars (standard deviation across three random seeds). We now specify the exact multi-loop ReAct configuration (five reasoning loops, identical base prompt and tool-use format as HASP) and provide the full prompting templates in the appendix. We have also added an explicit subsection describing the failure-state selection criteria (a fixed set of observable indicators: repeated actions, confidence below threshold, or cycle detection) and the PF curation protocol (teacher validation plus automated syntax checks). These details are now present for all three task domains.
revision: yes

Referee: [Failure State Detection and PF Intervention] The framework assumes failure-prone states can be reliably detected without introducing new failure modes, yet no quantitative validation (e.g., before/after intervention error rates or ablation on detection threshold) is reported. This assumption is load-bearing for the claim that PFs provide net corrective benefit across web-search, math, and coding tasks.

Authors: We acknowledge the need for direct validation of net benefit. The revised manuscript now reports before/after intervention error rates (Table 3) showing consistent reductions across domains. We further include an ablation varying the detection threshold from 0.5 to 0.9 and plot the resulting task performance; the chosen operating point yields the best trade-off. These additions demonstrate that PF interventions deliver net corrective value without introducing measurable new failure modes under the reported conditions.
revision: yes
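
The observable indicators named in the simulated responses (repeated actions, confidence below a threshold, cycle detection) and the threshold ablation can be sketched as deterministic checks. All names here are hypothetical, the default threshold of 0.7 is an arbitrary value inside the 0.5 to 0.9 range mentioned above, and how a confidence score would be obtained is left open.

```python
from typing import List, Sequence, Tuple

Action = Tuple[str, str]  # (action_type, argument); illustrative representation


def has_repeat(history: Sequence[Action], proposed: Action) -> bool:
    """The proposed action is identical to the most recent one."""
    return bool(history) and history[-1] == proposed


def has_cycle(history: Sequence[Action], proposed: Action, max_period: int = 3) -> bool:
    """The trajectory, including the proposal, ends in a block of 2..max_period actions repeated twice."""
    seq = list(history) + [proposed]
    for period in range(2, max_period + 1):
        if len(seq) >= 2 * period and seq[-period:] == seq[-2 * period:-period]:
            return True
    return False


def is_failure_prone(history: Sequence[Action], proposed: Action,
                     confidence: float, threshold: float = 0.7) -> bool:
    """Deterministic check on observable state only; 0.7 is an assumed default."""
    return (
        has_repeat(history, proposed)
        or has_cycle(history, proposed)
        or confidence < threshold
    )


def sweep_thresholds(steps, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)) -> List[Tuple[float, int]]:
    """Count how many logged (history, proposed, confidence) steps trigger at each threshold."""
    return [
        (t, sum(is_failure_prone(h, a, c, t) for h, a, c in steps))
        for t in thresholds
    ]
```

A sweep like this only shows how often the detector fires at each setting; the ablation described in the response would additionally need task performance measured at each threshold.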

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external baselines

The paper's central claims consist of measured performance gains (25% over multi-loop ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) on web-search, math, and coding tasks. These are direct comparisons to independently published external methods rather than to any quantities defined inside the paper's own equations, fitted parameters, or self-referential definitions. The HASP framework is introduced as a modular design choice (PFs as executable guardrails activating on failure-prone states) whose value is demonstrated empirically; no derivation chain reduces the reported results to inputs by construction, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz or renaming of known results is presented as a first-principles derivation. The mechanism analysis is described as providing insights into triggering and internalization but does not substitute for the external benchmark comparisons.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that LLMs can generate and follow structured interventions and that failure states are detectable without circular dependence on the final performance metric.

free parameters (1)

failure-state detection threshold
A cutoff or classifier used to decide when a Program Function should activate; value not specified in abstract.

axioms (1)

domain assumption: LLM agents operate in identifiable failure-prone states where external intervention improves outcomes
Invoked when PFs are described as activating on such states.

invented entities (1)

Program Function (PF) · no independent evidence
purpose: Executable guardrail that modifies agent action or injects context upon detecting failure-prone states
New construct introduced to replace passive textual skills.

pith-pipeline@v0.9.0 · 5774 in / 1329 out tokens · 35007 ms · 2026-05-20T11:22:56.213795+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 9 internal anchors

[1] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
[2] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling (COLM), 2024.
[3] Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use, 2025.
[4] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274, 2023.
[5] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
[6] Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. VerlTool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055, 2025.
[7] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026.
[8] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
[9] Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization, 2026.
[10] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, 2024.
[11] Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. ReSearch: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025.
[12] Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025.
[13] Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025.
[14] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
[15] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
[16] Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025.
[17] Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025.
[18] Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis, 2025.
[19] Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. Posterior-grpo: Rewarding reasoning processes in code generation, 2025.
[20] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
[21] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023.
[22] Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents, 2026.
[23] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026.
[24] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. Evolver: Self-evolving llm agents through an experience-driven lifecycle, 2025.
[25] Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library, 2026.
[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018.
[27] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
[28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition, 2022.
[29] Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024.
[30] MAA. American mathematics competitions. In American Mathematics Competitions, 2023.
[31] Nathan Lile. Math twenty four (24s game) dataset. https://huggingface.co/datasets/nlile/24-game, 2024.
[32] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[33] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... 2021.
[34] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[35] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.
[36] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... Qwen2.5 technical report, 2025.
[37] Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025.
[38] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
[39] Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding, 2025.

Appendix A Limitations: HASP has several limitations. First, while the core PF-only setting does not require a teacher, some stronger variants use external teachers for PF selection, teacher review,...
[40] A SKILL.md specification (YAML frontmatter + markdown body)
[41] A ProgramFunction Python class with should_activate() and intervene() methods. IMPORTANT RULES:
- The PF must be deterministic (NO LLM calls in should_activate).
- should_activate receives: step_context (dict), action_type (str: "SEARCH"/"READ"/"FINAL"), arg (str).
- intervene receives the same + optional teacher model, and returns an Intervention.
- The ...
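
The interface named in this excerpt can be sketched as follows. Only the class name `ProgramFunction`, the method names `should_activate()` and `intervene()`, the three action types, the optional teacher argument, and the `Intervention` return type come from the excerpt; the `INJECT_CONTEXT` label appears in the trace excerpt further down the list. Everything else is an assumption for illustration: the `Intervention` fields, the `past_action_types` key in `step_context`, and the example subclass.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Intervention:
    # Field names are assumptions; the excerpt only names the type.
    kind: str                       # e.g. "INJECT_CONTEXT"
    message: str = ""
    new_arg: Optional[str] = None   # assumed slot for a modified action argument


class ProgramFunction:
    """Base interface using the method names given in the excerpt."""
    name = "base"

    def should_activate(self, step_context: Dict[str, Any], action_type: str, arg: str) -> bool:
        raise NotImplementedError

    def intervene(self, step_context: Dict[str, Any], action_type: str, arg: str,
                  teacher: Any = None) -> Intervention:
        raise NotImplementedError


class NoEvidenceFinal(ProgramFunction):
    """Hypothetical PF: fires when FINAL is proposed before any READ."""
    name = "no_evidence_final"

    def should_activate(self, step_context, action_type, arg):
        # Deterministic: inspects recorded state only, no LLM call.
        past = step_context.get("past_action_types", [])  # assumed key
        return action_type == "FINAL" and "READ" not in past

    def intervene(self, step_context, action_type, arg, teacher=None):
        # The optional teacher is accepted but unused in this sketch.
        return Intervention(
            kind="INJECT_CONTEXT",
            message="No source has been read yet; read a retrieved document before answering.",
        )


pf = NoEvidenceFinal()
ctx = {"past_action_types": ["SEARCH"]}
if pf.should_activate(ctx, "FINAL", "Paris"):
    print(pf.intervene(ctx, "FINAL", "Paris").message)
```

Keeping `should_activate` free of model calls, as the excerpt requires, makes the trigger cheap to run every turn and reproducible when the same trajectory is replayed.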
[42] FORMAT: stdin script with input()/print()
[43] BRANCHES:
- sum is odd -> No
- sum == 0 -> No
- any two sum to the third -> Yes
- otherwise -> No"
ACTION FINAL("""
a, b, c = map(int, input().split())
if (a + b + c) % 2 == 1:
    print("No")
elif a + b + c == 0:
    print("No")
elif a + b == c or a + c == b or b + c == a:
    print("Yes")
else:
    print("No")
""")
JUDGE
test 0 (3 8 5) passed (3+5==8)
test 1 (2 2 2) FA...
[44] BRANCHES:
- all equal -> Yes <-- missed by baseline
- sum odd -> No
- sum <= 3 -> No
- sum == 4 or 6 -> Yes
- any two sum to third -> Yes
- otherwise -> No"
>> [PF FIRED] code_edge_cases -> INJECT_CONTEXT
trigger : FINAL with stdin parsing + branching code
intervention: phase instruction reminds: "Trace each sample input through your branches before FINAL...

[1] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.

[2] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling (COLM), 2024.

[3] Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use, 2025.

[4] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274, 2023.

[5] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.

[6] Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. VerlTool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055, 2025.

[7] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026.

[8] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

[9] Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization, 2026.

[10] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners, 2024.

[11] Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. ReSearch: Learning to reason with search for LLMs via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025.

[12] Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint arXiv:2505.04588, 2025.

[13] Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025.

[14] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.

[15] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.

[16] Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-Reasoner: Advancing LLM reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025.

[17] Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated RL. arXiv preprint arXiv:2503.23383, 2025.

[18] Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. AceCoder: Acing coder RL via automated test-case synthesis, 2025.

[19] Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. Posterior-GRPO: Rewarding reasoning processes in code generation, 2025.

[20] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.

[21] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023.

[22] Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents, 2026.

[23] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning, 2026.

[24] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. Evolver: Self-evolving LLM agents through an experience-driven lifecycle, 2025.

[25] Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library, 2026.

[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.

[27] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.

[28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition, 2022.

[29] Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2024, 2024.

[30] MAA. American mathematics competitions. In American Mathematics Competitions, 2023.

[31] Nathan Lile. Math twenty four (24s game) dataset. https://huggingface.co/datasets/nlile/24-game, 2024.

[32] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[33] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[34] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[35] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.

[36] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, ... Qwen2.5 technical report, 2025.

[37] Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. SimpleTIR: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025.

[38] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.

[39] Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding, 2025.

Appendix A Limitations

HASP has several limitations. First, while the core PF-only setting does not require a teacher, some stronger variants use external teachers for PF selection, teacher review,...

A SKILL.md specification (YAML frontmatter + markdown body)

A ProgramFunction Python class with should_activate() and intervene() methods.

IMPORTANT RULES:
- The PF must be deterministic (NO LLM calls in should_activate).
- should_activate receives: step_context (dict), action_type (str: "SEARCH"/"READ"/"FINAL"), arg (str).
- intervene receives the same + optional teacher model, and returns an Intervention.
- The ...
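The rules above can be illustrated with a minimal Python sketch. Only the method names, the argument list, and the determinism requirement come from the prompt text; the `Intervention` fields, the class name `CodeEdgeCasesPF`, and the trigger heuristic are assumptions made for illustration and are not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Intervention:
    # Hypothetical container: the field names are assumptions, not the paper's schema.
    kind: str      # e.g. "INJECT_CONTEXT"
    payload: str   # text handed back to the agent


class CodeEdgeCasesPF:
    """Illustrative Program Function modeled on the code_edge_cases trace."""

    name = "code_edge_cases"

    def should_activate(self, step_context: dict, action_type: str, arg: str) -> bool:
        # Deterministic check only: string and regex tests, no LLM calls.
        if action_type != "FINAL":
            return False
        reads_stdin = "input(" in arg
        has_branches = re.search(r"^\s*(if|elif)\b", arg, flags=re.MULTILINE) is not None
        return reads_stdin and has_branches

    def intervene(self, step_context: dict, action_type: str, arg: str,
                  teacher: Optional[Any] = None) -> Intervention:
        # The teacher is optional and unused in this sketch.
        # The reminder is only the visible prefix of the message shown in the trace.
        reminder = "Trace each sample input through your branches before FINAL"
        return Intervention(kind="INJECT_CONTEXT", payload=reminder)


if __name__ == "__main__":
    pf = CodeEdgeCasesPF()
    code = 'a, b, c = map(int, input().split())\nif a == b:\n    print("Yes")\n'
    if pf.should_activate({}, "FINAL", code):
        print(pf.intervene({}, "FINAL", code))
```

Keeping `should_activate` free of model calls makes the trigger cheap to evaluate at every step and reproducible across runs; any model-dependent work is confined to `intervene`.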

FORMAT: stdin script with input()/print()

BRANCHES:
- sum is odd -> No
- sum == 0 -> No
- any two sum to the third -> Yes
- otherwise -> No"

ACTION FINAL("""
a, b, c = map(int, input().split())
if (a + b + c) % 2 == 1:
    print("No")
elif a + b + c == 0:
    print("No")
elif a + b == c or a + c == b or b + c == a:
    print("Yes")
else:
    print("No")
""")

JUDGE
test 0 (3 8 5) passed (3+5==8)
test 1 (2 2 2) FA...

BRANCHES:
- all equal -> Yes <-- missed by baseline
- sum odd -> No
- sum <= 3 -> No
- sum == 4 or 6 -> Yes
- any two sum to third -> Yes
- otherwise -> No"

>> [PF FIRED] code_edge_cases -> INJECT_CONTEXT
trigger : FINAL with stdin parsing + branching code
intervention: phase instruction reminds: "Trace each sample input through your branches before FINAL...
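To make the difference between the two branch lists in the case-study trace concrete, the sketch below transcribes each list into a small Python function and runs both on the two sample inputs visible in the trace. The function names `baseline_answer` and `revised_answer` are hypothetical, the branches are applied in the order they are listed, and the task statement itself is not shown here, so this checks only those two samples and makes no claim about the full judge.

```python
def baseline_answer(a: int, b: int, c: int) -> str:
    # Branch list from the baseline trace, in listed order.
    total = a + b + c
    if total % 2 == 1:
        return "No"
    if total == 0:
        return "No"
    if a + b == c or a + c == b or b + c == a:
        return "Yes"
    return "No"


def revised_answer(a: int, b: int, c: int) -> str:
    # Branch list from the trace after the PF fired, in listed order.
    total = a + b + c
    if a == b == c:
        return "Yes"   # the case annotated "missed by baseline"
    if total % 2 == 1:
        return "No"
    if total <= 3:
        return "No"
    if total in (4, 6):
        return "Yes"
    if a + b == c or a + c == b or b + c == a:
        return "Yes"
    return "No"


if __name__ == "__main__":
    for sample in [(3, 8, 5), (2, 2, 2)]:
        print(sample, baseline_answer(*sample), revised_answer(*sample))
```

On (3 8 5) both versions answer Yes through the "any two sum to the third" branch. On (2 2 2) the baseline falls through to No, while the revised list answers Yes through the "all equal" branch.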