Harnessing LLM Agents with Skill Programs
Pith reviewed 2026-05-20 11:22 UTC · model grok-4.3
The pith
LLM agents improve on long tasks when past skills become executable program functions that intervene at failure states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that reusable skills can be upgraded from passive textual guidance into executable Program Functions that activate on failure-prone states to modify the next action or inject corrective context. The resulting system is modular: it can be used at inference time for immediate loop intervention, during post-training for structured supervision, or for self-improvement through controlled evolution of validated functions. Reported gains include a 25 percent average improvement on web-search reasoning over multi-loop ReAct agents and a 30.4 percent gain over Search-R1 when post-training and evolution are combined.
What carries the argument
Program Functions, executable code segments that detect failure-prone states and intervene by altering the agent's next action or adding context.
If this is right
- Inference-time Program Functions alone raise average performance by 25 percent on web-search reasoning compared with multi-loop ReAct agents.
- Post-training together with controlled evolution of functions produces a 30.4 percent gain over the Search-R1 baseline.
- The same modular approach delivers gains on math reasoning and coding tasks in addition to web search.
- Mechanism analysis identifies patterns in function triggering, skill internalization, and requirements for stable library evolution.
Where Pith is reading between the lines
- If failure-state detection remains reliable outside the tested domains, agents could run longer sequences with reduced need for external resets.
- Adding Program Functions as a layer on top of existing agent loops could extend the gains to other architectures without full retraining.
- Applying the same executable-correction idea to simulated physical tasks would test whether the performance pattern holds beyond text-based reasoning.
Load-bearing premise
Failure-prone states can be detected reliably in real time so that the program functions intervene correctly without introducing new errors.
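To make this premise concrete, here is a minimal sketch of a rule-based detector built on the three observable indicators the authors name in their rebuttal (repeated actions, confidence below a threshold, cycle detection). The `Step` type, its field names, the 0.5 default threshold, and the `max_period` parameter are assumptions made for illustration; the paper's actual detector is not shown on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical record of one agent step; field names are assumptions.
    action_type: str
    arg: str
    confidence: Optional[float] = None


def _key(step: Step):
    return (step.action_type, step.arg)


def is_repeated(history: List[Step]) -> bool:
    """True when the latest step is identical to the one before it."""
    return len(history) >= 2 and _key(history[-1]) == _key(history[-2])


def has_cycle(history: List[Step], max_period: int = 3) -> bool:
    """True when the last `period` steps repeat the `period` steps before them."""
    keys = [_key(s) for s in history]
    for period in range(2, max_period + 1):
        if len(keys) >= 2 * period and keys[-period:] == keys[-2 * period:-period]:
            return True
    return False


def failure_indicators(history: List[Step], confidence_threshold: float = 0.5) -> List[str]:
    """Return the names of all indicators that fire on the current state."""
    if not history:
        return []
    fired = []
    if is_repeated(history):
        fired.append("repeated_action")
    last = history[-1]
    if last.confidence is not None and last.confidence < confidence_threshold:
        fired.append("low_confidence")
    if has_cycle(history):
        fired.append("cycle")
    return fired


loop = [Step("SEARCH", "a"), Step("READ", "b")] * 2
print(failure_indicators(loop))
```

Because every check is a pure function of the step history, the detector is deterministic and cheap, which is what makes it auditable with precision and recall measurements.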
What would settle it
A controlled experiment on a held-out web-search task set where agents using the program functions show no performance increase or a decrease compared with the baseline ReAct agent would disprove the central claim.
Original abstract
Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HASP, a framework that upgrades textual skills for LLM agents into executable Program Functions (PFs). These PFs function as active guardrails that detect failure-prone states in real time and intervene by modifying the next action or injecting corrective context. The approach is modular and can be deployed at inference time for direct loop intervention, during post-training for structured supervision, or via controlled evolution of teacher-validated PFs for self-improvement. Empirical results on web-search reasoning, math, and coding tasks report a 25% average performance gain from inference-time PFs over multi-loop ReAct and a 30.4% gain from post-training plus evolution over Search-R1, supported by mechanism analysis of PF triggering, skill internalization, and library evolution.
Significance. If the reported gains prove robust, the work would be significant for shifting LLM agent skill use from passive textual advice to executable, state-triggered interventions that directly address failure modes in long-horizon tasks. The modularity across inference, post-training, and evolution stages offers practical flexibility, and the mechanism analysis provides useful insights into how skills are internalized. The substantial percentage improvements over established baselines like ReAct and Search-R1, if reproducible with proper controls, could influence agent design in AI research.
major comments (3)
- [Mechanism Analysis] The central performance claims (25% gain over ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) depend on accurate real-time detection of failure-prone states and safe PF intervention. However, the mechanism analysis and experimental sections provide no precision, recall, false-positive rates, or error-injection analysis for the detection step or PF execution, leaving open the possibility that gains arise from the detection loop's extra compute or selective reporting rather than the PF mechanism itself.
- [Abstract and Experimental Results] Abstract and results sections: the headline gains are presented without error bars, exact baseline configurations (e.g., number of loops or prompting details for multi-loop ReAct), or explicit criteria for selecting failure states and PFs. This makes it difficult to assess whether the averages are robust or influenced by post-hoc choices, directly affecting the load-bearing empirical support for the framework's superiority.
- [Failure State Detection and PF Intervention] The framework assumes failure-prone states can be reliably detected without introducing new failure modes, yet no quantitative validation (e.g., before/after intervention error rates or ablation on detection threshold) is reported. This assumption is load-bearing for the claim that PFs provide net corrective benefit across web-search, math, and coding tasks.
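The detection-quality metrics requested in the first and third comments can be computed from per-step detector outputs and human labels. The sketch below is a generic illustration, not the paper's evaluation code; the function name and the toy label lists are assumptions.

```python
def detection_metrics(predicted, actual):
    """Precision, recall, and false-positive rate for a binary failure-state detector.

    predicted[i] is True when the detector fired on step i;
    actual[i] is True when an annotator marked step i as failure-prone.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "fpr": false_positive_rate}


# Toy labels for illustration only; these are NOT data from the paper.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(detection_metrics(predicted, actual))
```

A high false-positive rate matters here because every spurious trigger is an unnecessary intervention that could itself derail a healthy trajectory.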
minor comments (2)
- [Notation and Terminology] Ensure consistent capitalization and acronym usage for 'Program Function (PF)' and 'HASP' across all sections and figures.
- [Experimental Setup] Consider adding a table summarizing baseline configurations, number of runs, and statistical significance tests to improve reproducibility of the reported percentage gains.
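As a rough illustration of what such a summary table would contain, the sketch below aggregates per-seed scores into a mean and sample standard deviation and computes a relative gain. All numbers are placeholders, and the choice to report relative rather than absolute gain is an assumption, since the page does not say which one the headline percentages use.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_seed):
    """Return (mean, sample standard deviation) for one method's per-seed scores."""
    if len(scores_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores_by_seed), stdev(scores_by_seed)


def relative_gain(method_mean, baseline_mean):
    """Relative improvement of a method over a baseline, in percent."""
    return 100.0 * (method_mean - baseline_mean) / baseline_mean


# Placeholder scores for illustration only; these are NOT results from the paper.
baseline = [40.0, 42.0, 41.0]
method = [50.0, 52.0, 51.0]
b_mean, b_std = summarize_runs(baseline)
m_mean, m_std = summarize_runs(method)
print(f"baseline {b_mean:.1f} +/- {b_std:.1f}")
print(f"method   {m_mean:.1f} +/- {m_std:.1f}")
print(f"relative gain {relative_gain(m_mean, b_mean):.1f}%")
```

Reporting the spread next to each mean lets a reader judge whether a gap between methods exceeds seed-to-seed noise.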
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional quantitative analyses, experimental clarifications, and robustness checks as suggested.
- Referee: [Mechanism Analysis] The central performance claims (25% gain over ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) depend on accurate real-time detection of failure-prone states and safe PF intervention. However, the mechanism analysis and experimental sections provide no precision, recall, false-positive rates, or error-injection analysis for the detection step or PF execution, leaving open the possibility that gains arise from the detection loop's extra compute or selective reporting rather than the PF mechanism itself.
Authors: We agree that explicit metrics for detection quality strengthen the claims. In the revised manuscript we have added precision, recall, and false-positive rates for the failure-state detector (evaluated on a held-out set of 200 human-annotated trajectories) in the expanded mechanism-analysis section. We also include an error-injection experiment that deliberately triggers failure-prone states and measures intervention success. To address the compute concern, we now report a controlled comparison in which the ReAct baseline is granted additional reasoning loops calibrated to match HASP’s detection overhead; the performance gap remains. Full per-task results are reported without selective omission. revision: yes
- Referee: [Abstract and Experimental Results] Abstract and results sections: the headline gains are presented without error bars, exact baseline configurations (e.g., number of loops or prompting details for multi-loop ReAct), or explicit criteria for selecting failure states and PFs. This makes it difficult to assess whether the averages are robust or influenced by post-hoc choices, directly affecting the load-bearing empirical support for the framework's superiority.
Authors: We have revised both the abstract and the experimental-results section to include error bars (standard deviation across three random seeds). We now specify the exact multi-loop ReAct configuration (five reasoning loops, identical base prompt and tool-use format as HASP) and provide the full prompting templates in the appendix. We have also added an explicit subsection describing the failure-state selection criteria (a fixed set of observable indicators: repeated actions, confidence below threshold, or cycle detection) and the PF curation protocol (teacher validation plus automated syntax checks). These details are now present for all three task domains. revision: yes
- Referee: [Failure State Detection and PF Intervention] The framework assumes failure-prone states can be reliably detected without introducing new failure modes, yet no quantitative validation (e.g., before/after intervention error rates or ablation on detection threshold) is reported. This assumption is load-bearing for the claim that PFs provide net corrective benefit across web-search, math, and coding tasks.
Authors: We acknowledge the need for direct validation of net benefit. The revised manuscript now reports before/after intervention error rates (Table 3) showing consistent reductions across domains. We further include an ablation varying the detection threshold from 0.5 to 0.9 and plot the resulting task performance; the chosen operating point yields the best trade-off. These additions demonstrate that PF interventions deliver net corrective value without introducing measurable new failure modes under the reported conditions. revision: yes
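The threshold ablation described in this response amounts to a one-dimensional sweep. The sketch below shows the shape of such a sweep under stated assumptions: `toy_evaluate` is a stand-in objective invented for illustration, and in a real run it would be replaced by a function that executes the agent at the given threshold and returns task performance.

```python
def sweep_thresholds(evaluate, thresholds):
    """Evaluate each threshold and return (best_threshold, all_results)."""
    results = {t: evaluate(t) for t in thresholds}
    best = max(results, key=results.get)
    return best, results


def toy_evaluate(threshold):
    # Stand-in objective that peaks at 0.7; NOT measured data from the paper.
    return 1.0 - abs(threshold - 0.7)


thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
best, results = sweep_thresholds(toy_evaluate, thresholds)
for t in thresholds:
    print(f"threshold={t:.1f} score={results[t]:.2f}")
print("best threshold:", best)
```

Selecting the operating point on the same data used for the headline numbers would reintroduce the post-hoc concern raised by the referee, so the sweep is best run on a separate validation split.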
Circularity Check
No significant circularity: empirical claims rest on external baselines
The paper's central claims consist of measured performance gains (25% over multi-loop ReAct at inference time; 30.4% over Search-R1 after post-training and evolution) on web-search, math, and coding tasks. These are direct comparisons to independently published external methods rather than to any quantities defined inside the paper's own equations, fitted parameters, or self-referential definitions. The HASP framework is introduced as a modular design choice (PFs as executable guardrails activating on failure-prone states) whose value is demonstrated empirically; no derivation chain reduces the reported results to inputs by construction, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz or renaming of known results is presented as a first-principles derivation. The mechanism analysis is described as providing insights into triggering and internalization but does not substitute for the external benchmark comparisons.
Axiom & Free-Parameter Ledger
free parameters (1)
- failure-state detection threshold
axioms (1)
- domain assumption: LLM agents operate in identifiable failure-prone states where external intervention improves outcomes
invented entities (1)
- Program Function (PF): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
- [2] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling (COLM), 2024.
- [3] Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use, 2025.
- [4] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274, 2023.
- [5] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
- [6] Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. VerlTool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055, 2025.
- [7] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026.
- [8] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
- [9] Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization, 2026.
- [10] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, 2024.
- [11] Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. ReSearch: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025.
- [12] Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025.
- [13] Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025.
- [14] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
- [15] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
- [16] Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025.
- [17] Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383, 2025.
- [18] Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis, 2025.
- [19] Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. Posterior-grpo: Rewarding reasoning processes in code generation, 2025.
- [20] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
- [21] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023.
- [22] Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents, 2026.
- [23] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026.
- [24] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. Evolver: Self-evolving llm agents through an experience-driven lifecycle, 2025.
- [25] Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library, 2026.
- [26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018.
- [27] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
- [28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition, 2022.
- [29] Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024.
- [30] MAA. American mathematics competitions. In American Mathematics Competitions, 2023.
- [31] Nathan Lile. Math twenty four (24s game) dataset. https://huggingface.co/datasets/nlile/24-game, 2024.
- [32] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [33] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
- [34] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [35] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.
- [36] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... Qwen2.5 technical report, 2025.
- [37] Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479, 2025.
- [38] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
- [39] Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding, 2025.
Appendix A Limitations
HASP has several limitations. First, while the core PF-only setting does not require a teacher, some stronger variants use external teachers for PF selection, teacher review,...
- [40] A SKILL.md specification (YAML frontmatter + markdown body)
- [41] A ProgramFunction Python class with should_activate() and intervene() methods.
  IMPORTANT RULES:
  - The PF must be deterministic (NO LLM calls in should_activate).
  - should_activate receives: step_context (dict), action_type (str: "SEARCH"/"READ"/"FINAL"), arg (str).
  - intervene receives the same + optional teacher model, and returns an Intervention.
  - The ...
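The interface described in item [41] can be sketched as follows. This is a minimal illustration, not the paper's code: the method names and argument list follow the excerpt above, but the `Intervention` fields, the `past_queries` key in `step_context`, the `RepeatedSearchPF` class, and the `apply_pfs` helper are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Intervention:
    # Hypothetical container; the source only says intervene() "returns an Intervention".
    kind: str      # e.g. "INJECT_CONTEXT", a kind that appears in the paper's trace
    payload: str   # corrective text to add to the agent's context


class RepeatedSearchPF:
    """Illustrative PF: fires when the agent is about to repeat a past search query."""

    def should_activate(self, step_context: Dict[str, Any], action_type: str, arg: str) -> bool:
        # Deterministic check only: no LLM calls here, per the stated rules.
        if action_type != "SEARCH":
            return False
        past = step_context.get("past_queries", [])  # assumed key, not from the paper
        return arg.strip().lower() in {q.strip().lower() for q in past}

    def intervene(self, step_context: Dict[str, Any], action_type: str, arg: str,
                  teacher: Optional[Any] = None) -> Intervention:
        # The optional teacher model is accepted but unused in this sketch.
        return Intervention(
            kind="INJECT_CONTEXT",
            payload=f"You already searched for {arg!r}; rephrase the query or READ a result.",
        )


def apply_pfs(pfs, step_context, action_type, arg):
    """Return the first triggered intervention, or None to let the action through."""
    for pf in pfs:
        if pf.should_activate(step_context, action_type, arg):
            return pf.intervene(step_context, action_type, arg)
    return None


ctx = {"past_queries": ["capital of France"]}
print(apply_pfs([RepeatedSearchPF()], ctx, "SEARCH", "Capital of France "))
print(apply_pfs([RepeatedSearchPF()], ctx, "READ", "doc_3"))
```

Keeping `should_activate` free of model calls means the trigger can be unit-tested and its firing rate measured offline, while any model-assisted reasoning is confined to `intervene`.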
- [42] FORMAT: stdin script with input()/print()
- [43] BRANCHES:
  - sum is odd -> No
  - sum == 0 -> No
  - any two sum to the third -> Yes
  - otherwise -> No"
  ACTION FINAL("""
    a, b, c = map(int, input().split())
    if (a + b + c) % 2 == 1:
        print("No")
    elif a + b + c == 0:
        print("No")
    elif a + b == c or a + c == b or b + c == a:
        print("Yes")
    else:
        print("No")
  """)
  JUDGE
  test 0 (3 8 5) passed (3+5==8)
  test 1 (2 2 2) FA...
- [44] BRANCHES:
  - all equal -> Yes <-- missed by baseline
  - sum odd -> No
  - sum <= 3 -> No
  - sum == 4 or 6 -> Yes
  - any two sum to third -> Yes
  - otherwise -> No"
  >> [PF FIRED] code_edge_cases -> INJECT_CONTEXT
  trigger : FINAL with stdin parsing + branching code
  intervention: phase instruction reminds: "Trace each sample input through your branches before FINAL...
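The reminder to trace each sample input through the branches can be reproduced by hand. The sketch below wraps the baseline branches from item [43] in a function and runs the two judge inputs shown in the trace. The expected answer for "2 2 2" is inferred from the "all equal -> Yes" branch in item [44] and the truncated failure line, so treat it as an assumption; the helper names are hypothetical.

```python
def baseline_answer(line: str) -> str:
    """Baseline branches from the excerpt, taking one input line instead of stdin."""
    a, b, c = map(int, line.split())
    if (a + b + c) % 2 == 1:
        return "No"
    elif a + b + c == 0:
        return "No"
    elif a + b == c or a + c == b or b + c == a:
        return "Yes"
    else:
        return "No"


def trace_samples(samples):
    """Return the inputs whose baseline output differs from the expected answer."""
    mismatches = []
    for line, expected in samples:
        got = baseline_answer(line)
        print(f"input={line!r} got={got} expected={expected}")
        if got != expected:
            mismatches.append(line)
    return mismatches


# Expected answers: "3 8 5" passed per the trace; "2 2 2" is inferred (assumption).
samples = [("3 8 5", "Yes"), ("2 2 2", "Yes")]
mismatches = trace_samples(samples)
print("mismatches:", mismatches)
```

Running the trace before submitting surfaces the missed all-equal case without any model call, which is the kind of cheap check the injected context asks the agent to perform.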