SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

Bangzheng Pu; Dong Dong; Duling Xu; Jialin Li; Jiawei Guan; Zaifeng Pan; Zheng Chen

arxiv: 2605.15215 · v1 · pith:LRAFZOQ3new · submitted 2026-05-12 · 💻 cs.AI · cs.SE

Duling Xu, Zheng Chen, Zaifeng Pan, Jiawei Guan, Dong Dong, Jialin Li, Bangzheng Pu

Pith reviewed 2026-05-19 17:50 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords LLM agents · skill compilation · runtime interfaces · boundary extraction · agent efficiency · SkillsBench

The pith

SkillSmith compiles agent skills offline into minimal boundary-guided interfaces to cut redundant context and reasoning in LLM systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents commonly receive full skill descriptions as context for each task, which injects irrelevant details and forces repeated planning steps. SkillSmith instead compiles entire skill packages ahead of time by extracting their operational boundaries into compact executable interfaces. At runtime the agent calls only the relevant boundary-defined components rather than the original text. On the SkillsBench benchmark this yields large drops in tokens used, thinking steps, and wall-clock time while also allowing interfaces built by a strong model to boost accuracy when a weaker model runs them. The result is a shift from on-the-fly skill interpretation to pre-compiled, minimal runtime access.

Core claim

SkillSmith is a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces by extracting fine-grained operational boundaries from skills, enabling agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead.
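To make the core claim concrete, here is a minimal sketch of what "compile offline, then inject only the relevant component" could look like. Everything in it is an assumption for illustration: the `Boundary` and `CompiledSkill` types, the keyword-based `select` rule, and the toy PDF skill are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Boundary:
    """One compiled unit of a skill (hypothetical shape, not the paper's schema)."""
    name: str
    triggers: frozenset   # keywords that make this unit relevant to a task
    instructions: str     # the only text injected when the unit is selected


@dataclass
class CompiledSkill:
    boundaries: list = field(default_factory=list)

    def select(self, task: str) -> list:
        """Return only the boundaries whose triggers appear in the task text."""
        words = set(task.lower().split())
        return [b for b in self.boundaries if b.triggers & words]


def compile_skill(sections: dict) -> CompiledSkill:
    """Offline step: split a skill into units keyed by section title.

    A real compiler would need a parser or an LLM; here the title words
    simply stand in for extracted triggers.
    """
    compiled = CompiledSkill()
    for title, body in sections.items():
        triggers = frozenset(title.lower().split())
        compiled.boundaries.append(Boundary(title, triggers, body))
    return compiled


# Toy skill text, invented for this example.
raw_skill = {
    "merge pdf": "Concatenate the input files in the order given.",
    "rotate pdf": "Rotate each page by the requested angle.",
    "extract text": "Return the text layer; run OCR only if it is missing.",
}
skill = compile_skill(raw_skill)

# Runtime step: inject only the matching unit instead of the whole skill.
selected = skill.select("please rotate this scanned file")
context = "\n".join(b.instructions for b in selected)
raw_context = "\n".join(raw_skill.values())
print(len(context), "of", len(raw_context), "characters injected")
```

The design point the sketch tries to capture is that selection happens against a precomputed index, so the runtime model never sees the sections it does not need.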

What carries the argument

Extraction of fine-grained operational boundaries from skill descriptions to produce minimal executable interfaces for dynamic runtime access.

Load-bearing premise

Extracting fine-grained operational boundaries from skill descriptions is feasible and preserves all task-relevant behavior so agents never need to fall back to the full original skill text.

What would settle it

If agents using the compiled interfaces must revert to full skill text on many tasks or show lower success rates than raw-skill baselines on SkillsBench or a held-out task set, the efficiency and accuracy claims would not hold.

Figures

Figures reproduced from arXiv: 2605.15215 by Bangzheng Pu, Dong Dong, Duling Xu, Jialin Li, Jiawei Guan, Zaifeng Pan, Zheng Chen.

**Figure 2.** SkillSmith system overview. Accompanying text: "… workflow runtimes for developer-authored graphs. SkillSmith focuses on a different layer: converting reusable skill specifications into structured, source-grounded, and resumable workflow artifacts. This reduces repeated skill interpretation while preserving selective LLM invocation for steps that genuinely require generation."

**Figure 3.** Boundary contract as a static runtime ABI record.

**Figure 4.** Runtime interpretation of a boundary contract as a guarded state machine.

**Figure 5.** Overall runtime benefits across seven SkillsBench tasks.

**Figure 6.** Cross-model correctness and runtime benefit summary.

**Figure 7.** Harness-level solve-stage reductions.
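The captions for Figures 3 and 4 describe a boundary contract that the runtime interprets as a guarded state machine. The sketch below shows one plausible reading of that idea: each step has a guard that must hold before its action runs, and execution stops at the first failed guard. The `Step` type, the `run_contract` function, and the CSV example are all hypothetical; the paper's actual contract fields are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One state of the contract: a guard that must hold, then an action."""
    name: str
    guard: Callable[[dict], bool]
    action: Callable[[dict], dict]


def run_contract(steps, state: dict):
    """Execute steps in order and stop at the first guard that fails.

    Returns (status, state, trace) so a caller could resume or fall back.
    """
    trace = []
    for step in steps:
        if not step.guard(state):
            return ("blocked:" + step.name, state, trace)
        state = step.action(state)
        trace.append(step.name)
    return ("done", state, trace)


# Hypothetical three-step contract for a CSV summarisation skill.
steps = [
    Step("load", lambda s: "path" in s, lambda s: {**s, "rows": [3, 5, 8]}),
    Step("validate", lambda s: len(s["rows"]) > 0, lambda s: {**s, "valid": True}),
    Step("summarise", lambda s: s.get("valid", False),
         lambda s: {**s, "total": sum(s["rows"])}),
]

status, final_state, trace = run_contract(steps, {"path": "data.csv"})
print(status, final_state["total"], trace)

# Missing input: the first guard fails and nothing is executed.
blocked_status, _, blocked_trace = run_contract(steps, {})
print(blocked_status, blocked_trace)
```

Returning the trace alongside the status is one way to support the "resumable" property mentioned in the Figure 2 text, since a caller can see exactly which steps completed.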

Original abstract

Recently, skills have been widely adopted in large language model (LLM)-based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task-solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill-specific reasoning and planning. To this end, we propose SkillSmith, a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine-grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In the evaluation on SkillsBench benchmark, SkillSmith reduces solve-stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token-proportional monetary cost by 57.44% compared with using raw-skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at https://github.com/AetherHeart-AI/Aeloon.
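Two of the abstract's figures follow from simple arithmetic, which the short check below works through. It assumes that "2.02x faster" is derived from the 50.57% solve-time reduction, and that "token-proportional" cost means a single flat price per token; the function names and the example price are illustrative only.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in time into a speedup factor."""
    remaining = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining


def cost_reduction_pct(token_reduction_pct: float, price_per_token: float) -> float:
    """Cost reduction when cost = price_per_token * tokens (the price cancels)."""
    baseline_tokens = 1_000_000  # arbitrary; any positive value gives the same result
    baseline_cost = price_per_token * baseline_tokens
    new_cost = baseline_cost * (1.0 - token_reduction_pct / 100.0)
    return 100.0 * (baseline_cost - new_cost) / baseline_cost


speedup = speedup_from_reduction(50.57)
cost_cut = cost_reduction_pct(57.44, price_per_token=2e-6)
print(f"{speedup:.2f}x faster, {cost_cut:.2f}% cheaper")
# prints "2.02x faster, 57.44% cheaper"
```

Under a flat price the cost reduction is identical to the token reduction by construction, so the 57.44% cost figure carries no information beyond the token figure. With separate input and output prices the two numbers could diverge.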

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkillSmith pre-compiles skills into lean runtime interfaces to cut context and reasoning overhead, with reported 50%+ gains on SkillsBench that depend on the extraction staying lossless.

SkillSmith's core move is to take skill descriptions, extract their operational boundaries offline, and turn them into minimal executable interfaces that the agent can call at runtime instead of receiving the full text. This targets two redundancies: unnecessary context in the prompt and repeated planning over the same skill details. The compiler-runtime split and the cross-model reuse angle (stronger model compiles, weaker model runs) are the parts that feel fresh compared with standard retrieval-plus-injection setups in agent frameworks. They back it with public code and data, which is straightforward to check.

On SkillsBench the numbers are specific: 57% lower solve-stage tokens, 43% fewer thinking iterations, 2x faster solve time, and matching cost reduction versus raw skills. The reuse claim is also concrete, showing accuracy lift when a small model uses the compiled artifact where direct interpretation fails. Those are the practical wins worth looking at.

The soft spot is the boundary extraction step itself. If it drops implicit conditionals, state transitions, or unstated dependencies, the agent will either fail or revert to full skill text and lose the efficiency edge. The abstract gives no coverage metrics, no formal characterization of the extraction, and no failure-case breakdown, so it is not yet clear how often this happens across skill types. Without those details the gains could be tied to the benchmark rather than general.

This is for people who build or tune production LLM agent systems and care about token budgets and latency. A reader already working on skill libraries or tool-use loops would get the most out of the compiler idea and the reuse experiment.

I would send it to peer review. The idea is practical, the measurements are worth a closer look, and the public artifacts make verification feasible even if the extraction robustness needs more evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SkillSmith, a boundary-first compiler-runtime framework for LLM-based agent systems. It compiles skill packages offline into minimal executable interfaces by extracting fine-grained operational boundaries from skill descriptions. This approach aims to reduce redundancy from irrelevant context injection and repeated skill-specific reasoning. On the SkillsBench benchmark, it reports reductions of 57.44% in solve-stage token usage, 42.99% in thinking iterations, 50.57% in solve time (2.02x faster), and 57.44% in token-proportional monetary cost compared to raw-skills usage. It also claims that compiled artifacts from a stronger model can be reused by a smaller runtime model to improve accuracy where raw skill interpretation fails. The source code and data are made available.

Significance. If the boundary extraction is indeed lossless and the efficiency gains are robust, SkillSmith could offer a practical way to optimize agent performance in skill-based systems by minimizing unnecessary context and reasoning overhead, and enabling cross-model reuse of compiled skills. The open availability of code and data is a positive aspect for reproducibility in the field.

major comments (2)

[§3] The procedure for extracting operational boundaries from skill descriptions lacks a formal characterization, algorithm, or coverage metrics over skill complexity (see §3). This is load-bearing for the central claims, as any omitted implicit dependencies, conditional logic, or state transitions would force fallback to full skill text and erase the reported 57.44% token reduction and 2.02x speedup.
[§4] The evaluation reports concrete percentage gains on SkillsBench without error bars, variance measures, or detailed baseline implementation descriptions (see §4). This makes it difficult to confirm that the measured improvements are free of post-hoc selection or unstated advantages in the raw-skills comparison.
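The second major comment asks for variance measures. One common way to attach an interval to a mean reduction is a percentile bootstrap over per-task results, sketched below. The token counts are made up for illustration and do not come from the paper; `bootstrap_ci` is a hypothetical helper, and the number of resamples and the seed are arbitrary choices.

```python
import random
import statistics


def reduction_pct(raw: float, compiled: float) -> float:
    """Percentage reduction of the compiled run relative to the raw-skill run."""
    return 100.0 * (raw - compiled) / raw


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up solve-stage token counts (raw skill, compiled skill) per task.
runs = [(12000, 5200), (8000, 3900), (15000, 6100), (9500, 4800),
        (11000, 4300), (7000, 3600), (13000, 5000)]
reductions = [reduction_pct(raw, comp) for raw, comp in runs]
mean = statistics.fmean(reductions)
low, high = bootstrap_ci(reductions)
print(f"mean reduction {mean:.1f}% (95% CI {low:.1f}% to {high:.1f}%)")
```

With only a handful of tasks the interval will be wide, which is itself useful information when judging whether a headline percentage generalizes.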

minor comments (2)

[Abstract] The abstract could include a one-sentence overview of the boundary extraction approach to better contextualize the results.
[§2] Clarify notation for terms such as 'solve-stage token usage' and 'thinking iterations' on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing SkillSmith. We address each major comment below and indicate the revisions we will incorporate to strengthen the paper.

Referee: [§3] The procedure for extracting operational boundaries from skill descriptions lacks a formal characterization, algorithm, or coverage metrics over skill complexity (see §3). This is load-bearing for the central claims, as any omitted implicit dependencies, conditional logic, or state transitions would force fallback to full skill text and erase the reported 57.44% token reduction and 2.02x speedup.

Authors: We agree that a more formal presentation of the boundary extraction procedure would improve clarity and address concerns about potential omissions. In the revised manuscript, we will add a formal characterization of operational boundaries, include pseudocode for the extraction algorithm, and report coverage metrics evaluated across varying levels of skill complexity. These additions will explicitly demonstrate how implicit dependencies, conditional logic, and state transitions are captured, thereby supporting the validity of the efficiency gains without requiring fallback to full skill text.

revision: yes

Referee: [§4] The evaluation reports concrete percentage gains on SkillsBench without error bars, variance measures, or detailed baseline implementation descriptions (see §4). This makes it difficult to confirm that the measured improvements are free of post-hoc selection or unstated advantages in the raw-skills comparison.

Authors: We acknowledge that the evaluation section would benefit from additional statistical rigor and transparency. In the revision, we will include error bars and variance measures derived from multiple independent runs on SkillsBench. We will also expand the description of the raw-skills baseline implementation, including exact prompting strategies and matching procedures, to enable full reproducibility and rule out any unstated advantages.

revision: yes
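The coverage metrics promised in the first response are not specified anywhere on this page. As a rough illustration of what such a metric might measure, the sketch below assumes each skill has been annotated into statements tagged by kind, and that the compiler records which statements each boundary traces back to. The statement kinds, the ids, and the `coverage_by_kind` function are all hypothetical.

```python
from collections import defaultdict


def coverage_by_kind(statements, covered_ids):
    """Fraction of skill statements of each kind mapped to some boundary."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stmt_id, kind in statements:
        totals[kind] += 1
        if stmt_id in covered_ids:
            hits[kind] += 1
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Hypothetical annotation of one skill: (statement id, kind).
statements = [
    ("s1", "step"), ("s2", "step"), ("s3", "conditional"),
    ("s4", "conditional"), ("s5", "state_transition"), ("s6", "dependency"),
]
# Statement ids that at least one compiled boundary traces back to.
covered = {"s1", "s2", "s3", "s5"}

report = coverage_by_kind(statements, covered)
uncovered = [sid for sid, _ in statements if sid not in covered]
print(report)
print("needs fallback or recompilation:", uncovered)
```

Breaking coverage down by kind matters here because the referee's concern is specifically about conditionals, state transitions, and dependencies; a high overall score could hide a low score on exactly those categories.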

Circularity Check

0 steps flagged

No circularity: empirical measurements against explicit baseline

The paper introduces SkillSmith as a boundary-first compiler-runtime framework and supports its claims solely through direct empirical evaluation on SkillsBench. Reported reductions (57.44% token usage, 42.99% thinking iterations, 50.57% solve time) are measured against an explicit raw-skills baseline rather than being derived from any internal equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the derivation chain; the central efficiency and reuse claims rest on observable runtime behavior, not on quantities defined in terms of themselves. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that skill descriptions contain extractable, stable operational boundaries that can be compiled once and reused without loss of correctness. No free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Skill descriptions contain fine-grained operational boundaries that can be extracted automatically and remain sufficient for correct execution. Invoked in the description of the boundary-first compiler step.

pith-pipeline@v0.9.0 · 5770 in / 1262 out tokens · 33623 ms · 2026-05-19T17:50:08.857771+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "SkillSmith compiles skill packages into minimal executable interfaces... boundary contract B = (τ, O, C_io, R, V, π_a, π_s, F)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "extracting fine-grained operational boundaries from skills"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 7 internal anchors

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

[2] Anthropic. Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills

[3] Anthropic. Agent skills. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview, 2026. Documentation, accessed 2026-05-04.

[4] Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Official announcement. Accessed: 2026-05-07.

[5] Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, 2023.

[6] Le Chen, Erhu Feng, Yubin Xia, and Haibo Chen. SkVM: Revisiting language VM for skills across heterogenous LLMs and harnesses. arXiv preprint arXiv:2604.03088, 2026.

[7] DeepSeek. DeepSeek V4 Preview Release. https://api-docs.deepseek.com/news/news260424, 2026. Official announcement. Accessed: 2026-05-07.

[8] Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Hassan Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. arXiv, 2024.

[9] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023.

[10] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations, 2023.

[11] LangChain. LangGraph persistence. https://docs.langchain.com/oss/python/langgraph/persistence, 2026. Documentation, accessed 2026-05-04.

[12] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X... SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv, 2026.

[13] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.

[14] OpenAI. GPT-5.5 in ChatGPT. https://help.openai.com/en/articles/11909943-gpt-55-in-chatgpt, 2026. Official documentation. Accessed: 2026-05-07.

[15] OpenAI. OpenAI Codex CLI: Getting started. https://help.openai.com/en/articles/11096431-openai-codex-cli-getting-started, 2026. Documentation. Accessed: 2026-03-11.

[16] OpenCode. OpenCode: The open source AI coding agent. https://opencode.ai/, 2026. Open-source software. Accessed: 2026-05-01.

[17] OpenRouter. OpenRouter: A unified interface for language models. https://openrouter.ai/, 2026. Accessed: 2026-05-07.

[18] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems, pages 126544–126565. Curran Associates, Inc., 2024.

[19] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.

[20] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all. https://qwen.ai/blog?id=qwen3.6-35b-a3b, 2026. Official announcement. Accessed: 2026-05-07.

[21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[22] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.

[23] Significant Gravitas. Auto-GPT. https://github.com/significant-gravitas/autogpt, 2023. GitHub repository, accessed 2026-04-24.

[24] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[26] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.

[27] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

[28] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.

[29] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

[30] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems, pages 62557–62583. Curran Associates, Inc., 2024.
