pith · machine review for the scientific record

arxiv: 2604.23853 · v1 · submitted 2026-04-26 · 💻 cs.AI

Recognition: unknown

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · skill distillation · cost attribution · prune patches · TraceCard · agent trajectories · benchmark transfer · cost reduction

The pith

Per-step cost tracking in LLM agent traces enables pruning of wasteful steps, cutting median costs by 32% when the distilled rules transfer to an unseen benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a tracing platform that records every LLM call, tool use, and sub-agent action in a session and summarizes it as a TraceCard containing per-step dollar costs. From these records a distillation process extracts three kinds of skill patches: preserve patches that keep successful behaviors, prune patches that remove expensive steps shown by counterfactuals to be irrelevant, and repair patches that fix failures. Experiments on held-out tasks from one benchmark show that adding cost signals and applying prune patches each reduce quality regressions in the resulting skills. When the same distilled rules are tested on a different benchmark, prune patches transfer successfully and lower median costs by 32 percent, whereas preserve patches trained on benchmark-specific patterns produce performance drops.

Core claim

ClawTrace records full agent sessions and compiles them into compact TraceCards with per-step USD costs, token counts, and redundancy flags. CostCraft reads these cards to produce preserve, prune, and repair skill patches, where prune patches are each supported by a counterfactual argument against a named high-cost step. Ablations confirm that both cost attribution and prune patches independently lower quality regressions on 30 held-out SpreadsheetBench tasks; on 30 unrelated SkillsBench tasks the prune rules transfer and cut median cost by 32 percent while preserve rules cause regressions.
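
The three patch types are easiest to see as data. A minimal sketch of one possible encoding in Python, reusing the patch fields quoted in the paper's appendix prompts (action, rule, evidence, confidence); the class itself and the target_step field are illustrative guesses, not the released schema.

    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class SkillPatch:
        # Field names follow the patch JSON quoted in the paper's appendix;
        # target_step is a guess at how a prune patch names its cost target.
        action: Literal["preserve", "prune", "repair"]
        rule: str                          # natural-language rule for SKILL.md
        evidence: str                      # trace or oracle evidence behind the rule
        confidence: float = 1.0
        target_step: Optional[str] = None  # named high-cost step (prune only)

    patches = [
        SkillPatch("preserve", "Keep the range-validation step before writes",
                   evidence="present in every successful trajectory"),
        SkillPatch("prune", "Drop the full-workbook re-read after each edit",
                   evidence="counterfactual: outcome unchanged without it",
                   target_step="read_workbook#7"),
        SkillPatch("repair", "When a cell is marked pending, compute it before ending the session",
                   evidence="oracle diff showed stale pending cells"),
    ]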

What carries the argument

TraceCard, a compact YAML summary of an agent session that attaches per-step USD cost, token counts, and redundancy flags to every LLM call, tool use, and sub-agent spawn, allowing counterfactual identification of non-essential expensive steps.
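
The paper specifies only that a TraceCard is a compact YAML summary with per-step USD cost, token counts, and redundancy flags (the referee report below asks for exactly such an excerpt). A hypothetical card consistent with that description, with guessed field names rather than the released format:

    session: sb-task-47484          # task ID borrowed from the paper's appendix
    outcome: success
    total_cost_usd: 0.84
    steps:
      - id: llm_call#3
        kind: llm_call
        tokens_in: 2113
        tokens_out: 412
        cost_usd: 0.031
        redundant: false
      - id: read_workbook#7
        kind: tool_use
        tokens_in: 9802
        tokens_out: 55
        cost_usd: 0.030
        redundant: true             # output never referenced downstream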

Load-bearing premise

Per-step costs can be accurately attributed and counterfactual arguments can reliably identify steps that did not affect the outcome without introducing bias or missing context.
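
The attribution half of this premise is mechanically simple once every span logs its own token counts; the contested part is whether charging each span only for its own tokens is the right model for shared context. A naive per-span sketch in Python, with hypothetical per-token prices:

    # Hypothetical per-token prices in USD; real rates depend on the model.
    PRICE_IN = 3e-6
    PRICE_OUT = 15e-6

    def step_cost_usd(tokens_in: int, tokens_out: int) -> float:
        """Naive attribution: charge each span only for its own tokens."""
        return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

    # The redundant workbook re-read from the TraceCard sketch above:
    print(round(step_cost_usd(9802, 55), 4))  # 0.0302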

What would settle it

Direct measurement of median cost and task success rate on SkillsBench tasks after applying prune patches distilled from SpreadsheetBench traces, compared against a no-prune baseline.
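
Concretely, that is a paired comparison over the same 30 tasks. A sketch of the bookkeeping, assuming per-task records of cost and success (the record layout is assumed, not taken from the paper):

    import statistics

    def transfer_report(baseline: list[dict], pruned: list[dict]) -> dict:
        """Each record: {"cost_usd": float, "success": bool}; the two lists
        are paired by task. The paper reports a median cost cut of 32%."""
        med_base = statistics.median(r["cost_usd"] for r in baseline)
        med_pruned = statistics.median(r["cost_usd"] for r in pruned)
        return {
            "median_cost_reduction": 1.0 - med_pruned / med_base,
            "success_rate_delta": (sum(r["success"] for r in pruned)
                                   - sum(r["success"] for r in baseline))
                                  / len(baseline),
        }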

Figures

Figures reproduced from arXiv: 2604.23853 by Boqin Yuan, Jing Qin, Renchu Song, Sen Yang, Yue Su.

Figure 1. End-to-end architecture. Capture: ClawTrace instruments an agent session via eight event hooks. Compile: a deterministic compiler produces a TraceCard per session. Distill: CostCraft emits preserve, prune, and repair patches that merge into an evolved SKILL.md.

Figure 2. ClawTrace execution-path view showing per-span cost attribution, tool-call payloads, and sub-agent nesting.

Figure 3. Quality outcome rates across ablation condi…

Figure 5. Per-task cost on 30 SkillsBench tasks. Hollow circles: baseline; filled circles: CostCraft (green = quality…

Figure 6. Trajectory dashboard. Each row is one agent run, with columns for total cost, token count, step count, and…

Figure 7. Step timeline (Gantt view) for a single trajectory. Each bar is one span; bar length is wall-clock duration. Re…
Original abstract

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn, compiling sessions into TraceCards with per-step USD costs, token counts, and redundancy flags. Built on this, CostCraft distills TraceCards into preserve patches (keep successful behaviors), prune patches (remove expensive non-impactful steps via counterfactuals), and repair patches (fix failures with oracle evidence). Ablations on 30 held-out SpreadsheetBench tasks claim independent quality-regression reductions from cost attribution and pruning; cross-application to 30 SkillsBench tasks shows prune rules transfer (32% median cost cut) while preserve rules regress.

Significance. If the empirical claims hold with adequate controls, the work supplies open infrastructure for cost-aware skill distillation in LLM agents, addressing a gap where pipelines cannot distinguish necessary fixes from wasteful steps. The reported asymmetry in cross-benchmark transfer of prune versus preserve rules is a substantive observation that could guide future distillation design. Open release of ClawTrace and TraceCards is a clear strength for reproducibility.

major comments (2)
  1. [Ablations / Results] Ablations paragraph (abstract and presumed §4/Results): the central claim that cost attribution and prune patches independently reduce quality regressions on 30 held-out SpreadsheetBench tasks lacks any description of experimental setup, baselines, quality metric definition, statistical significance testing, or confound controls. This is load-bearing for the empirical contribution and prevents verification that the data support the stated conclusions.
  2. [Method / CostCraft] Prune patch description (abstract and presumed §3/Method): the counterfactual arguments used to identify high-cost steps that 'did not affect the outcome' are foundational to the prune mechanism, yet no details are provided on how these counterfactuals are constructed, validated, or protected against bias or missing context. This directly affects the reliability of the reported 32% cost reduction on SkillsBench.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'compact YAML summary' for TraceCards would be clearer with a short illustrative excerpt or schema in the main text or appendix.
  2. [Discussion] The cross-benchmark asymmetry is presented as an empirical finding; a brief discussion of possible causes (e.g., benchmark-specific conventions) would improve interpretability without altering the core claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of open infrastructure for cost-aware agent skill distillation. We respond to each major comment below and will update the manuscript with additional details to address the concerns raised.

Point-by-point responses
  1. Referee: [Ablations / Results] Ablations paragraph (abstract and presumed §4/Results): the central claim that cost attribution and prune patches independently reduce quality regressions on 30 held-out SpreadsheetBench tasks lacks any description of experimental setup, baselines, quality metric definition, statistical significance testing, or confound controls. This is load-bearing for the empirical contribution and prevents verification that the data support the stated conclusions.

    Authors: We agree that the current description of the ablations is insufficient for verification. The manuscript will be revised to include a detailed experimental setup in §4: the 30 held-out tasks are randomly sampled from SpreadsheetBench excluding the training set; baselines include a no-cost-attribution variant and a no-prune variant; quality is measured by success rate regression (drop in task completion); we will add results from statistical tests (paired t-test, p < 0.05 reported); and controls for LLM version and prompt consistency. This will substantiate the independent reductions in quality regressions. revision: yes

  2. Referee: [Method / CostCraft] Prune patch description (abstract and presumed §3/Method): the counterfactual arguments used to identify high-cost steps that 'did not affect the outcome' are foundational to the prune mechanism, yet no details are provided on how these counterfactuals are constructed, validated, or protected against bias or missing context. This directly affects the reliability of the reported 32% cost reduction on SkillsBench.

    Authors: The details on counterfactual construction are indeed only briefly mentioned. We will expand §3 to describe: construction via step removal and re-execution on the same initial state using TraceCard context; validation through outcome equivalence checks and sample manual inspection; and bias mitigation by full context inclusion and noting limitations of potential missing information in traces. These additions will clarify the basis for the prune patches and the observed 32% cost reduction transfer to SkillsBench. revision: yes
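
Response 1 promises a paired t-test over the 30 held-out tasks. A minimal sketch of that comparison, assuming per-task quality scores for the full pipeline and one ablated variant:

    from scipy import stats

    def ablation_significant(full_scores: list[float],
                             ablated_scores: list[float],
                             alpha: float = 0.05) -> bool:
        """Paired t-test: the same 30 tasks scored under both conditions.
        Rejecting the null means the ablation changed quality."""
        result = stats.ttest_rel(full_scores, ablated_scores)
        return result.pvalue < alpha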
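
Response 2's removal-and-re-execution check can be sketched the same way, under the strong assumption that a session replays deterministically from the same initial state; replay_session and outcomes_equivalent are hypothetical helpers, not ClawTrace APIs:

    def counterfactual_prunable(steps: list[dict], step_id: str,
                                replay_session, outcomes_equivalent) -> bool:
        """Re-run the session with one named high-cost step removed; the
        step is prunable only if the final outcome is unchanged."""
        ablated = [s for s in steps if s["id"] != step_id]
        return outcomes_equivalent(replay_session(steps),
                                   replay_session(ablated))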

Circularity Check

0 steps flagged

No significant circularity; empirical ablations are self-contained

full rationale

The paper introduces ClawTrace for per-step cost tracing and CostCraft for generating preserve/prune/repair patches, then reports empirical ablations on held-out SpreadsheetBench tasks and cross-benchmark transfer to SkillsBench. No derivation chain, equations, or first-principles predictions are present that could reduce to fitted inputs or self-definitions. Claims of independent benefits from cost attribution and pruning rest on experimental results rather than any self-referential construction. Self-citations are absent from the load-bearing claims, and the open release of TraceCards enables external verification. This is the standard case of an infrastructure-plus-ablations paper with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims depend on accurate cost measurement and valid counterfactual reasoning, which are domain assumptions without independent evidence provided in the abstract.

axioms (2)
  • domain assumption Per-step costs can be precisely attributed to individual actions in agent trajectories.
    Essential for distinguishing useful from costly steps and for the benefits of cost attribution.
  • domain assumption Counterfactual arguments can validly determine if a step did not affect the outcome.
    Basis for creating reliable prune patches.

pith-pipeline@v0.9.0 · 5536 in / 1506 out tokens · 53928 ms · 2026-05-08T05:59:43.829891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    Agent skills enable a new class of realistic and trivially simple prompt injections

    David Schmotz, Sahar Abdelnabi, and Maksym Andriushchenko. Agent skills enable a new class of realistic and trivially simple prompt injections, 2025. URL https://arxiv.org/abs/2510.26328

  2. [2]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills, 2026. URL https://arxiv.org/abs/2603.25158

  3. [3]

    Coevoskills: Self-evolving agent skills via co-evolutionary verification

    Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, and Philip S. Yu. Coevoskills: Self-evolving agent skills via co-evolutionary verification, 2026. URL https://arxiv.org/abs/2604.01687

  4. [4]

    Autoskill: Experience-driven lifelong learning via skill self-evolution

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026. URL https://arxiv.org/abs/2603.01145

  5. [5]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

  6. [6]

    Langsmith evaluation, 2024

    LangChain. Langsmith evaluation, 2024. URL https://docs.langchain.com/langsmith/evaluation. Accessed: 2026-04-24

  7. [7]

    Langfuse documentation, 2024

    Langfuse. Langfuse documentation, 2024. URL https://langfuse.com/docs. Accessed: 2026-04-24

  8. [8]

    Phoenix documentation: LLM evals, 2024

    Arize AI. Phoenix documentation: LLM evals, 2024. URL https://arize.com/docs/phoenix/evaluation/llm-evals. Accessed: 2026-04-24

  9. [9]

    Opentelemetry specification, 2024

    OpenTelemetry Authors. Opentelemetry specification, 2024. URL https://opentelemetry.io/docs/specs/otel/. Accessed: 2026-04-25

  10. [10]

    Spreadsheetbench: Towards challenging real world spreadsheet manipulation, 2024

    Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation, 2024. URL https://arxiv.org/abs/2406.14991

  11. [11]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URL https://arxiv.org/abs/2602.12430

  12. [12]

    Swe-skills-bench: Do agent skills actually help in real-world software engineering?, 2026

    Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. Swe-skills-bench: Do agent skills actually help in real-world software engineering?, 2026. URL https://arxiv.org/abs/2603.15401

  13. [13]

    Evoskill: Automated skill discovery for multi-agent systems

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/2603.02766

  14. [14]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/abs/2504.07079

  15. [15]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291

  16. [16]

    Reflexion: Language agents with verbal reinforcement learning, 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

  17. [17]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory, 2026. URL https://arxiv.org/abs/2509.25140

  18. [18]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026. URL https://arxiv.org/abs/2602.08234

  19. [19]

    Agentops: Enabling observability of llm agents

    Liming Dong, Qinghua Lu, and Liming Zhu. Agentops: Enabling observability of llm agents, 2024. URL https://arxiv.org/abs/2411.05285

  20. [20]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance, 2023. URL https://arxiv.org/abs/2305.05176

  21. [21]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2025. URL https://arxiv.org/abs/2406.18665

  22. [22]

    Puppygraph: Query graph on data lakes, 2024

    PuppyGraph. Puppygraph: Query graph on data lakes, 2024. URL https://www.puppygraph.com/. Accessed: 2026-04-25

  23. [23]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155

  24. [24]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024. URL https://arxiv.org/abs/2308.00352

  25. [25]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society, 2023. URL https://arxiv.org/abs/2303.17760

  26. [26]

    L-mars: Legal multi-agent workflow with orchestrated reasoning and agentic search

    Ziqi Wang and Boqin Yuan. L-mars: Legal multi-agent workflow with orchestrated reasoning and agentic search.

  27. [27]

    URL https://arxiv.org/abs/2509.00761

  28. [28]

    When a cell is marked pending, compute it before ending the session

    Hamel Husain and Shreya Shankar. LLM evals: Everything you need to know, January 2026. URL https://hamel.dev/blog/posts/evals-faq/. Accessed: 2026-04-24. Appendix A ClawTrace Platform Demo: ClawTrace (https://www.clawtrace.ai/) is an observability and optimization platform for OpenClaw agents. Its goal is to make agents better, cheaper, and faster by g...

  29. [29]

    Call inspect_mismatches to identify the failure surface

  30. [30]

    Trace the failure to a specific agent decision or missing step in the TraceCard

  31. [31]

    If needed, call read_gold_snippet to confirm the expected output

  32. [32]

    Output format: JSON with fields action (repair), rule, failure_type, evidence, confidence

    Call final_patch with the repair rule grounded in the observed failure and oracle evidence. Output format: JSON with fields action (repair), rule, failure_type, evidence, confidence. Constraint: If you cannot diagnose the failure within 3 tool calls, emit a low-confidence patch. The merge step will deprioritize it. C.3 Merge Operator Prompt: Merge Operator System...

  33. [33]

    Repair patches with causal diagnosis (highest)

  34. [34]

    Prune patches with a named cost target and counterfactual

  35. [35]

    Preserve patches that appear in two or more trajectories

  36. [36]

    sb-task-47484

    Singleton preserve patches (drop these). Conflict resolution: When two patches target the same behavior, repair supersedes prune, which supersedes preserve. When two patches of the same type conflict, keep the one with stronger evidence. Output structure: The merged skill must have exactly five sections: 1. Trigger: when the skill applies. 2. Workflow: step-b...
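
The conflict-resolution rule quoted in the anchors above is concrete enough to sketch: repair supersedes prune, which supersedes preserve, and ties within a type go to the patch with stronger evidence. A guess at the merge operator's core, not the released code, with evidence strength reduced to a single confidence score:

    PRECEDENCE = {"repair": 3, "prune": 2, "preserve": 1}

    def resolve_conflict(patches: list[dict]) -> dict:
        """Among patches targeting the same behavior, keep the winner:
        higher patch type first, then stronger evidence (confidence)."""
        return max(patches, key=lambda p: (PRECEDENCE[p["action"]],
                                           p["confidence"]))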