pith · machine review for the scientific record

arxiv: 2604.23853 · v1 · submitted 2026-04-26 · 💻 cs.AI

Recognition: unknown

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · skill distillation · cost attribution · prune patches · TraceCard · agent trajectories · benchmark transfer · cost reduction

The pith

Per-step cost tracking in LLM agent traces enables pruning of wasteful steps, cutting median costs by 32% when the distilled rules transfer to an unseen benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a tracing platform that records every LLM call, tool use, and sub-agent action in a session and summarizes it as a TraceCard containing per-step dollar costs. From these records a distillation process extracts three kinds of skill patches: preserve patches that keep successful behaviors, prune patches that remove expensive steps shown by counterfactuals to be irrelevant, and repair patches that fix failures. Experiments on held-out tasks from one benchmark show that adding cost signals and applying prune patches each reduce quality regressions in the resulting skills. When the same distilled rules are tested on a different benchmark, prune patches transfer successfully and lower median costs by 32 percent, whereas preserve patches trained on benchmark-specific patterns produce performance drops.

Core claim

ClawTrace records full agent sessions and compiles them into compact TraceCards with per-step USD costs, token counts, and redundancy flags. CostCraft reads these cards to produce preserve, prune, and repair skill patches, where prune patches are each supported by a counterfactual argument against a named high-cost step. Ablations confirm that both cost attribution and prune patches independently lower quality regressions on 30 held-out SpreadsheetBench tasks; on 30 unrelated SkillsBench tasks the prune rules transfer and cut median cost by 32 percent while preserve rules cause regressions.
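
The three patch types are easiest to see as data. A minimal sketch of one possible encoding in Python, reusing the patch fields quoted in the paper's appendix prompts (action, rule, evidence, confidence); the class itself and the target_step field are illustrative guesses, not the released schema.

    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class SkillPatch:
        # Field names follow the patch JSON quoted in the paper's appendix;
        # target_step is a guess at how a prune patch names its cost target.
        action: Literal["preserve", "prune", "repair"]
        rule: str                          # natural-language rule for SKILL.md
        evidence: str                      # trace or oracle evidence behind the rule
        confidence: float = 1.0
        target_step: Optional[str] = None  # named high-cost step (prune only)

    patches = [
        SkillPatch("preserve", "Keep the range-validation step before writes",
                   evidence="present in every successful trajectory"),
        SkillPatch("prune", "Drop the full-workbook re-read after each edit",
                   evidence="counterfactual: outcome unchanged without it",
                   target_step="read_workbook#7"),
        SkillPatch("repair", "When a cell is marked pending, compute it before ending the session",
                   evidence="oracle diff showed stale pending cells"),
    ]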

What carries the argument

TraceCard, a compact YAML summary of an agent session that attaches per-step USD cost, token counts, and redundancy flags to every LLM call, tool use, and sub-agent spawn, allowing counterfactual identification of non-essential expensive steps.
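
The paper specifies only that a TraceCard is a compact YAML summary with per-step USD cost, token counts, and redundancy flags (the referee report below asks for exactly such an excerpt). A hypothetical card consistent with that description, with guessed field names rather than the released format:

    session: sb-task-47484          # task ID borrowed from the paper's appendix
    outcome: success
    total_cost_usd: 0.84
    steps:
      - id: llm_call#3
        kind: llm_call
        tokens_in: 2113
        tokens_out: 412
        cost_usd: 0.031
        redundant: false
      - id: read_workbook#7
        kind: tool_use
        tokens_in: 9802
        tokens_out: 55
        cost_usd: 0.030
        redundant: true             # output never referenced downstream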

Load-bearing premise

Per-step costs can be accurately attributed and counterfactual arguments can reliably identify steps that did not affect the outcome without introducing bias or missing context.
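
The attribution half of this premise is mechanically simple once every span logs its own token counts; the contested part is whether charging each span only for its own tokens is the right model for shared context. A naive per-span sketch in Python, with hypothetical per-token prices:

    # Hypothetical per-token prices in USD; real rates depend on the model.
    PRICE_IN = 3e-6
    PRICE_OUT = 15e-6

    def step_cost_usd(tokens_in: int, tokens_out: int) -> float:
        """Naive attribution: charge each span only for its own tokens."""
        return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

    # The redundant workbook re-read from the TraceCard sketch above:
    print(round(step_cost_usd(9802, 55), 4))  # 0.0302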

What would settle it

Direct measurement of median cost and task success rate on SkillsBench tasks after applying prune patches distilled from SpreadsheetBench traces, compared against a no-prune baseline.
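
Concretely, that is a paired comparison over the same 30 tasks. A sketch of the bookkeeping, assuming per-task records of cost and success (the record layout is assumed, not taken from the paper):

    import statistics

    def transfer_report(baseline: list[dict], pruned: list[dict]) -> dict:
        """Each record: {"cost_usd": float, "success": bool}; the two lists
        are paired by task. The paper reports a median cost cut of 32%."""
        med_base = statistics.median(r["cost_usd"] for r in baseline)
        med_pruned = statistics.median(r["cost_usd"] for r in pruned)
        return {
            "median_cost_reduction": 1.0 - med_pruned / med_base,
            "success_rate_delta": (sum(r["success"] for r in pruned)
                                   - sum(r["success"] for r in baseline))
                                  / len(baseline),
        }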

Figures

Figures reproduced from arXiv: 2604.23853 by Boqin Yuan, Jing Qin, Renchu Song, Sen Yang, Yue Su.

Figure 1. End-to-end architecture. Capture: ClawTrace instruments an agent session via eight event hooks. Compile: a deterministic compiler produces a TraceCard per session. Distill: CostCraft emits preserve, prune, and repair patches that merge into an evolved SKILL.md.

Figure 2. ClawTrace execution-path view showing per-span cost attribution, tool-call payloads, and sub-agent nesting.

Figure 3. Quality outcome rates across ablation condi…

Figure 5. Per-task cost on 30 SkillsBench tasks. Hollow circles: baseline; filled circles: CostCraft (green = quality…

Figure 6. Trajectory dashboard. Each row is one agent run, with columns for total cost, token count, step count, and…

Figure 7. Step timeline (Gantt view) for a single trajectory. Each bar is one span; bar length is wall-clock duration. Re…
Original abstract

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn, compiling sessions into TraceCards with per-step USD costs, token counts, and redundancy flags. Built on this, CostCraft distills TraceCards into preserve patches (keep successful behaviors), prune patches (remove expensive non-impactful steps via counterfactuals), and repair patches (fix failures with oracle evidence). Ablations on 30 held-out SpreadsheetBench tasks claim independent quality-regression reductions from cost attribution and pruning; cross-application to 30 SkillsBench tasks shows prune rules transfer (32% median cost cut) while preserve rules regress.

Significance. If the empirical claims hold with adequate controls, the work supplies open infrastructure for cost-aware skill distillation in LLM agents, addressing a gap where pipelines cannot distinguish necessary fixes from wasteful steps. The reported asymmetry in cross-benchmark transfer of prune versus preserve rules is a substantive observation that could guide future distillation design. Open release of ClawTrace and TraceCards is a clear strength for reproducibility.

major comments (2)
  1. [Ablations / Results] Ablations paragraph (abstract and presumed §4/Results): the central claim that cost attribution and prune patches independently reduce quality regressions on 30 held-out SpreadsheetBench tasks lacks any description of experimental setup, baselines, quality metric definition, statistical significance testing, or confound controls. This is load-bearing for the empirical contribution and prevents verification that the data support the stated conclusions.
  2. [Method / CostCraft] Prune patch description (abstract and presumed §3/Method): the counterfactual arguments used to identify high-cost steps that 'did not affect the outcome' are foundational to the prune mechanism, yet no details are provided on how these counterfactuals are constructed, validated, or protected against bias or missing context. This directly affects the reliability of the reported 32% cost reduction on SkillsBench.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'compact YAML summary' for TraceCards would be clearer with a short illustrative excerpt or schema in the main text or appendix.
  2. [Discussion] The cross-benchmark asymmetry is presented as an empirical finding; a brief discussion of possible causes (e.g., benchmark-specific conventions) would improve interpretability without altering the core claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of open infrastructure for cost-aware agent skill distillation. We respond to each major comment below and will update the manuscript with additional details to address the concerns raised.

Point-by-point responses
  1. Referee: [Ablations / Results] Ablations paragraph (abstract and presumed §4/Results): the central claim that cost attribution and prune patches independently reduce quality regressions on 30 held-out SpreadsheetBench tasks lacks any description of experimental setup, baselines, quality metric definition, statistical significance testing, or confound controls. This is load-bearing for the empirical contribution and prevents verification that the data support the stated conclusions.

    Authors: We agree that the current description of the ablations is insufficient for verification. The manuscript will be revised to include a detailed experimental setup in §4: the 30 held-out tasks are randomly sampled from SpreadsheetBench excluding the training set; baselines include a no-cost-attribution variant and a no-prune variant; quality is measured by success rate regression (drop in task completion); we will add results from statistical tests (paired t-test, p < 0.05 reported); and controls for LLM version and prompt consistency. This will substantiate the independent reductions in quality regressions. revision: yes

  2. Referee: [Method / CostCraft] Prune patch description (abstract and presumed §3/Method): the counterfactual arguments used to identify high-cost steps that 'did not affect the outcome' are foundational to the prune mechanism, yet no details are provided on how these counterfactuals are constructed, validated, or protected against bias or missing context. This directly affects the reliability of the reported 32% cost reduction on SkillsBench.

    Authors: The details on counterfactual construction are indeed only briefly mentioned. We will expand §3 to describe: construction via step removal and re-execution on the same initial state using TraceCard context; validation through outcome equivalence checks and sample manual inspection; and bias mitigation by full context inclusion and noting limitations of potential missing information in traces. These additions will clarify the basis for the prune patches and the observed 32% cost reduction transfer to SkillsBench. revision: yes
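
Response 1 promises a paired t-test over the 30 held-out tasks. A minimal sketch of that comparison, assuming per-task quality scores for the full pipeline and one ablated variant:

    from scipy import stats

    def ablation_significant(full_scores: list[float],
                             ablated_scores: list[float],
                             alpha: float = 0.05) -> bool:
        """Paired t-test: the same 30 tasks scored under both conditions.
        Rejecting the null means the ablation changed quality."""
        result = stats.ttest_rel(full_scores, ablated_scores)
        return result.pvalue < alpha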
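
Response 2's removal-and-re-execution check can be sketched the same way, under the strong assumption that a session replays deterministically from the same initial state; replay_session and outcomes_equivalent are hypothetical helpers, not ClawTrace APIs:

    def counterfactual_prunable(steps: list[dict], step_id: str,
                                replay_session, outcomes_equivalent) -> bool:
        """Re-run the session with one named high-cost step removed; the
        step is prunable only if the final outcome is unchanged."""
        ablated = [s for s in steps if s["id"] != step_id]
        return outcomes_equivalent(replay_session(steps),
                                   replay_session(ablated))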

Circularity Check

0 steps flagged

No significant circularity; empirical ablations are self-contained

full rationale

The paper introduces ClawTrace for per-step cost tracing and CostCraft for generating preserve/prune/repair patches, then reports empirical ablations on held-out SpreadsheetBench tasks and cross-benchmark transfer to SkillsBench. No derivation chain, equations, or first-principles predictions are present that could reduce to fitted inputs or self-definitions. Claims of independent benefits from cost attribution and pruning rest on experimental results rather than any self-referential construction. Self-citations are absent from the load-bearing claims, and the open release of TraceCards enables external verification. This is the standard case of an infrastructure-plus-ablations paper with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims depend on accurate cost measurement and valid counterfactual reasoning, which are domain assumptions without independent evidence provided in the abstract.

axioms (2)
  • domain assumption Per-step costs can be precisely attributed to individual actions in agent trajectories.
    Essential for distinguishing useful from costly steps and for the benefits of cost attribution.
  • domain assumption Counterfactual arguments can validly determine if a step did not affect the outcome.
    Basis for creating reliable prune patches.

pith-pipeline@v0.9.0 · 5536 in / 1506 out tokens · 53928 ms · 2026-05-08T05:59:43.829891+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    Agent skills enable a new class of realistic and trivially simple prompt injections

    David Schmotz, Sahar Abdelnabi, and Maksym Andriushchenko. Agent skills enable a new class of realistic and trivially simple prompt injections, 2025. URL https://arxiv.org/abs/2510.26328

  2. [2]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills, 2026. URL https://arxiv.org/abs/2603.25158

  3. [3]

    Coevoskills: Self-evolving agent skills via co-evolutionary verification

    Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, and Philip S. Yu. Coevoskills: Self-evolving agent skills via co-evolutionary verification, 2026. URL https://arxiv.org/abs/2604.01687

  4. [4]

    Autoskill: Experience-driven lifelong learning via skill self-evolution

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026. URL https://arxiv.org/abs/2603.01145

  5. [5]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

  6. [6]

    Langsmith evaluation, 2024

    LangChain. Langsmith evaluation, 2024. URL https://docs.langchain.com/langsmith/evaluation. Accessed: 2026-04-24

  7. [7]

    Langfuse documentation, 2024

    Langfuse. Langfuse documentation, 2024. URL https://langfuse.com/docs. Accessed: 2026-04-24

  8. [8]

    Phoenix documentation: LLM evals, 2024

    Arize AI. Phoenix documentation: LLM evals, 2024. URL https://arize.com/docs/phoenix/evaluation/llm-evals. Accessed: 2026-04-24

  9. [9]

    Opentelemetry specification, 2024

    OpenTelemetry Authors. Opentelemetry specification, 2024. URL https://opentelemetry.io/docs/specs/otel/. Accessed: 2026-04-25

  10. [10]

    Spreadsheetbench: Towards challenging real world spreadsheet manipulation, 2024

    Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation, 2024. URL https://arxiv.org/abs/2406.14991

  11. [11]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URL https://arxiv.org/abs/2602.12430

  12. [12]

    Swe-skills-bench: Do agent skills actually help in real-world software engineering?, 2026

    Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. Swe-skills-bench: Do agent skills actually help in real-world software engineering?, 2026. URL https://arxiv.org/abs/2603.15401

  13. [13]

    Evoskill: Automated skill discovery for multi-agent systems

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL https://arxiv.org/abs/2603.02766

  14. [14]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/abs/2504.07079

  15. [15]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291

  16. [16]

    Reflexion: Language agents with verbal reinforcement learning, 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

  17. [17]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory, 2026. URL https://arxiv.org/abs/2509.25140

  18. [18]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026. URL https://arxiv.org/abs/2602.08234

  19. [19]

    Agentops: Enabling observability of llm agents

    Liming Dong, Qinghua Lu, and Liming Zhu. Agentops: Enabling observability of llm agents, 2024. URL https://arxiv.org/abs/2411.05285

  20. [20]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance, 2023. URL https://arxiv.org/abs/2305.05176

  21. [21]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2025. URL https://arxiv.org/abs/2406.18665

  22. [22]

    Puppygraph: Query graph on data lakes, 2024

    PuppyGraph. Puppygraph: Query graph on data lakes, 2024. URL https://www.puppygraph.com/. Accessed: 2026-04-25

  23. [23]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155

  24. [24]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024. URL https://arxiv.org/abs/2308.00352

  25. [25]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society, 2023. URL https://arxiv.org/abs/2303.17760

  26. [26]

    L-mars: Legal multi-agent workflow with orchestrated reasoning and agentic search

    Ziqi Wang and Boqin Yuan. L-mars: Legal multi-agent workflow with orchestrated reasoning and agentic search.

  27. [27]

    URL https://arxiv.org/abs/2509.00761

  28. [28]

    When a cell is marked pending, compute it before ending the session

    Hamel Husain and Shreya Shankar. LLM evals: Everything you need to know, January 2026. URL https://hamel.dev/blog/posts/evals-faq/. Accessed: 2026-04-24. Appendix A ClawTrace Platform Demo: ClawTrace (https://www.clawtrace.ai/) is an observability and optimization platform for OpenClaw agents. Its goal is to make agents better, cheaper, and faster by g...

  29. [29]

    Call inspect_mismatches to identify the failure surface

  30. [30]

    Trace the failure to a specific agent decision or missing step in the TraceCard

  31. [31]

    If needed, call read_gold_snippet to confirm the expected output

  32. [32]

    Output format: JSON with fields action (repair), rule, failure_type, evidence, confidence

    Call final_patch with the repair rule grounded in the observed failure and oracle evidence. Output format: JSON with fields action (repair), rule, failure_type, evidence, confidence. Constraint: If you cannot diagnose the failure within 3 tool calls, emit a low-confidence patch. The merge step will deprioritize it. C.3 Merge Operator Prompt: Merge Operator System...

  33. [33]

    Repair patches with causal diagnosis (highest)

  34. [34]

    Prune patches with a named cost target and counterfactual

  35. [35]

    Preserve patches that appear in two or more trajectories

  36. [36]

    sb-task-47484

    Singleton preserve patches (drop these). Conflict resolution: When two patches target the same behavior, repair supersedes prune, which supersedes preserve. When two patches of the same type conflict, keep the one with stronger evidence. Output structure: The merged skill must have exactly five sections: 1. Trigger: when the skill applies. 2. Workflow: step-b...
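
The conflict-resolution rule quoted in the anchors above is concrete enough to sketch: repair supersedes prune, which supersedes preserve, and ties within a type go to the patch with stronger evidence. A guess at the merge operator's core, not the released code, with evidence strength reduced to a single confidence score:

    PRECEDENCE = {"repair": 3, "prune": 2, "preserve": 1}

    def resolve_conflict(patches: list[dict]) -> dict:
        """Among patches targeting the same behavior, keep the winner:
        higher patch type first, then stronger evidence (confidence)."""
        return max(patches, key=lambda p: (PRECEDENCE[p["action"]],
                                           p["confidence"]))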