Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Bing Zhu; Guanghui Wang; Peiyang He; Wei Qiu; Xing Zhang; Yanwei Cui; Ziyuan Li

arXiv: 2605.19576 · v1 · pith:4ZCHFOXRnew · submitted 2026-05-19 · 💻 cs.AI · cs.CL · cs.SE

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Pith reviewed 2026-05-20 06:04 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.SE

keywords library drift · self-evolving agents · LLM skill libraries · outcome-driven retirement · skill governance · MBPP benchmark · performance stagnation · trace diagnostics

The pith

Self-evolving LLM skill libraries accumulate skills without management and stagnate, but a minimal governance recipe of outcome-driven retirement, bounded active sets, and meta-skill priors reverses the drift and raises held-out performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-evolving skill libraries for large language models encounter a silent failure called library drift, where skills pile up without outcome-based lifecycle controls and produce retrieval failures, false injections, and flat task performance. A sympathetic reader would care because the same libraries that should improve agents over time instead stop helping, while human-curated skills succeed. The authors isolate the cause with controlled ablations and show that three simple governance rules together restore steady improvement on held-out coding tasks across many rounds of evolution.

Core claim

Library drift is the process in which unbounded skill accumulation without outcome-driven lifecycle management produces retrieval degradation, false-positive injections, and performance stagnation. The authors isolate it with two ablations, one that disables skill injection and one that forces premature retirement. They then show that combining outcome-driven retirement, a bounded active-cap, and a meta-skill authoring prior reverses the effect, raising held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 over 100 rounds on MBPP+ hard-100.

What carries the argument

The argument rests on library drift itself. It is made visible by an append-only evidence log that records per-skill contribution scores, attribution verdicts, and router engagement metrics, and it is corrected by a three-part governance recipe: outcome-driven retirement, a bounded active-cap, and a meta-skill authoring prior.
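An append-only evidence log of this kind could look roughly like the sketch below. This is an illustration only, not the paper's implementation: the `Evidence` fields, the verdict labels (`helped`, `neutral`, `harmed`), and the definition of the contribution score as a mean verdict weight are all assumptions, since the abstract does not specify them. Router engagement metrics are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical verdict labels; the source does not list the Critic's vocabulary.
VERDICT_WEIGHT = {"helped": 1.0, "neutral": 0.0, "harmed": -1.0}


@dataclass(frozen=True)
class Evidence:
    """One immutable log entry: a skill was injected into one task attempt."""
    round_id: int
    task_id: str
    skill_id: str
    verdict: str  # one of VERDICT_WEIGHT's keys
    passed: bool  # grader outcome for the attempt


class EvidenceLog:
    """Append-only store: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[Evidence] = []

    def append(self, entry: Evidence) -> None:
        if entry.verdict not in VERDICT_WEIGHT:
            raise ValueError(f"unknown verdict: {entry.verdict!r}")
        self._entries.append(entry)

    def entries(self) -> tuple[Evidence, ...]:
        return tuple(self._entries)  # read-only snapshot

    def contribution(self, skill_id: str) -> tuple[int, float]:
        """Return (trial count, mean verdict weight) for one skill.

        Assumed definition of the contribution score, for illustration only.
        """
        weights = [VERDICT_WEIGHT[e.verdict]
                   for e in self._entries if e.skill_id == skill_id]
        if not weights:
            return 0, 0.0
        return len(weights), sum(weights) / len(weights)


# Made-up entries for a single skill "s1".
log = EvidenceLog()
log.append(Evidence(1, "t1", "s1", "helped", True))
log.append(Evidence(1, "t2", "s1", "harmed", False))
log.append(Evidence(2, "t3", "s1", "harmed", False))
n_trials, score = log.contribution("s1")
```

Because entries are never rewritten, the per-skill score can be recomputed at any round from the raw history, which is what would let a reviewer audit a retirement decision after the fact.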

If this is right

Disabling skill injection produces a flat performance floor while premature retirement actively lowers scores, confirming that drift is the active mechanism.
Eight separate ablations show that outcome-driven retirement and the active-cap are load-bearing while certain other controls are subsumed.
The governance recipe produces a rolling gain of +0.328 on held-out tasks and sustains improvement across the full 100-round window.
Trace-level diagnostics make the onset of drift detectable before end-task scores degrade.
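A late-window mean and its gain over a baseline can be computed as in the sketch below. The window length and the per-round scores are made up for illustration; the abstract does not state the window the authors use. The plain difference of the two reported figures is 0.584 − 0.258 = 0.326, so the reported +0.328 rolling gain presumably follows a slightly different averaging that the abstract does not spell out.

```python
from statistics import fmean


def late_window_mean(scores: list[float], window: int) -> float:
    """Mean pass@1 over the last `window` rounds."""
    if not 0 < window <= len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    return fmean(scores[-window:])


def gain_over_baseline(scores: list[float], baseline: float, window: int) -> float:
    """Late-window mean minus a fixed baseline score."""
    return late_window_mean(scores, window) - baseline


# Made-up per-round held-out pass@1 values, not data from the paper.
scores = [0.2, 0.4, 0.6, 0.8]
late = late_window_mean(scores, window=2)        # averages 0.6 and 0.8
gain = gain_over_baseline(scores, baseline=0.258, window=2)
```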

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar drift patterns are likely to appear in any long-running agent that maintains an open-ended memory of learned procedures.
The same governance pattern could be tested on non-coding domains such as tool-use or planning benchmarks to check whether the three rules generalize.
Over longer horizons the bounded active-cap may need adaptive sizing rules that the current recipe does not yet specify.

Load-bearing premise

The ablations that turn off skill injection or force early retirement cleanly separate library drift from other performance factors without adding new confounds.

What would settle it

Running the same self-evolving loop on MBPP+ hard-100 for 100 rounds with the full governance recipe applied and observing no sustained rise above the 0.258 baseline, or seeing the trace-level contribution scores fail to improve attribution accuracy.

Figures

Figures reproduced from arXiv: 2605.19576 by Bing Zhu, Guanghui Wang, Peiyang He, Wei Qiu, Xing Zhang, Yanwei Cui, Ziyuan Li.

**Figure 1.** The Ratchet loop and where library drift is diagnosed and fixed. Inference (top): each task flows through Router→Solver→Grader→Capsule. Memory (middle): Skill Bank, Meta-Skill, and Evidence Log. Reflection (bottom): the Critic produces attribution verdicts (the diagnostic signal); the Curator retires under-performers and enforces the bounded cap (the fix). Without outcome-driven retirement and the bounded …

**Figure 2.** Held-out pass@1 by round (3-seed mean ±1 std). A1 (flat floor) and A4 (below floor) exhibit library drift. A5/A6 (relaxed dedup) slightly exceed the Default; the meta-skill subsumes explicit filtering. A7 (doubled cap) shows comparable mean but higher variance. A8 (meta-synth refresh) matches A5/A6 gains but at 55% more wall time (10.1 h vs. 6.5 h). … but not to actively harm, confirming that the evidence flo…

Original abstract

Self-evolving skill libraries face a silent failure mode we term "library drift": unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom: LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench). Yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, −0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper names library drift as a distinct failure mode in self-evolving skill libraries and shows a governance recipe that raises late-window pass@1 on MBPP+ hard-100, but the ablations risk confounding retrieval statistics with the claimed mechanism.

The core takeaway is that unbounded skill accumulation without outcome-driven retirement creates retrieval degradation and stalls performance in LLM agent libraries. The authors isolate this with two trigger ablations and then demonstrate a minimal fix—outcome-driven retirement plus an active-cap plus meta-skill authoring—that moves held-out pass@1 from 0.258 to a late-window mean of 0.584 over 100 rounds on MBPP+ hard-100. That lift is the main empirical result worth noting.

Referee Report

2 major / 2 minor

Summary. The paper claims that self-evolving LLM skill libraries suffer from a silent failure mode termed 'library drift' caused by unbounded skill accumulation without outcome-driven lifecycle management, resulting in retrieval degradation, false-positive injections, and performance stagnation. It isolates the mechanism via two trigger ablations (disabling skill injection yields a flat +0.002 gain; premature retirement causes -0.019 harm), supplies trace-level diagnostics including an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics, and verifies a minimal governance fix (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that raises held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds, with eight ablations decomposing the load-bearing components.

Significance. If the results and isolation hold, the work supplies a concrete, reproducible playbook for diagnosing and mitigating library drift in self-evolving agents, including trace diagnostics that surface the issue before end-task degradation and a governance recipe with quantified gains. The explicit ablation decomposition and provision of contribution scores/router metrics are strengths that could make the mechanism observable and falsifiable for the broader LLM-agent community.

major comments (2)

[Ablations (abstract and § on trigger ablations)] The ablation disabling skill injection (described in the abstract) necessarily shrinks the skill pool available to the router and therefore changes retrieval hit rates and active-set statistics; the manuscript gives no indication that library sizes or engagement metrics were matched (e.g., by padding with neutral skills or fixing k), so the reported +0.002 flat floor may reflect altered retrieval difficulty rather than clean isolation of drift.
[Ablations (abstract and § on trigger ablations)] The premature-retirement ablation (abstract) alters active-set cardinality and therefore baseline retrieval difficulty; without explicit controls that hold active library size and engagement statistics fixed across conditions, the reported -0.019 harm cannot be unambiguously attributed to the absence of outcome-driven lifecycle management rather than to the secondary change in retrieval setup.

minor comments (2)

The abstract refers to 'eight ablations' that decompose governance mechanisms; a summary table listing each ablation, its effect size, and which component it disables would improve readability.
Contribution scores and router engagement metrics are central to the trace diagnostics; clarify whether these quantities are computed from the same outcome data used to measure final pass@1 or from an independent log.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our ablation design. The concerns about potential confounding from changes in library size and retrieval statistics are well-taken, and we address each point below with plans for revision.

Referee: [Ablations (abstract and § on trigger ablations)] The ablation disabling skill injection (described in the abstract) necessarily shrinks the skill pool available to the router and therefore changes retrieval hit rates and active-set statistics; the manuscript gives no indication that library sizes or engagement metrics were matched (e.g., by padding with neutral skills or fixing k), so the reported +0.002 flat floor may reflect altered retrieval difficulty rather than clean isolation of drift.

Authors: We agree that the skill-injection ablation alters pool size and could affect retrieval difficulty. The original intent was to show that performance plateaus without new injections, isolating the contribution of ongoing skill addition to drift. To strengthen the isolation, we will revise the manuscript to include a size-matched control condition (padding the disabled-injection library with neutral placeholder skills to hold active-set cardinality and k fixed) and report the corresponding router engagement, hit-rate, and contribution-score statistics across conditions.
revision: yes

Referee: [Ablations (abstract and § on trigger ablations)] The premature-retirement ablation (abstract) alters active-set cardinality and therefore baseline retrieval difficulty; without explicit controls that hold active library size and engagement statistics fixed across conditions, the reported -0.019 harm cannot be unambiguously attributed to the absence of outcome-driven lifecycle management rather than to the secondary change in retrieval setup.

Authors: We concur that the premature-retirement ablation changes active-set size and may influence retrieval baselines. The ablation was designed to demonstrate active harm from retiring skills without outcome-driven criteria. In revision we will add a matched-cardinality control (e.g., retiring skills but immediately replacing them with neutral fillers to preserve active-set size) and supply the full set of engagement metrics, hit rates, and per-skill contribution scores to confirm that the observed harm is attributable to the lifecycle mechanism rather than retrieval changes.
revision: yes
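The matched-cardinality control proposed in both responses could be set up along the lines of the sketch below. It is only an illustration of the idea: the function name `pad_to_size`, the `neutral-` filler naming, and the representation of skills as string identifiers are hypothetical and not taken from the paper.

```python
def pad_to_size(active: list[str], target_size: int,
                filler_prefix: str = "neutral-") -> list[str]:
    """Pad an active skill set with placeholder skills up to `target_size`.

    Holding the active-set size fixed across conditions means the router
    retrieves from pools of the same cardinality in every ablation arm.
    """
    if len(active) > target_size:
        raise ValueError("active set already exceeds target size")
    n_fillers = target_size - len(active)
    fillers = [f"{filler_prefix}{i}" for i in range(n_fillers)]
    return list(active) + fillers


# Made-up example: an ablation arm left with two skills, matched to a pool of four.
padded = pad_to_size(["parse_ints", "two_pointer_scan"], target_size=4)
```

Whether a filler is truly neutral is itself an empirical question; the fillers would need to be checked for near-zero contribution in the same evidence log before the control can be trusted.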

Circularity Check

0 steps flagged

No significant circularity: claims rest on held-out benchmarks and component ablations

The derivation chain relies on direct measurement of pass@1 on MBPP+ hard-100, ablations that disable injection or impose retirement, and trace logs of contribution scores and router metrics. These are presented as observable diagnostics and external-task results rather than quantities defined in terms of the target performance lift. No equations or definitions reduce the reported gains (+0.328 rolling) or ablation deltas to the same fitted inputs by construction. The paper supplies independent checks via held-out evaluation and controlled component removal, satisfying the criteria for a self-contained empirical argument.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 1 invented entities

Review based on abstract only; full paper may contain additional fitted thresholds or modeling choices not visible here.

free parameters (1)

active-cap size
Bound on number of active skills; value not stated in abstract but required for the governance recipe.
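A bounded active set can be enforced with a simple rank-and-truncate step, sketched below. The ranking criterion (contribution score, ties broken by skill name) and the function name `enforce_active_cap` are assumptions made for illustration; the abstract states only that a cap exists, not how the Curator decides which skills stay active.

```python
def enforce_active_cap(contributions: dict[str, float],
                       cap: int) -> tuple[list[str], list[str]]:
    """Split skills into (active, benched) so that at most `cap` stay active.

    Skills are ranked by contribution score, highest first; ties are broken
    alphabetically so the result is deterministic.
    """
    if cap < 0:
        raise ValueError("cap must be non-negative")
    ranked = sorted(contributions, key=lambda s: (-contributions[s], s))
    return ranked[:cap], ranked[cap:]


# Made-up scores; the cap value 2 is illustrative, not the paper's setting.
active, benched = enforce_active_cap({"a": 0.3, "b": -0.1, "c": 0.2}, cap=2)
```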

invented entities (1)

library drift · no independent evidence
purpose: Names the silent failure mode of unbounded skill accumulation without outcome-driven lifecycle management
New term introduced to label the observed degradation; no independent evidence outside the paper's own experiments.

pith-pipeline@v0.9.0 · 5786 in / 1101 out tokens · 50788 ms · 2026-05-20T06:04:03.776906+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Library drift occurs when a self-evolving skill library's accumulated artifacts reduce the agent's expected performance below its no-skill baseline... $\mathbb{E}[\mathrm{pass@1} \mid S_t] < \mathbb{E}[p_0]$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

A skill is retired once $n(s) \ge N_{\min}$ trials have accumulated and the empirical contribution $\hat{c}(s) \le -\tau$
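The retirement condition quoted above translates directly into a two-part predicate. In the sketch below the threshold values for `n_min` and `tau` are illustrative placeholders; the quoted passage does not give the paper's settings.

```python
def should_retire(n_trials: int, contribution: float,
                  n_min: int, tau: float) -> bool:
    """Retire a skill once n(s) >= N_min and c_hat(s) <= -tau.

    The trial-count guard keeps a skill from being retired on too little
    evidence; the margin tau requires clearly negative contribution.
    """
    return n_trials >= n_min and contribution <= -tau


# Illustrative thresholds only (not the paper's values).
enough_and_harmful = should_retire(10, -0.2, n_min=5, tau=0.1)
too_few_trials = should_retire(3, -0.9, n_min=5, tau=0.1)
only_mildly_negative = should_retire(10, -0.05, n_min=5, tau=0.1)
```

Under these placeholder thresholds only the first case triggers retirement: the second lacks enough trials and the third is not negative enough.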

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 8 internal anchors

[1] Chen, M., Li, Y., Yang, Y., Yu, S., Lin, B., and He, X. AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems, volume 37, 2024.

[2] Huang, X., Chen, J., Fei, Y., Li, Z., Schwaller, P., and Ceder, G. CASCADE: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025.

[3] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.

[4] Li, X., Chen, W., Liu, Y., Zheng, S., Chen, X., He, Y., Li, Y., You, B., Shen, H., Sun, J., et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.

[5] Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2023.

[6] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2023.

[7] Ni, J., Liu, Y., Liu, X., Sun, Y., Zhou, M., Cheng, P., Wang, D., Jiang, X., and Jiang, G. Trace2Skill: Parallel inductive skill distillation for LLM agents. arXiv preprint arXiv:2603.25158, 2026.

[8] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2023.

[9] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.

[10] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[11] Wang, J., Ren, Y., and Zhang, H. From procedural skills to strategy genes: Towards experience-driven test-time evolution. arXiv preprint arXiv:2604.15097, 2026.

[12] Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.

[13] Wu, R., Wang, X., Mei, J., Cai, P., Fu, D., Yang, C., Wen, L., Yang, X., Shen, Y., Wang, Y., et al. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.

[14] Yang, Y., Li, J., Pan, Q., Zhan, B., Cai, Y., Du, L., Zhou, J., Chen, K., Chen, Q., Li, X., et al. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.

[15] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.

[16] Zhang, X., Wang, G., Cui, Y., Qiu, W., Li, Z., Zhu, B., and He, P. Experience compression spectrum: Unifying memory, skills, and rules in LLM agents. arXiv preprint arXiv:2604.15877, 2026a.

[17] Zhang, X., Wang, G., Cui, Y., Qiu, W., Li, Z., Zhu, B., and He, P. Do agent rules shape or distort? Guardrails beat guidance in coding agents. arXiv preprint arXiv:2604.11088, 2026b.

[18] Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y.-J., and Huang, G. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

[19] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
