Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun

arXiv: 2510.04618 · v3 · submitted 2025-10-06 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-12 16:40 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords context adaptation · LLM agents · context engineering · agent memory · execution feedback · playbook evolution · self-improving systems · domain-specific reasoning


The pith

Treating contexts as evolving playbooks through generation, reflection, and curation lets LLMs improve their own performance on agent and reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ACE as a way to adapt inputs to language models without changing their weights. Instead of rewriting contexts from scratch, which often drops details or erodes knowledge over time, the method builds playbooks that accumulate strategies step by step. The process uses three linked steps to generate new material, reflect on what worked, and curate what to keep. It applies both before deployment and during live use, relying on the model's own task outcomes rather than external labels. This matters because many current LLM applications depend on careful input design, and a reliable way to grow that input over time could make systems more capable at lower cost.

Core claim

ACE treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. This prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model.

What carries the argument

The modular generation-reflection-curation process that turns contexts into accumulating playbooks.
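
As a rough illustration of the generation-reflection-curation loop, here is a minimal Python sketch. The abstract does not specify data structures or prompts, so everything in it is a hypothetical stand-in: the `Playbook` class, the `generate`, `reflect`, `curate`, and `adapt` functions, and the delta format are assumptions, and deterministic stubs replace the LLM calls. The only property it tries to show is that curation emits a small delta that is merged into the playbook, instead of rewriting the whole context.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Playbook:
    """Hypothetical playbook: keyed strategy entries that are edited in place."""
    entries: dict = field(default_factory=dict)

    def apply_delta(self, delta: dict) -> None:
        # Incremental update: only keys named in the delta are touched.
        # A value of None removes the entry; any other value adds or replaces it.
        for key, text in delta.items():
            value: Optional[str] = text
            if value is None:
                self.entries.pop(key, None)
            else:
                self.entries[key] = value

    def render(self) -> str:
        # Flatten the playbook into text that could be placed in a prompt.
        return "\n".join(f"- [{k}] {v}" for k, v in sorted(self.entries.items()))


def generate(task: str, playbook: Playbook) -> dict:
    # Stub rollout (assumption): succeeds only if the playbook has a tip for the task.
    return {"task": task, "success": task in playbook.entries}


def reflect(trace: dict) -> list:
    # Stub reflection (assumption): a failed trace yields one candidate lesson.
    if trace["success"]:
        return []
    return [(trace["task"], f"lesson from failing '{trace['task']}'")]


def curate(lessons: list, playbook: Playbook) -> dict:
    # Stub curation (assumption): keep lessons that are not already stored.
    return {k: v for k, v in lessons if k not in playbook.entries}


def adapt(tasks: list, playbook: Playbook) -> list:
    # Uses execution outcomes only; no labels are consulted.
    outcomes = []
    for task in tasks:
        trace = generate(task, playbook)
        outcomes.append(trace["success"])
        playbook.apply_delta(curate(reflect(trace), playbook))
    return outcomes


pb = Playbook()
outcomes = adapt(["login", "pay", "login"], pb)
print(outcomes)
print(pb.render())
```

In this toy run the second attempt at the same task succeeds because the lesson from the first failure was kept; a real system would replace each stub with a model call.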

If this is right

Contexts can be refined both as one-time system prompts and as ongoing agent memory stores.
Adaptation works from execution outcomes alone, removing the need for curated training examples.
Lower latency and rollout cost accompany the accuracy improvements on agent and finance tasks.
Smaller open-source models reach parity with larger production agents on hard splits of agent benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The playbook structure may support multi-session tasks where strategies must carry over days or weeks of interaction.
Similar incremental curation could reduce the frequency of full model retraining in deployed applications.
If the reflection step generalizes, the method might extend to domains where feedback is noisier than in current benchmarks.

Load-bearing premise

The generation, reflection, and curation steps can be executed without introducing biases or overhead that cancel out the gains in accuracy and speed.

What would settle it

A controlled run on a long-horizon task where repeated playbook updates cause loss of specific facts or where final performance falls below the no-update baseline.
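
One way such a check could be set up is sketched below in Python. It is an assumption-laden illustration, not a protocol from the paper: `retention` and `first_loss` are hypothetical helpers, the probe facts and context snapshots are invented placeholders, and verbatim substring matching is a crude proxy for whether a fact survived an update.

```python
def retention(facts, context):
    """Fraction of probe facts that still appear verbatim in the context."""
    if not facts:
        return 1.0
    return sum(fact in context for fact in facts) / len(facts)


def first_loss(facts, snapshots, floor=1.0):
    """Index of the first snapshot whose retention falls below `floor`, else None."""
    for index, context in enumerate(snapshots):
        if retention(facts, context) < floor:
            return index
    return None


# Placeholder probe facts and context snapshots (illustrative only).
facts = ["API key rotates daily", "use page_size=50"]
snapshots = [
    "API key rotates daily; use page_size=50",
    "API key rotates daily; use page_size=50; retry on 429",
    "Follow API best practices",  # a full rewrite that dropped both facts
]

print(first_loss(facts, snapshots))
```

Pairing a retention curve like this with task accuracy against a no-update baseline would cover both failure modes named above.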

Original abstract

Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

ACE gives a practical way to evolve LLM contexts for better agent performance, but the efficiency claims need checking against total inference costs from all process steps.


The punchline is that this paper gives a practical playbook for making LLM contexts self-improving without retraining, and the reported results on agents and finance tasks are worth a look if the methods hold up.

What is new is the ACE framework's three-step process of generation, reflection, and curation to accumulate strategies incrementally. This directly targets brevity bias and context collapse by keeping detailed knowledge instead of summarizing it away. It applies both to static prompts and dynamic agent memory, using natural execution feedback for adaptation without labels. The paper does well in showing concrete gains: 10.6% on agent benchmarks, 8.6% on finance, and matching or beating top agents on AppWorld with a smaller open-source model. The idea of treating contexts as evolving playbooks that scale with long-context models is a solid extension of prior adaptation work.

The soft spots are in the evaluation details and the efficiency accounting. The abstract mentions outperforming baselines and reducing latency and cost, but without specifics on what the baselines are, how many runs, or statistical significance, it's difficult to gauge how robust the improvements are. More importantly, the stress-test concern holds: the modular process involves multiple LLM invocations per update, yet there's no reported total inference cost that includes generation, reflection, and curation. If those extra calls add significant overhead, the claimed reductions in adaptation latency and rollout cost could be overstated. The full paper might address this, but based on the summary, it remains an assumption.

This work is for people focused on agentic systems and efficient LLM deployment in specific domains. A practitioner or researcher dealing with context management in production agents would get value from the structured approach and the unsupervised adaptation angle. It deserves a serious referee because the claims are specific and benchmark-based, making them checkable, even if revisions for more rigorous cost analysis and experimental transparency are needed. Recommendation: Yes, send it to peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the ACE (Agentic Context Engineering) framework, which treats contexts as evolving playbooks updated through a modular generation-reflection-curation process to mitigate brevity bias and context collapse in LLM applications. It reports empirical gains of +10.6% on agent benchmarks and +8.6% on finance tasks, reduced adaptation latency and rollout costs, effective unsupervised adaptation via natural execution feedback, and competitive AppWorld leaderboard performance matching or exceeding a top production agent on key splits despite using a smaller open-source model.

Significance. If the reported gains hold under rigorous scrutiny, the work would be significant for advancing context-based self-improvement in LLMs, offering a scalable alternative to fine-tuning that preserves detailed knowledge and leverages long-context capabilities. The unsupervised adaptation aspect and efficiency claims could influence agent design and domain-specific reasoning systems.

major comments (2)

[Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.
[Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.
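
The cost breakdown requested in the first major comment could be tabulated along these lines. This Python sketch is only an illustration of the accounting: the `Call` record, the `summarize` and `net_token_saving` helpers, and every number in the example are hypothetical placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One LLM invocation (hypothetical log record)."""
    stage: str        # e.g. "generation", "reflection", "curation"
    tokens_in: int
    tokens_out: int
    seconds: float


def summarize(calls):
    """Aggregate token counts, wall-clock time, and call counts per stage."""
    totals = {}
    for call in calls:
        row = totals.setdefault(call.stage, {"tokens": 0, "seconds": 0.0, "calls": 0})
        row["tokens"] += call.tokens_in + call.tokens_out
        row["seconds"] += call.seconds
        row["calls"] += 1
    return totals


def net_token_saving(method_calls, baseline_calls):
    """Baseline tokens minus method tokens; positive means the method is cheaper."""
    def total(calls):
        return sum(c.tokens_in + c.tokens_out for c in calls)
    return total(baseline_calls) - total(method_calls)


# Placeholder numbers, chosen only to exercise the functions.
ace_calls = [
    Call("generation", 1000, 200, 2.0),
    Call("reflection", 500, 100, 1.0),
    Call("curation", 300, 50, 0.5),
]
baseline_calls = [Call("rewrite", 2000, 500, 4.0)]

print(summarize(ace_calls))
print(net_token_saving(ace_calls, baseline_calls))
```

Counting every call in all three stages is the point: a per-update saving only holds if the reflection and curation calls are included in the method's total.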

minor comments (1)

[Abstract] Abstract: The phrase 'significantly reducing' is used without any quantitative measure of the latency or cost reductions, which reduces clarity on the magnitude of the efficiency benefit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving clarity and rigor in presenting the ACE framework's efficiency and empirical results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.


Referee: [Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.

Authors: We agree that the abstract's efficiency claims would be stronger with explicit supporting data. The current manuscript does not include a per-stage breakdown of token counts, wall-clock times, or aggregate inference costs across the generation-reflection-curation pipeline. In the revised version, we will add a dedicated analysis (likely in Section 4 or an appendix) reporting these metrics for ACE versus baselines, including all LLM calls, to demonstrate net savings. This data was collected during our experiments and can be presented without changing the core findings. revision: yes
Referee: [Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.

Authors: We acknowledge that the manuscript would benefit from expanded methodological transparency to allow full assessment of the reported gains. While the full text describes the benchmarks, key baselines (e.g., standard prompting, iterative rewriting methods, and production agents), and evaluation protocols, we will revise the Experiments section to explicitly detail: the complete list of baselines with implementation references, number of runs (with seeds), statistical tests (e.g., significance levels and variance), standard deviations, and implementation specifics such as model versions, hyperparameters, and prompt structures. This will strengthen reproducibility and the link between data and claims. revision: yes
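
For the run-level reporting promised in this response, a minimal Python sketch of the arithmetic might look as follows. The helper names `summarize_runs` and `paired_gain` are hypothetical, the scores are placeholders and not results from the paper, and the paired standard error is one simple choice among several possible significance analyses.

```python
import math
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation over per-seed scores."""
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": statistics.mean(scores), "std": std}


def paired_gain(method, baseline):
    """Mean per-seed gain and its standard error, for runs paired by seed."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [m - b for m, b in zip(method, baseline)]
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return {"mean_gain": statistics.mean(diffs), "sem": sem}


# Placeholder accuracies for three seeds (illustrative only).
method_scores = [0.6, 0.7, 0.8]
baseline_scores = [0.5, 0.5, 0.5]

summary = summarize_runs(method_scores)
gain = paired_gain(method_scores, baseline_scores)
print(summary)
print(gain)
```

Reporting the gain together with its spread across seeds is what lets a reader judge whether a headline improvement exceeds run-to-run noise.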

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks


The paper presents ACE as a modular generation-reflection-curation process for evolving contexts and reports performance gains (+10.6% on agents, +8.6% on finance) plus efficiency improvements solely through comparisons to external baselines and leaderboards such as AppWorld. No derivation chain, equations, fitted parameters, or first-principles results are claimed; the central claims rest on standard empirical evaluation rather than any self-definition, renamed known result, or self-citation that reduces the outcome to the framework's own inputs. The absence of mathematical modeling or predictive steps derived from the method itself makes the reported results independent of internal circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only abstract available; no explicit free parameters, axioms, or invented entities beyond the framework name itself are stated. The central claim rests on the unverified assumption that the described modular process produces the reported gains.

invented entities (1)

ACE framework · no independent evidence
purpose: Evolving contexts as playbooks via generation, reflection, and curation
Newly introduced method whose effectiveness is asserted via benchmark results in the abstract.

pith-pipeline@v0.9.0 · 5592 in / 1278 out tokens · 57678 ms · 2026-05-12T16:40:02.866680+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Paper passage: ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models.

IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · unclear
Paper passage: Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 accept novelty 8.0

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 unverdicted novelty 8.0

SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
From Context to Skills: Can Language Models Learn from Context Skillfully?
cs.AI 2026-04 unverdicted novelty 8.0

Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
cs.AI 2026-05 unverdicted novelty 7.0

DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
EXG: Self-Evolving Agents with Experience Graphs
cs.AI 2026-05 unverdicted novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
cs.CL 2026-05 unverdicted novelty 7.0

PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
cs.AI 2026-04 unverdicted novelty 7.0

Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
Meta-Harness: End-to-End Optimization of Model Harnesses
cs.AI 2026-03 unverdicted novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
cs.HC 2026-02 unverdicted novelty 7.0

EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
cs.CV 2026-01 conditional novelty 7.0

VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
PACE: Two-Timescale Self-Evolution for Small Language Model Agents
cs.LG 2026-05 unverdicted novelty 6.0

PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking
cs.AI 2026-05 unverdicted novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
cs.CV 2026-05 unverdicted novelty 6.0

EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Life-Harness evolves reusable runtime interventions from training failures to improve frozen LLM agents by 88.5% on average across 126 settings in seven deterministic environments while transferring across 18 model backbones.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
SkillEvolver: Skill Learning as a Meta-Skill
cs.AI 2026-05 unverdicted novelty 6.0

A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
cs.LG 2026-05 unverdicted novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
cs.CL 2026-04 unverdicted novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
cs.AI 2026-04 unverdicted novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
cs.AI 2026-04 unverdicted novelty 6.0

ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
cs.AI 2026-04 unverdicted novelty 6.0

SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
cs.CL 2026-04 unverdicted novelty 6.0

AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB 2026-04 unverdicted novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
Configuring Agentic AI Coding Tools: An Exploratory Study
cs.SE 2026-02 unverdicted novelty 6.0

Developers overwhelmingly rely on simple static context files such as AGENTS.md to configure agentic AI coding tools, while advanced mechanisms like skills and subagents see very low adoption.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
cs.AI 2025-11 unverdicted novelty 6.0

ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents
cs.LG 2026-05 unverdicted novelty 5.0

APEX maintains an explicit strategy space via a DAG with fork discovery and policy selection to sustain exploration in self-evolving LLM agents and reports outperformance on Jericho games and WebArena.
Code as Agent Harness
cs.CL 2026-05 accept novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
cs.AI 2026-05 unverdicted novelty 5.0

FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on ...
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
cs.AI 2026-05 conditional novelty 5.0

In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
cs.CV 2026-05 unverdicted novelty 5.0

MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 5.0

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.
Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV
cs.NI 2026-05 unverdicted novelty 5.0

A joint optimization approach using SOCP for UAV trajectories, DRL-LLM for resource scheduling, and LP for offloading achieves higher task success rates and system efficiency than multi-agent RL baselines in simulated...
AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
cs.IR 2026-04 unverdicted novelty 5.0

AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
cs.SE 2026-04 unverdicted novelty 5.0

Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
cs.AI 2026-04 unverdicted novelty 5.0

Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
cs.AI 2026-04 unverdicted novelty 5.0

Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems
cs.CY 2026-04 unverdicted novelty 5.0

A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
cs.IR 2026-04 unverdicted novelty 4.0

AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.
Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants
cs.SE 2026-04 unverdicted novelty 4.0

Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.