Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Pith reviewed 2026-05-12 16:40 UTC · model grok-4.3
The pith
Treating contexts as evolving playbooks through generation, reflection, and curation lets LLMs improve their own performance on agent and reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ACE treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. This prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines while significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without labeled supervision by leveraging natural execution feedback instead. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model.
What carries the argument
The modular generation-reflection-curation process that turns contexts into accumulating playbooks.
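The loop named above can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the `Playbook` class, `adapt_step`, and the three `stub_*` functions are hypothetical names, and the stubs stand in for what would be LLM calls. The one property the sketch tries to show is that updates are appended as small itemized deltas, so earlier entries are never rewritten.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Playbook:
    """Context kept as itemized entries rather than one monolithic prompt."""
    entries: Dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def add(self, text: str) -> int:
        # Exact-match de-duplication only (an assumption of this sketch).
        for entry_id, existing in self.entries.items():
            if existing == text:
                return entry_id
        entry_id = self.next_id
        self.entries[entry_id] = text
        self.next_id += 1
        return entry_id

    def render(self) -> str:
        # Serialize entries in insertion order for use as model context.
        return "\n".join(f"[{i}] {t}" for i, t in sorted(self.entries.items()))


def adapt_step(
    playbook: Playbook,
    task: str,
    generate: Callable[[str, str], str],
    reflect: Callable[[str, str], List[str]],
    curate: Callable[[List[str], str], List[str]],
) -> None:
    """One generation -> reflection -> curation pass with append-only updates."""
    trajectory = generate(task, playbook.render())
    lessons = reflect(task, trajectory)
    for delta in curate(lessons, playbook.render()):
        playbook.add(delta)  # existing entries are left untouched


# Hypothetical stand-ins for the three LLM-backed roles.
def stub_generate(task: str, context: str) -> str:
    return f"attempted {task!r} with {len(context)} chars of context"


def stub_reflect(task: str, trajectory: str) -> List[str]:
    return [f"lesson from {task}"]


def stub_curate(lessons: List[str], context: str) -> List[str]:
    # Keep only lessons that are not already present in the playbook text.
    return [lesson for lesson in lessons if lesson not in context]


playbook = Playbook()
for task in ["login", "search", "login"]:
    adapt_step(playbook, task, stub_generate, stub_reflect, stub_curate)
```

Running the loop over a repeated task leaves the playbook with one entry per distinct lesson, which is the accumulate-without-rewriting behaviour the claim depends on.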
If this is right
- Contexts can be refined both as one-time system prompts and as ongoing agent memory stores.
- Adaptation works from execution outcomes alone, removing the need for curated training examples.
- Lower latency and rollout cost accompany the accuracy improvements on agent and finance tasks.
- Smaller open-source models reach parity with larger production agents on hard splits of agent benchmarks.
Where Pith is reading between the lines
- The playbook structure may support multi-session tasks where strategies must carry over days or weeks of interaction.
- Similar incremental curation could reduce the frequency of full model retraining in deployed applications.
- If the reflection step generalizes, the method might extend to domains where feedback is noisier than in current benchmarks.
Load-bearing premise
The generation, reflection, and curation steps can be executed without introducing biases or overhead that cancel out the gains in accuracy and speed.
What would settle it
A controlled run on a long-horizon task where repeated playbook updates cause loss of specific facts or where final performance falls below the no-update baseline.
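One way such a run could be instrumented is sketched below. Everything here is an assumption made for illustration: `retention`, `track_retention`, and `first_loss_step` are hypothetical helpers, verbatim substring matching is a crude proxy for "the fact is still available", and `truncating_rewrite` is a deliberately lossy stand-in for an iterative rewriter, not a model of any real baseline.

```python
from typing import Callable, List, Optional, Sequence


def retention(context: str, probes: Sequence[str]) -> float:
    """Fraction of probe facts that still appear verbatim in the context."""
    if not probes:
        return 1.0
    return sum(p in context for p in probes) / len(probes)


def track_retention(
    update: Callable[[str, str], str],
    initial: str,
    new_items: Sequence[str],
    probes: Sequence[str],
) -> List[float]:
    """Apply updates one at a time and record retention after each step."""
    context = initial
    history = [retention(context, probes)]
    for item in new_items:
        context = update(context, item)
        history.append(retention(context, probes))
    return history


def first_loss_step(history: Sequence[float]) -> Optional[int]:
    """Index of the first update after which retention drops below its start."""
    for step in range(1, len(history)):
        if history[step] < history[0]:
            return step
    return None


def append_update(context: str, item: str) -> str:
    # Incremental policy: new material is added, nothing is rewritten.
    return context + "\n" + item


def truncating_rewrite(context: str, item: str, budget: int = 30) -> str:
    # Lossy policy (assumption): keep only the most recent `budget` characters.
    return (context + "\n" + item)[-budget:]


initial = "API key lives in vault A"
items = ["use pagination on list calls", "retry on HTTP 429"]
probes = ["vault A"]

append_history = track_retention(append_update, initial, items, probes)
rewrite_history = track_retention(truncating_rewrite, initial, items, probes)
```

In this toy setup the append policy never loses the probe fact, while the truncating policy loses it on the first update. A real test would replace the substring check with task-level probes and compare against the no-update baseline.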
Original abstract
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the ACE (Agentic Context Engineering) framework, which treats contexts as evolving playbooks updated through a modular generation-reflection-curation process to mitigate brevity bias and context collapse in LLM applications. It reports empirical gains of +10.6% on agent benchmarks and +8.6% on finance tasks, reduced adaptation latency and rollout costs, effective unsupervised adaptation via natural execution feedback, and competitive AppWorld leaderboard performance matching or exceeding a top production agent on key splits despite using a smaller open-source model.
Significance. If the reported gains hold under rigorous scrutiny, the work would be significant for advancing context-based self-improvement in LLMs, offering a scalable alternative to fine-tuning that preserves detailed knowledge and leverages long-context capabilities. The unsupervised adaptation aspect and efficiency claims could influence agent design and domain-specific reasoning systems.
major comments (2)
- [Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.
- [Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.
minor comments (1)
- [Abstract] Abstract: The phrase 'significantly reducing' is used without any quantitative measure of the latency or cost reductions, which reduces clarity on the magnitude of the efficiency benefit.
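The per-stage breakdown the report asks for could be collected with a ledger along these lines. This is a sketch under assumptions: `StageCost`, `CostLedger`, and `relative_reduction` are hypothetical names, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not measurements from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict


@dataclass
class StageCost:
    calls: int = 0
    tokens: int = 0
    seconds: float = 0.0


class CostLedger:
    """Accumulates LLM-call cost per pipeline stage."""

    def __init__(self) -> None:
        self.stages: DefaultDict[str, StageCost] = defaultdict(StageCost)

    def record(self, stage: str, tokens: int, seconds: float) -> None:
        cost = self.stages[stage]
        cost.calls += 1
        cost.tokens += tokens
        cost.seconds += seconds

    def total_tokens(self) -> int:
        return sum(c.tokens for c in self.stages.values())

    def total_seconds(self) -> float:
        return sum(c.seconds for c in self.stages.values())


def relative_reduction(baseline: float, method: float) -> float:
    """Fractional saving of `method` relative to `baseline` (0.25 means 25% less)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - method) / baseline


# Placeholder numbers for illustration only.
ledger = CostLedger()
ledger.record("generation", tokens=1000, seconds=2.0)
ledger.record("reflection", tokens=400, seconds=1.0)
ledger.record("curation", tokens=100, seconds=0.5)

baseline_tokens = 2000
token_saving = relative_reduction(baseline_tokens, ledger.total_tokens())
```

Reporting the totals alongside the per-stage rows would let a reader check whether the reflection and curation calls eat into the claimed savings.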
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving clarity and rigor in presenting the ACE framework's efficiency and empirical results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.
Point-by-point responses
- Referee: [Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.
  Authors: We agree that the abstract's efficiency claims would be stronger with explicit supporting data. The current manuscript does not include a per-stage breakdown of token counts, wall-clock times, or aggregate inference costs across the generation-reflection-curation pipeline. In the revised version, we will add a dedicated analysis (likely in Section 4 or an appendix) reporting these metrics for ACE versus baselines, including all LLM calls, to demonstrate net savings. This data was collected during our experiments and can be presented without changing the core findings.
  Revision: yes
- Referee: [Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.
  Authors: We acknowledge that the manuscript would benefit from expanded methodological transparency to allow full assessment of the reported gains. While the full text describes the benchmarks, key baselines (e.g., standard prompting, iterative rewriting methods, and production agents), and evaluation protocols, we will revise the Experiments section to explicitly detail: the complete list of baselines with implementation references, number of runs (with seeds), statistical tests (e.g., significance levels and variance), standard deviations, and implementation specifics such as model versions, hyperparameters, and prompt structures. This will strengthen reproducibility and the link between data and claims.
  Revision: yes
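The run-level reporting promised in this response could look roughly like the sketch below. It is an assumption-laden illustration: `summarize` and `paired_sign_flip_pvalue` are hypothetical helpers, the sign-flip permutation test is one reasonable choice among several and is not stated in the source, and the score lists are placeholders rather than results from the paper.

```python
import random
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across runs (std is 0.0 for one run)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def paired_sign_flip_pvalue(
    method: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided Monte Carlo sign-flip test on per-seed paired differences."""
    if len(method) != len(baseline):
        raise ValueError("method and baseline need one score per shared seed")
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = abs(statistics.mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Placeholder per-seed accuracies for illustration only.
method_scores = [0.60, 0.70, 0.65, 0.72, 0.68]
baseline_scores = [0.50, 0.55, 0.52, 0.60, 0.50]

method_mean, method_std = summarize(method_scores)
p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
```

With only five seeds the smallest two-sided p-value this test can reach is about 2/32, which is itself an argument for reporting the number of runs next to any significance claim.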
Circularity Check
No circularity: empirical framework evaluated on external benchmarks
Full rationale
The paper presents ACE as a modular generation-reflection-curation process for evolving contexts and reports performance gains (+10.6% on agents, +8.6% on finance) plus efficiency improvements solely through comparisons to external baselines and leaderboards such as AppWorld. No derivation chain, equations, fitted parameters, or first-principles results are claimed; the central claims rest on standard empirical evaluation rather than any self-definition, renamed known result, or self-citation that reduces the outcome to the framework's own inputs. The absence of mathematical modeling or predictive steps derived from the method itself makes the reported results independent of internal circular construction.
Axiom & Free-Parameter Ledger
invented entities (1)
- ACE framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models."
- IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 49 Pith papers
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
  REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
- From Context to Skills: Can Language Models Learn from Context Skillfully?
  Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
- Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
  DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
- EXG: Self-Evolving Agents with Experience Graphs
  EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
- PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
  PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
  MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
- RewardHarness: Self-Evolving Agentic Post-Training
  RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
- Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
  Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
- Meta-Harness: End-to-End Optimization of Model Harnesses
  Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
- EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
  EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
- Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
  VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
- PACE: Two-Timescale Self-Evolution for Small Language Model Agents
  PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
- Towards Direct Evaluation of Harness Optimizers via Priority Ranking
  Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
- EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
  EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.
- Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
  Life-Harness evolves reusable runtime interventions from training failures to improve frozen LLM agents by 88.5% on average across 126 settings in seven deterministic environments while transferring across 18 model backbones.
- PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
  PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...
- Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
  RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- SkillEvolver: Skill Learning as a Meta-Skill
  A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
- Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
  MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.
- FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
  FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
- AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
  AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
  LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
  ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
  SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
- AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
  AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
- EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
  EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
- Configuring Agentic AI Coding Tools: An Exploratory Study
  Developers overwhelmingly rely on simple static context files such as AGENTS.md to configure agentic AI coding tools, while advanced mechanisms like skills and subagents see very low adoption.
- Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
  ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
- APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents
  APEX maintains an explicit strategy space via a DAG with fork discovery and policy selection to sustain exploration in self-evolving LLM agents and reports outperformance on Jericho games and WebArena.
- Code as Agent Harness
  A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
- FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
  FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on ...
- Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
  In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.
- Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
  Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
- MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
  MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.
- Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV
  A joint optimization approach using SOCP for UAV trajectories, DRL-LLM for resource scheduling, and LP for offloading achieves higher task success rates and system efficiency than multi-agent RL baselines in simulated...
- AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
  AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.
- Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
  Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
- Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
  Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
- How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
  Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
- A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems
  A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
- AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
  AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.
- Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants
  Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.