hub Canonical reference

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru · 2025 · cs.LG · arXiv 2510.04618

Canonical reference. 90% of citing Pith papers cite this work as background.

44 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 44 citing papers arXiv PDF

abstract

Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 method 1

citation-polarity summary

background 9 use method 1

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

EXG: Self-Evolving Agents with Experience Graphs

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.

Meta-Harness: End-to-End Optimization of Model Harnesses

cs.AI · 2026-03-30 · unverdicted · novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.

EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution

cs.HC · 2026-02-20 · unverdicted · novelty 7.0

EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

cs.CV · 2026-01-16 · conditional · novelty 7.0

VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.

PACE: Two-Timescale Self-Evolution for Small Language Model Agents

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

SkillEvolver: Skill Learning as a Meta-Skill

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.

ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.

SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within 100 steps.

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

cs.DB · 2026-04-17 · unverdicted · novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

citing papers explorer

Showing 44 of 44 citing papers.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 107 · 2 links · internal anchor
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
EXG: Self-Evolving Agents with Experience Graphs cs.AI · 2026-05-18 · unverdicted · none · ref 49 · internal anchor
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures cs.CL · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 69 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
RewardHarness: Self-Evolving Agentic Post-Training cs.AI · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception cs.AI · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.
Meta-Harness: End-to-End Optimization of Model Harnesses cs.AI · 2026-03-30 · unverdicted · none · ref 61 · internal anchor
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution cs.HC · 2026-02-20 · unverdicted · none · ref 17 · internal anchor
EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning cs.CV · 2026-01-16 · conditional · none · ref 69 · internal anchor
VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
PACE: Two-Timescale Self-Evolution for Small Language Model Agents cs.LG · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 17 · internal anchor
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning cs.CV · 2026-05-21 · unverdicted · none · ref 22 · internal anchor
EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 52 · internal anchor
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
SkillEvolver: Skill Learning as a Meta-Skill cs.AI · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 60 · 2 links · internal anchor
MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration cs.LG · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs cs.CL · 2026-04-24 · unverdicted · none · ref 31 · internal anchor
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
How Far Are Video Models from True Multimodal Reasoning? cs.CV · 2026-04-21 · unverdicted · none · ref 92 · internal anchor
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration cs.AI · 2026-04-20 · unverdicted · none · ref 8 · internal anchor
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis cs.AI · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization cs.AI · 2026-04-19 · unverdicted · none · ref 115 · internal anchor
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation cs.CL · 2026-04-17 · unverdicted · none · ref 25 · internal anchor
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yielding 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3 within 100 steps.
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation cs.DB · 2026-04-17 · unverdicted · none · ref 108 · internal anchor
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
Configuring Agentic AI Coding Tools: An Exploratory Study cs.SE · 2026-02-16 · unverdicted · none · ref 31 · internal anchor
Developers overwhelmingly rely on simple static context files such as AGENTS.md to configure agentic AI coding tools, while advanced mechanisms like skills and subagents see very low adoption.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory cs.AI · 2025-11-26 · unverdicted · none · ref 49 · internal anchor
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents cs.LG · 2026-05-20 · unverdicted · none · ref 27 · internal anchor
APEX maintains an explicit strategy space via a DAG with fork discovery and policy selection to sustain exploration in self-evolving LLM agents and reports outperformance on Jericho games and WebArena.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 17 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast cs.AI · 2026-05-15 · unverdicted · none · ref 25 · internal anchor
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP cs.AI · 2026-05-15 · conditional · none · ref 26 · internal anchor
In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph cs.CV · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV cs.NI · 2026-05-06 · unverdicted · none · ref 59 · internal anchor
A joint optimization approach using SOCP for UAV trajectories, DRL-LLM for resource scheduling, and LP for offloading achieves higher task success rates and system efficiency than multi-agent RL baselines in simulated dense urban IoV environments.
AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization cs.IR · 2026-04-21 · unverdicted · none · ref 25 · 2 links · internal anchor
AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems cs.SE · 2026-04-14 · unverdicted · none · ref 53 · internal anchor
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in OpenClaw's gateway context.
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents cs.AI · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent cs.AI · 2026-04-08 · unverdicted · none · ref 21 · internal anchor
Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems cs.CY · 2026-04-06 · unverdicted · none · ref 27 · internal anchor
A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.
Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants cs.SE · 2026-04-09 · unverdicted · none · ref 28 · internal anchor
Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents cs.AI · 2026-05-21 · unreviewed · ref 58 · internal anchor
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems cs.AI · 2026-05-20 · unreviewed · ref 27 · internal anchor
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unreviewed · ref 11 · internal anchor
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace cs.AI · 2026-05-11 · unreviewed · ref 55 · internal anchor
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications cs.IR · 2026-05-08 · unreviewed · ref 128 · 2 links · internal anchor
From Context to Skills: Can Language Models Learn from Context Skillfully? cs.AI · 2026-04-30 · unreviewed · ref 50 · internal anchor

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer