AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

Hojoon Kim; Thierry Tambe; Yuheng Wu

arxiv: 2604.24039 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-08 04:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: embodied AI · LLM planning · caching · asynchronous updates · multi-agent systems · latency reduction · task success

The pith

Embodied AI agents reuse cached plans to skip most LLM calls, cutting latency by 65 percent while raising success rates by 22 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that embodied tasks have strong plan locality, so the next plan is largely predictable from the current one. AgenticCache therefore maintains a runtime cache of frequent plan transitions that agents query directly instead of calling the LLM at every step. A background Cache Updater runs asynchronously to validate and refine those cached entries with fresh LLM calls. Across four multi-agent embodied benchmarks and three models, the approach delivers 22 percent higher task success on average, 65 percent lower simulation latency, and 50 percent fewer tokens. This reuse turns the per-step LLM bottleneck into a practical, low-cost planning loop.

Core claim

Embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. AgenticCache maintains a runtime cache of frequent plan transitions that agents query directly, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Evaluated across four multi-agent embodied benchmarks and three models, this yields an average 22 percent gain in task success rate, a 65 percent reduction in simulation latency, and a 50 percent drop in token usage.

What carries the argument

A runtime cache of frequent plan transitions, queried by agents at each step and kept fresh by an asynchronous background Cache Updater that validates entries without blocking execution.
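
To make the mechanism concrete, here is a minimal sketch of a plan-transition cache with an LLM fallback. It is an illustration under stated assumptions, not the authors' implementation: the names `PlanTransitionCache`, `record`, `lookup`, and `next_plan` are hypothetical, plans are modeled as plain strings, and the "most frequent successor wins" policy is one simple reading of "frequent plan transitions".

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Optional, Tuple


class PlanTransitionCache:
    """Toy cache: maps a current plan to counts of observed next plans."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)

    def record(self, current: str, nxt: str) -> None:
        # Count one observed transition current -> nxt.
        self._counts[current][nxt] += 1

    def lookup(self, current: str) -> Optional[str]:
        # Return the most frequent successor, or None on a cache miss.
        successors = self._counts.get(current)
        if not successors:
            return None
        return successors.most_common(1)[0][0]


def next_plan(
    cache: PlanTransitionCache,
    current: str,
    llm_planner: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (plan, was_cache_hit). The LLM is called only on a miss."""
    cached = cache.lookup(current)
    if cached is not None:
        return cached, True
    # Assumption: a miss falls back to a blocking planner call,
    # and the result is stored for later reuse.
    fresh = llm_planner(current)
    cache.record(current, fresh)
    return fresh, False
```

In this sketch `llm_planner` can be any callable, so a stub such as `lambda plan: "goto kitchen"` is enough to exercise the hit and miss paths without a model.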

If this is right

Agents finish tasks with far fewer direct LLM queries per episode.
Simulation runs finish much faster because planning no longer waits on model responses each step.
Token budgets stretch further, supporting longer or more complex scenarios at the same cost.
Task success improves on average, showing that reused cached plans can be more consistent than fresh generations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar locality patterns may appear in other sequential tasks such as web navigation or game playing, allowing the same cache approach outside embodied settings.
Cache size and update rate could be adjusted on the fly based on observed hit rates to balance freshness against overhead.
Pairing the cache with smaller local models for initial population might reduce dependence on large LLMs even more.
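
The second extension above (tuning the update rate from observed hit rates) could look roughly like the following. This is a hypothetical controller, not something the paper describes: the function name, the doubling/halving rule, the 0.8 target, and the interval bounds are all assumptions chosen for illustration.

```python
def adapt_update_interval(
    interval: int,
    hit_rate: float,
    target: float = 0.8,
    min_interval: int = 1,
    max_interval: int = 32,
) -> int:
    """Return a new validation interval (in agent steps).

    Hypothetical rule: when the cache is hitting at or above the target
    rate, validate half as often; otherwise validate twice as often.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    if hit_rate >= target:
        interval = interval * 2
    else:
        interval = interval // 2
    # Clamp so the updater never stops entirely or fires more than once per step.
    return max(min_interval, min(max_interval, interval))
```

A multiplicative rule is used here only because it reacts quickly in both directions; whether such a controller helps in practice is an open question that the paper does not address.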

Load-bearing premise

Embodied tasks display strong plan locality, so cached transitions can be reused safely without frequent errors or drift from good behavior.
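
One way a reader might probe this premise on their own traces is to measure how often the next plan is the most common successor of the current plan. The sketch below is an assumed diagnostic, not a metric defined in the paper; `top1_predictability` is a hypothetical name, and because it scores a trace against statistics drawn from the same trace, it gives an optimistic upper bound.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def top1_predictability(trace: Sequence[str]) -> float:
    """Fraction of transitions whose successor is the most frequent
    successor observed for that predecessor (in-sample estimate)."""
    successors: Dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(trace, trace[1:]):
        successors[current][nxt] += 1

    total = sum(sum(counts.values()) for counts in successors.values())
    if total == 0:
        return 0.0  # Fewer than two plans: no transitions to score.

    # For each predecessor, only its single most common successor counts.
    matched = sum(max(counts.values()) for counts in successors.values())
    return matched / total
```

A value near 1.0 would suggest strong locality in that trace; a value near the reciprocal of the number of distinct plans would suggest that a transition cache has little to exploit.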

What would settle it

A new embodied benchmark where plans shift unpredictably at every step, causing AgenticCache to lose its latency and token savings or to post lower success rates than per-step LLM planning.

Figures

Figures reproduced from arXiv: 2604.24039 by Hojoon Kim, Thierry Tambe, Yuheng Wu.

**Figure 1.** Overview of AgenticCache. (a) Embodied AI agent framework. (b) Evaluation highlights on GPT-5.

… LLM later verifies. Yet, these approaches still rely on LLM calls at every step, leaving runtime overhead. In practice, the next plan is often predictable from the local context, a property we refer to as plan locality (Sutton et al., 1998). For example, once an object has been grasped, placing it at the target …

**Figure 2.** Latency breakdown across agents and benchmarks.

… first perceives the environment by gathering observations, tracking task goals, and maintaining memory. It then plans by decomposing long-horizon objectives into subgoals. Finally, it acts by executing these actions in the environment. The environment is then updated, yielding new observations for the next round of perception and planning. LLM-Powered Embodi…

**Figure 3.** Comparison of four planning strategies. (a) Synchronous plan-act loop. (b) Parallelized planning-acting. (c) Speculative planning. (d) AgenticCache.

… such as moving back or undoing manipulations, takes time. Limitations of Existing Parallel Planning. A limitation of both methods is their reliance on repeated LLM queries for plan generation, so the LLM cost scales linearly with the trajectory length. Neither…

**Figure 5.** Pattern-based agents exploit plan locality but suffer large performance gaps without context-aware updates.

… situations arise. This need for both efficiency and contextual adaptability motivates the design of AgenticCache. 4 AGENTICCACHE DESIGN. In this section, we present AgenticCache’s design, covering the cache as a local planner (Section 4.1), the asynchronous Cache Updater (Section 4.2), an optional war…

**Figure 6.** Runtime example of AgenticCache execution.

… The updater then (a) adds a new or updated transition for p_t → p′_{t+k}, (b) decreases the counts of the mispredicted transition, and (c) replaces the ongoing plan with p′_{t+k} if it is executable. This immediate replacement preserves robustness under stale cache hits. Rather than waiting for the current cached plan to fully terminate, the agent switches to the c…

**Figure 7.** Snapshots from the four benchmark environments, with agents highlighted in red.

… start results without prefilling can be found in Section 5.4. Benchmarks …

**Figure 9.** Ablation of AgenticCache components on TDW-MAT, comparing static cache, cache updates only, plan replacement only, and the full system.

… 5.8 Cache Validity Analysis. Experimental Setup. To evaluate the reliability of cached plans over time, we measure the Plan Execution Accuracy. At each frame, an action is judged correct if it matches the plan that GPT-5 would have selected in the same state. The metric is …

**Figure 11.** Prompt for TDW-MAT

**Figure 12.** Prompt for TDW-COOK

**Figure 13.** Prompt for TDW-GAME

**Figure 14.** Prompt for BEHAVIOR-1K
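
The three updater actions quoted in the Figure 6 excerpt (add the corrected transition, decrease the count of the mispredicted one, replace the ongoing plan if the new one is executable) can be sketched as a single reconciliation step. This is a reading of that excerpt under assumptions: the function name `reconcile`, the string-valued plans, the unit increments and decrements, and the `is_executable` callback are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict

Transitions = Dict[str, Counter]


def reconcile(
    transitions: Transitions,
    source_plan: str,
    cached_plan: str,
    llm_plan: str,
    ongoing_plan: str,
    is_executable: Callable[[str], bool],
) -> str:
    """Apply the updater's verdict and return the plan the agent should run."""
    if llm_plan == cached_plan:
        # The LLM agrees with the cache: nothing to correct.
        return ongoing_plan

    # (a) Add or strengthen the transition the LLM preferred.
    transitions[source_plan][llm_plan] += 1

    # (b) Weaken the mispredicted transition, never going below zero.
    if transitions[source_plan][cached_plan] > 0:
        transitions[source_plan][cached_plan] -= 1

    # (c) Switch plans right away, but only if the new plan can run now.
    return llm_plan if is_executable(llm_plan) else ongoing_plan
```

Note that in this sketch the cache counts are updated even when the replacement plan is not executable, so the correction still benefits later lookups.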

Original abstract

Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
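
The "asynchronous" part of the abstract is the key systems idea: the agent enqueues validation work and keeps acting without awaiting it. A minimal `asyncio` sketch of that control flow follows. Everything here is a stand-in: `slow_llm_validate` simply sleeps and agrees with the cache, the dict-based cache always hits, and the function names are hypothetical rather than taken from the released code.

```python
import asyncio
from typing import Dict, List, Tuple


async def slow_llm_validate(current: str, cached_next: str) -> str:
    """Stand-in for an LLM call (hypothetical): sleeps, then agrees."""
    await asyncio.sleep(0.01)
    return cached_next


async def updater(queue: asyncio.Queue, cache: Dict[str, str], log: List[str]) -> None:
    """Background worker: validates queued transitions one at a time."""
    while True:
        item = await queue.get()
        if item is None:  # Sentinel: the episode is over.
            break
        current, cached_next = item
        cache[current] = await slow_llm_validate(current, cached_next)
        log.append(current)


async def run_episode(
    cache: Dict[str, str], start: str, steps: int
) -> Tuple[List[str], List[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    validated: List[str] = []
    worker = asyncio.create_task(updater(queue, cache, validated))

    executed: List[str] = []
    plan = start
    for _ in range(steps):
        executed.append(plan)
        nxt = cache[plan]               # Cache hit assumed for brevity.
        queue.put_nowait((plan, nxt))   # Request validation; do not await it.
        await asyncio.sleep(0)          # Stand-in for acting in the environment.
        plan = nxt

    queue.put_nowait(None)
    await worker                        # Drain pending validations before exit.
    return executed, validated
```

Running `asyncio.run(run_episode(cache, "explore", 4))` with a small cyclic cache shows the agent finishing its steps while validations complete in the background; the agent loop never waits on `slow_llm_validate` directly.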

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgenticCache shows a workable cache-plus-async pattern for cutting LLM calls in embodied planning with reported gains, but the results rest on unmeasured plan locality and lack key ablations.

The core of this paper is a runtime cache that stores common plan transitions for LLM-based embodied agents, refreshed in the background by an asynchronous Cache Updater. Agents pull from the cache to skip per-step model calls, and the updater validates or refines entries without blocking execution. Across four multi-agent benchmarks and three models, the authors measure 22% higher average task success, 65% lower simulation latency, and 50% fewer tokens used. Code is released, which is helpful for anyone wanting to replicate or adapt the setup.

The engineering pattern is straightforward and directly targets the latency and cost problems that come with running planners in loops. It gives practitioners a concrete way to exploit whatever plan reuse exists in their tasks.

The soft spots are in the supporting measurements. The abstract and stress-test note give no cache hit rates, no count of updater interventions or invalidations, and no ablation that keeps the updater running but turns off the cache. Without those, it is hard to know how much of the success lift comes from safe reuse versus the extra LLM work happening asynchronously. The central claim of strong plan locality is asserted but not backed by transition statistics or error analysis on cache misses. This leaves open whether the gains would hold if locality were weaker.

The work is aimed at teams building or deploying LLM planners for embodied or multi-agent settings where per-step inference is a bottleneck. Readers who need practical systems ideas for reducing compute in agent loops will get value from the architecture and numbers, even if they have to add their own diagnostics. I would bring it to a reading group to look at the full experiments and code. It deserves peer review because the performance claims are large enough and the idea is simple to test, though referees would likely ask for the missing cache metrics and ablations.
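
The diagnostics this note asks for (hit rates, updater interventions, invalidations) are cheap to collect. A sketch of the bookkeeping a reader could bolt onto their own agent loop is shown below; the `CacheStats` class and its field names are hypothetical and are not part of the released code as far as this page shows.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counters for cache lookups and background validations."""

    hits: int = 0
    misses: int = 0
    validations: int = 0
    invalidations: int = 0

    def record_lookup(self, hit: bool) -> None:
        # Call once per agent step that consults the cache.
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def record_validation(self, agreed: bool) -> None:
        # Call once per updater check; a disagreement is an invalidation.
        self.validations += 1
        if not agreed:
            self.invalidations += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def invalidation_rate(self) -> float:
        return self.invalidations / self.validations if self.validations else 0.0
```

Reporting `hit_rate` and `invalidation_rate` per benchmark and model would help separate gains from safe reuse from gains due to the updater's extra LLM work.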

Referee Report

2 major / 1 minor

Summary. The paper claims that embodied AI tasks exhibit strong plan locality, enabling reuse of cached plan transitions to avoid per-step LLM calls. It introduces AgenticCache, a framework with a runtime cache queried by agents and a background Cache Updater that asynchronously validates and refines entries via LLM calls. On four multi-agent embodied benchmarks across 12 configurations (4 benchmarks x 3 models), it reports average gains of 22% in task success rate, 65% reduction in simulation latency, and 50% lower token usage. Code is released at a GitHub link.

Significance. If substantiated, the results could meaningfully advance practical LLM-based planning for embodied agents by addressing latency and cost bottlenecks through cache reuse. The premise of plan locality in embodied tasks is a plausible and useful insight, and releasing code supports reproducibility. However, the absence of key experimental details limits the immediate assessability of the contribution.

major comments (2)

[Abstract] The headline claims of 22% success-rate improvement, 65% latency reduction, and 50% token reduction are presented without any description of the baselines, the precise definition of success rate or latency, error bars, statistical significance tests, or the handling of cache misses and validation failures. These omissions are load-bearing for the central empirical claim.
[Abstract] No cache hit-rate statistics, invalidation frequency, or ablation that disables the Cache Updater while retaining the cache are reported. This leaves open whether the observed gains are causally due to safe plan reuse under strong locality or to unablated factors such as extra LLM work by the updater.

minor comments (1)

[Abstract] The parenthetical '(4 benchmarks x 3 models)' would benefit from naming the specific benchmarks and models in the summary paragraph for immediate readability, even if they are detailed later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The feedback identifies important omissions that affect how readily the empirical claims can be evaluated. We address each point below and will revise the manuscript accordingly to improve clarity and substantiation.

Referee: [Abstract] The headline claims of 22% success-rate improvement, 65% latency reduction, and 50% token reduction are presented without any description of the baselines, the precise definition of success rate or latency, error bars, statistical significance tests, or the handling of cache misses and validation failures. These omissions are load-bearing for the central empirical claim.

Authors: We agree that the abstract would be strengthened by including these details. In the revision we will briefly define the baselines as standard per-step LLM planning without any caching mechanism. Success rate is the fraction of tasks completed successfully within the allotted step budget, latency is the average wall-clock simulation time per task, and token usage counts all LLM tokens consumed. Error bars (standard deviation over five random seeds) and statistical significance (paired t-tests) appear in the main results; we will note this in the abstract. Cache misses fall back to an immediate LLM call, while validation failures are resolved asynchronously by the Cache Updater without blocking agent execution. Concise versions of these clarifications will be added to the abstract.

Revision: yes

Referee: [Abstract] No cache hit-rate statistics, invalidation frequency, or ablation that disables the Cache Updater while retaining the cache are reported. This leaves open whether the observed gains are causally due to safe plan reuse under strong locality or to unablated factors such as extra LLM work by the updater.

Authors: We acknowledge that these supporting analyses are absent from the abstract. We will add cache hit-rate statistics and invalidation frequencies to the results section and provide a brief summary in the abstract. We will also incorporate an ablation that disables the asynchronous Cache Updater while retaining the static cache; this will quantify the updater's contribution and confirm that the reported gains derive primarily from plan locality and safe reuse. The abstract will be updated to reference these additions.

Revision: yes
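
For readers who want to check the kind of significance claim the simulated rebuttal mentions (paired t-tests over five seeds), the test statistic is simple to compute by hand. The sketch below uses only the standard library; the function name is hypothetical, and any numbers passed to it in practice should come from real paired runs, since this page reports no per-seed data.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic: mean difference over its standard error.

    Each index must hold results from the same seed/configuration.
    Degrees of freedom are len(treated) - 1.
    """
    if len(treated) != len(baseline):
        raise ValueError("samples must be paired (equal length)")
    if len(treated) < 2:
        raise ValueError("need at least two pairs")

    diffs = [t - b for t, b in zip(treated, baseline)]
    spread = stdev(diffs)  # Sample standard deviation of the differences.
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

With five seeds the statistic has four degrees of freedom, so it would be compared against a t distribution with df = 4 to obtain a p-value; that lookup is left out here to keep the sketch dependency-free.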

Circularity Check

0 steps flagged

No circularity: empirical measurements on benchmarks

The paper introduces AgenticCache as a practical caching system for LLM-based planning in embodied agents and reports measured gains (22% success, 65% latency, 50% tokens) across four benchmarks and three models. These outcomes are obtained by direct execution on the benchmarks rather than by any derivation, equation, or fitted parameter that reduces to the input by construction. The central premise of plan locality is stated as an empirical observation that motivates the design; it is not derived from prior self-citations or defined circularly in terms of the cache hit rate itself. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the presented material.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption of plan locality and introduces the AgenticCache system itself; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one.
Explicitly stated as the foundation for reusing cached plan transitions.

invented entities (1)

AgenticCache framework with runtime cache and background Cache Updater (no independent evidence)
purpose: To store frequent plan transitions and asynchronously validate/refine them with LLM calls.
New system introduced to exploit plan locality; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1387 out tokens · 19021 ms · 2026-05-08T04:26:54.103147+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] Clone the repository with submodules: git clone --recursive https://github.com/hojoonleokim/MLSys26_AgenticCache.git; cd MLSys26_AgenticCache

[2] Create conda environments from the environment.yml in each submodule (on the baseline branch): P=MLSys26_AgenticCache; conda env create -f $P-COHERENT/environment.yml; conda env create -f $P-CoELA/environment.yml; conda env create -f $P-COMBO/environment.yml. This creates conda environments named coherent, tdw, and combo, respectively.

[3] Set the OPENAI_API_KEY environment variable for GPT-5 access.

[4] (CoELA & COMBO only) Set up the X server for TDW. Kill any existing display server processes (Xorg / gnome-shell): sudo kill -9 <PID_of_Xorg>; sudo kill -9 <PID_of_gnome-shell>. Then start the X server on display :1: sudo nohup Xorg :1 -config /etc/X11/xorg-1.conf &. See the TDW server setup guide for generating xorg.conf files.

[5] The final checkpoint modl-100.pt is used by all evaluation branches. (COMBO only) To reproduce the vision model from scratch, switch to the training-code branch and run the training pipeline: cd MLSys26_AgenticCache-COMBO; git checkout training-code; cd AVDC/flowdiffusion; bash train_all.sh. The pipeline consists of four steps: (1) conda env setup, (2) training data generation via TDW (requires DISPLAY=:1), (3) text embeddin...

[6] AgenticCache (agenticcache branch) matches or outperforms the baseline in task success rate.

[7] AgenticCache reduces simulation latency compared to the synchronous baseline.

[8] AgenticCache reduces total token usage.

[9] transport
The parallel and speculative variants show distinct trade-offs compared to the baseline and AgenticCache. A.7 Experiment Customization. Reviewers may customize the evaluation as follows: • Run a single branch: Instead of the automated scripts, manually check out a specific branch and run the per-benchmark script (e.g., scripts/test_LMs-gpt-5.sh for CoELA)....

[10] Prefer actions that immediately advance your own recipe. - You can identify the needed ingredient by checking your recipe, action history and dish

[11] Otherwise, help the other agent without blocking the cutting board. - You can identify the needed ingredient by checking the other agent’s recipe and dish

[12] place into puzzle box
Use your private region to temporarily store items or to prevent congestion on the shared cutting board. ### Progress You’ve taken **{steps_taken-1}/60** steps. Steps remaining: **{60 - (steps_taken-1)}**. Recipe (stack in order): - Yours: {recipe_strs[agent_id]} - Other agent: {recipe_strs[1 - agent_id]} Action History (excluding WAIT): #ACTION_HISTORY#...

[13] Prefer placing your correct piece into the puzzle box. - You can decide which piece to place by checking your puzzle box

[14] If (1) is not possible, pass the piece via a shared border. - You can choose which border to use by checking the other agents’ puzzle boxes

[15] Use private regions to temporarily store pieces or to prevent congestion of the shared borders. ### Progress You’ve taken **{steps_taken-1}/60** steps. Steps remaining: **{60 - (steps_taken-1)}**. Action History (excluding WAIT): #ACTION_HISTORY# Possible Actions: #POSSIBLE_ACTIONS# Output (strictly one line, no reasoning): Next action: <one of the liste...

[16] Copy the action string **verbatim** from the list if you can act

[17] Do **not** add extra words, numbers, or multi-line explanations

[18] Do **not** invent or rephrase actions

[19] Always think step by step, but output only one final line in the required format

[20] If you finished the oracle instruction, output [wait]. Figure 14. Prompt for BEHAVIOR-1K
