AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
Pith reviewed 2026-05-08 04:26 UTC · model grok-4.3
The pith
Embodied AI agents reuse cached plans to skip most LLM calls, cutting latency by 65 percent and raising success rates by 22 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. AgenticCache maintains a runtime cache of frequent plan transitions that agents query directly, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Evaluated across four multi-agent embodied benchmarks and three models, this yields an average 22 percent gain in task success rate, a 65 percent reduction in simulation latency, and a 50 percent drop in token usage.
What carries the argument
A runtime cache of frequent plan transitions, queried by agents at each step and kept fresh by an asynchronous background Cache Updater that validates entries without blocking execution.
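The mechanism can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class `PlanTransitionCache`, the functions `next_plan` and `cache_updater`, and the representation of plans as plain strings are all assumptions made for this sketch. The miss fallback and the non-blocking validation follow the behavior described in the simulated rebuttal on this page.

```python
import queue
import threading
from collections import Counter, defaultdict
from typing import Callable, Dict, Optional


class PlanTransitionCache:
    """Maps a current plan to its most frequently observed successor (hypothetical design)."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)
        self._lock = threading.Lock()

    def record(self, current: str, nxt: str) -> None:
        with self._lock:
            self._counts[current][nxt] += 1

    def overwrite(self, current: str, nxt: str) -> None:
        # Replace a stale entry with the freshly validated successor.
        with self._lock:
            self._counts[current] = Counter({nxt: 1})

    def lookup(self, current: str) -> Optional[str]:
        with self._lock:
            successors = self._counts.get(current)
            if not successors:
                return None
            return successors.most_common(1)[0][0]


def next_plan(cache: PlanTransitionCache, current: str,
              llm_plan: Callable[[str], str], pending: queue.Queue) -> str:
    """Agent-side step: use the cache on a hit, call the LLM only on a miss."""
    cached = cache.lookup(current)
    if cached is not None:
        pending.put((current, cached))  # validate later, off the critical path
        return cached
    fresh = llm_plan(current)  # cache miss: blocking LLM call
    cache.record(current, fresh)
    return fresh


def cache_updater(cache: PlanTransitionCache,
                  llm_plan: Callable[[str], str], pending: queue.Queue) -> None:
    """Background loop: re-check cached transitions with the LLM until None arrives."""
    while True:
        item = pending.get()
        if item is None:
            break
        current, cached = item
        fresh = llm_plan(current)
        if fresh != cached:
            cache.overwrite(current, fresh)
        else:
            cache.record(current, fresh)


if __name__ == "__main__":
    def stub_llm(current: str) -> str:
        # Stand-in for a real LLM planner (hypothetical).
        return f"after({current})"

    cache = PlanTransitionCache()
    pending: queue.Queue = queue.Queue()
    worker = threading.Thread(target=cache_updater, args=(cache, stub_llm, pending))
    worker.start()
    for _ in range(3):
        print(next_plan(cache, "goto_kitchen", stub_llm, pending))
    pending.put(None)  # sentinel: stop the updater
    worker.join()
```

In this sketch only the first call for a given plan blocks on the LLM. Later calls return the cached successor immediately and leave validation to the worker thread.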
If this is right
- Agents finish tasks with far fewer direct LLM queries per episode.
- Simulation runs finish much faster because planning no longer waits on model responses each step.
- Token budgets stretch further, supporting longer or more complex scenarios at the same cost.
- Task success improves on average, suggesting that reused cached plans can be more consistent than fresh generations.
Where Pith is reading between the lines
- Similar locality patterns may appear in other sequential tasks such as web navigation or game playing, allowing the same cache approach outside embodied settings.
- Cache size and update rate could be adjusted on the fly based on observed hit rates to balance freshness against overhead.
- Pairing the cache with smaller local models for initial population might reduce dependence on large LLMs even more.
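The second point above, tuning the update rate from observed hit rates, could look roughly like the following. This is a speculative sketch, not something the paper describes: `HitRateMonitor`, `refresh_interval`, the sliding-window size, and the linear mapping from hit rate to validation interval are all hypothetical choices.

```python
from collections import deque


class HitRateMonitor:
    """Tracks the cache hit rate over a sliding window of recent lookups."""

    def __init__(self, window: int = 100) -> None:
        self._events = deque(maxlen=window)

    def observe(self, hit: bool) -> None:
        self._events.append(hit)

    def hit_rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)


def refresh_interval(hit_rate: float, min_steps: int = 1, max_steps: int = 20) -> int:
    """Linearly map hit rate to steps between validations.

    A low hit rate means validating every few steps; a high hit rate
    means validating rarely. The bounds are arbitrary placeholders.
    """
    clamped = min(max(hit_rate, 0.0), 1.0)
    return round(min_steps + clamped * (max_steps - min_steps))
```

A linear mapping is the simplest option. Any monotone function of the hit rate would serve the same purpose of trading freshness against updater overhead.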
Load-bearing premise
Embodied tasks display strong plan locality, so cached transitions can be reused safely without frequent errors or drift from good behavior.
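One way to make this premise testable is to measure how often the next plan matches the most frequent successor of the current plan in a recorded trace. The function below is an illustrative proxy under that assumption; the name `plan_locality` and the metric itself are not taken from the paper, and the score is computed in-sample, so it is an optimistic estimate.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def plan_locality(trace: Sequence[str]) -> float:
    """Fraction of transitions predicted by each plan's most frequent successor."""
    successors: Dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(trace, trace[1:]):
        successors[current][nxt] += 1

    total = sum(sum(counts.values()) for counts in successors.values())
    if total == 0:
        return 0.0  # fewer than two plans: no transitions to score

    predictable = sum(counts.most_common(1)[0][1] for counts in successors.values())
    return predictable / total
```

A value near 1.0 would indicate that a transition cache can answer most steps, while a value near the chance level would correspond to the unpredictable setting where per-step LLM planning is expected to win.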
What would settle it
A new embodied benchmark where plans shift unpredictably at every step, causing AgenticCache to lose its latency and token savings or to post lower success rates than per-step LLM planning.
Original abstract
Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that embodied AI tasks exhibit strong plan locality, enabling reuse of cached plan transitions to avoid per-step LLM calls. It introduces AgenticCache, a framework with a runtime cache queried by agents and a background Cache Updater that asynchronously validates and refines entries via LLM calls. On four multi-agent embodied benchmarks across 12 configurations (4 benchmarks x 3 models), it reports average gains of 22% in task success rate, 65% reduction in simulation latency, and 50% lower token usage. Code is released at a GitHub link.
Significance. If substantiated, the results could meaningfully advance practical LLM-based planning for embodied agents by addressing latency and cost bottlenecks through cache reuse. The premise of plan locality in embodied tasks is a plausible and useful insight, and releasing code supports reproducibility. However, the absence of key experimental details limits the immediate assessability of the contribution.
major comments (2)
- [Abstract] The headline claims of 22% success-rate improvement, 65% latency reduction, and 50% token reduction are presented without any description of the baselines, the precise definition of success rate or latency, error bars, statistical significance tests, or the handling of cache misses and validation failures. These omissions are load-bearing for the central empirical claim.
- [Abstract] No cache hit-rate statistics, invalidation frequency, or ablation that disables the Cache Updater while retaining the cache are reported. This leaves open whether the observed gains are causally due to safe plan reuse under strong locality or to unablated factors such as extra LLM work by the updater.
minor comments (1)
- [Abstract] The parenthetical '(4 benchmarks x 3 models)' would benefit from naming the specific benchmarks and models in the summary paragraph for immediate readability, even if they are detailed later.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. The feedback identifies important omissions that affect how readily the empirical claims can be evaluated. We address each point below and will revise the manuscript accordingly to improve clarity and substantiation.
- Referee: [Abstract] The headline claims of 22% success-rate improvement, 65% latency reduction, and 50% token reduction are presented without any description of the baselines, the precise definition of success rate or latency, error bars, statistical significance tests, or the handling of cache misses and validation failures. These omissions are load-bearing for the central empirical claim.
  Authors: We agree that the abstract would be strengthened by including these details. In the revision we will briefly define the baselines as standard per-step LLM planning without any caching mechanism. Success rate is the fraction of tasks completed successfully within the allotted step budget, latency is the average wall-clock simulation time per task, and token usage counts all LLM tokens consumed. Error bars (standard deviation over five random seeds) and statistical significance (paired t-tests) appear in the main results; we will note this in the abstract. Cache misses fall back to an immediate LLM call, while validation failures are resolved asynchronously by the Cache Updater without blocking agent execution. Concise versions of these clarifications will be added to the abstract.
  Revision: yes
- Referee: [Abstract] No cache hit-rate statistics, invalidation frequency, or ablation that disables the Cache Updater while retaining the cache are reported. This leaves open whether the observed gains are causally due to safe plan reuse under strong locality or to unablated factors such as extra LLM work by the updater.
  Authors: We acknowledge that these supporting analyses are absent from the abstract. We will add cache hit-rate statistics and invalidation frequencies to the results section and provide a brief summary in the abstract. We will also incorporate an ablation that disables the asynchronous Cache Updater while retaining the static cache; this will quantify the updater's contribution and confirm that the reported gains derive primarily from plan locality and safe reuse. The abstract will be updated to reference these additions.
  Revision: yes
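The statistics named in the rebuttal (standard deviation over seeds and a paired t-test between baseline and cached runs) can be computed as in the sketch below. It uses only the Python standard library; the function names are hypothetical, and any numbers passed in would have to come from actual per-seed measurements, which this page does not report.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. over per-seed success rates."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(baseline: Sequence[float],
                       treatment: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs.

    Each index must refer to the same seed and configuration in both
    sequences; otherwise the pairing assumption does not hold.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

The returned statistic would then be compared against a t distribution with the given degrees of freedom to obtain a p-value, for example with a statistics package of the reader's choice.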
Circularity Check
No circularity: empirical measurements on benchmarks
The paper introduces AgenticCache as a practical caching system for LLM-based planning in embodied agents and reports measured gains (22% success, 65% latency, 50% tokens) across four benchmarks and three models. These outcomes are obtained by direct execution on the benchmarks rather than by any derivation, equation, or fitted parameter that reduces to the input by construction. The central premise of plan locality is stated as an empirical observation that motivates the design; it is not derived from prior self-citations or defined circularly in terms of the cache hit rate itself. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the presented material.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one.
invented entities (1)
- AgenticCache framework with runtime cache and background Cache Updater (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Clone the repository with submodules:
    git clone --recursive https://github.com/hojoonleokim/MLSys26_AgenticCache.git
    cd MLSys26_AgenticCache
-
[2]
Create conda environments from the environment.yml in each submodule (on the baseline branch):
    P=MLSys26_AgenticCache
    conda env create -f $P-COHERENT/environment.yml
    conda env create -f $P-CoELA/environment.yml
    conda env create -f $P-COMBO/environment.yml
This creates conda environments named coherent, tdw, and combo, respectively.
-
[3]
Set the OPENAI_API_KEY environment variable for GPT-5 access
-
[4]
(CoELA & COMBO only) Set up the X server for TDW. Kill any existing display server processes, then start Xorg:
    # Kill existing Xorg / gnome-shell
    sudo kill -9 <PID_of_Xorg>
    sudo kill -9 <PID_of_gnome-shell>
    # Start X server on display :1
    sudo nohup Xorg :1 -config /etc/X11/xorg-1.conf &
See the TDW server setup guide for generating xorg.conf files.
-
[5]
The final checkpoint modl-100.pt is used by all evaluation branches.
(COMBO only) To reproduce the vision model from scratch, switch to the training-code branch and run the training pipeline:
    cd MLSys26_AgenticCache-COMBO
    git checkout training-code
    cd AVDC/flowdiffusion
    bash train_all.sh
The pipeline consists of four steps: (1) conda env setup, (2) training data generation via TDW (requires DISPLAY=:1), (3) text embeddin...
-
[6]
AgenticCache (agenticcache branch) matches or outperforms the baseline in task success rate
-
[7]
AgenticCache reduces simulation latency compared to the synchronous baseline
-
[8]
AgenticCache reduces total token usage
-
[9]
The parallel and speculative variants show distinct trade-offs compared to the baseline and AgenticCache.
A.7 Experiment Customization
Reviewers may customize the evaluation as follows:
• Run a single branch: Instead of the automated scripts, manually check out a specific branch and run the per-benchmark script (e.g., scripts/test_LMs-gpt-5.sh for CoELA)....
-
[10]
Prefer actions that immediately advance your own recipe. - You can identify the needed ingredient by checking your recipe, action history and dish
-
[11]
Otherwise, help the other agent without blocking the cutting board. - You can identify the needed ingredient by checking the other agent’s recipe and dish
-
[12]
Use your private region to temporarily store items or to prevent congestion on the shared cutting board.
  ### Progress
  You’ve taken **{steps_taken-1}/60** steps. Steps remaining: **{60 - (steps_taken-1)}**.
  Recipe (stack in order):
    - Yours: {recipe_strs[agent_id]}
    - Other agent: {recipe_strs[1 - agent_id]}
  Action History (excluding WAIT): #ACTION_HISTORY#...
-
[13]
Prefer placing your correct piece into the puzzle box. - You can decide which piece to place by checking your puzzle box
-
[14]
If (1) is not possible, pass the piece via a shared border. - You can choose which border to use by checking the other agents’ puzzle boxes
-
[15]
Use private regions to temporarily store pieces or to prevent congestion of the shared borders.
  ### Progress
  You’ve taken **{steps_taken-1}/60** steps. Steps remaining: **{60 - (steps_taken-1)}**.
  Action History (excluding WAIT): #ACTION_HISTORY#
  Possible Actions: #POSSIBLE_ACTIONS#
  Output (strictly one line, no reasoning):
  Next action: <one of the liste...
-
[16]
Copy the action string **verbatim** from the list if you can act
-
[17]
Do **not** add extra words, numbers, or multi-line explanations
-
[18]
Do **not** invent or rephrase actions
-
[19]
Always think step by step, but output only one final line in the required format
-
[20]
If you finished the oracle instruction, output [wait]. Figure 14. Prompt for BEHAVIOR-1K