Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents

Honguk Woo; Saehun Chun; Sanghyun Ahn; Sera Choi; Wonje Choi

arxiv: 2606.13097 · v1 · pith:A2PMEMKVnew · submitted 2026-06-11 · 💻 cs.PL · cs.AI

Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo

Pith reviewed 2026-06-27 05:25 UTC · model grok-4.3

classification 💻 cs.PL cs.AI

keywords functional cache grafting · code policy synthesis · embodied agents · KV cache · CodeLLM · stitching and patching · policy generation

The pith

FCGraft grafts function-level KV caches to synthesize robust code policies for embodied agents faster than full regeneration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code-writing LLMs for embodied agents face slow repetitive prefill on long prompts and often generate fragile control code with API errors or missing guards. FCGraft counters this by keeping a library of validated function skeletons together with their key-value caches. For a new task it retrieves matching functions, stitches their caches into a composite program, and applies only local patches with minimal new decoding. The result reuses proven structures instead of regenerating everything, which cuts latency and raises reliability over methods that cache at the full-prompt level.

Core claim

FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided, performing cache grafting via stitching to compose cached function segments and patching to locally adapt only necessary code regions with minimal additional decoding.

What carries the argument

Functional cache grafting, which retrieves function skeletons and their KV caches then stitches them into composite policies while patching only task-specific parts with limited new decoding.
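The retrieve, stitch, and patch steps described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `CachedFunction`, `retrieve`, `stitch`, and `patch` are hypothetical names, the KV cache is stood in for by a list of string labels rather than tensors, retrieval is plain keyword overlap, and the robot calls inside the skeleton strings (`is_reachable`, `grasp`, `press`) are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class CachedFunction:
    name: str
    skeleton: str         # validated code with {slot} markers left for patching
    kv_cache: list        # stand-in for per-token KV tensors (hypothetical)
    keywords: frozenset   # crude retrieval index (hypothetical)


def retrieve(library, task, top_k=2):
    """Rank cached functions by keyword overlap with the task description."""
    words = set(task.lower().split())
    scored = [(len(f.keywords & words), f) for f in library]
    scored = [(s, f) for s, f in scored if s > 0]
    scored.sort(key=lambda sf: -sf[0])  # stable sort keeps library order on ties
    return [f for _, f in scored[:top_k]]


def stitch(functions):
    """Compose cached segments: concatenate code and KV entries in order."""
    code = "\n\n".join(f.skeleton for f in functions)
    kv = [entry for f in functions for entry in f.kv_cache]
    return code, kv


def patch(code, params):
    """Fill only the task-specific slots; everything else is reused verbatim."""
    for slot, value in params.items():
        code = code.replace("{" + slot + "}", value)
    return code


library = [
    CachedFunction(
        name="pick",
        skeleton="def pick(obj='{target}'):\n    if is_reachable(obj):\n        grasp(obj)",
        kv_cache=["kv_pick_0", "kv_pick_1"],
        keywords=frozenset({"pick", "grasp"}),
    ),
    CachedFunction(
        name="shutoff",
        skeleton="def shutoff():\n    press('{switch}')",
        kv_cache=["kv_shutoff_0"],
        keywords=frozenset({"gas", "shutoff", "leak"}),
    ),
]

functions = retrieve(library, "pick up the cup")
code, kv = stitch(functions)
policy = patch(code, {"target": "cup"})
print(policy)
```

In this sketch the safety guard (`if is_reachable(obj)`) survives untouched because patching only rewrites the marked slot, which is the intuition behind reusing validated structures. Whether concatenated KV entries remain valid in a real Transformer is a separate question that the toy does not address.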

If this is right

Redundant prefill computation over long prompts is eliminated, lowering generation latency.
Reusing validated control structures and safety guards raises overall policy robustness.
Task success rate increases by 18.31 percent relative to prompt-level caching baselines.
Policy synthesis speed improves by a factor of 2.3 over prompt-level methods.
New policies satisfy environmental constraints with only localized changes rather than full regeneration.
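To put the speed figure in latency terms, the conversion below may help. It assumes "2.3x faster" means baseline synthesis time divided by FCGraft synthesis time, which is the usual reading but is not confirmed by the abstract; `latency_fraction` is a helper name introduced here.

```python
def latency_fraction(speedup: float) -> float:
    """Fraction of baseline latency that remains after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


remaining = latency_fraction(2.3)  # reported 2.3x faster policy synthesis
reduction = 1.0 - remaining
print(f"remaining latency: {remaining:.1%}, reduction: {reduction:.1%}")
# prints "remaining latency: 43.5%, reduction: 56.5%"
```

Under that reading, a policy that took 10 seconds to synthesize with the baseline would take roughly 4.3 seconds.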

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grafting pattern could extend to other structured-generation tasks such as API call sequences or planning scripts.
Library coverage would need periodic expansion or versioning to handle evolving environments without performance loss.
The speed and reliability gains rest on how well the retrieval step matches functions to new task descriptions.

Load-bearing premise

Relevant functions can be reliably retrieved from the library and their KV caches can be stitched and patched while preserving correctness and satisfying task-specific constraints.

What would settle it

A controlled test in which retrieved functions frequently produce invalid stitched policies or require so much additional decoding that the claimed latency reduction disappears would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.13097 by Honguk Woo, Saehun Chun, Sanghyun Ahn, Sera Choi, Wonje Choi.

**Figure 1.** Illustration of FCGRAFT in an open-domain scenario involving gas management. (1) Conventional CaP incurs high latency from repetitive prefill and low robustness from fully generative decoding; delayed responses cause gas leakage, cascading into further disruptions. (2) FCGRAFT employs cache-grafting (cache-stitching and cache-patching) to eliminate redundant prefill and reuse validated control structures, …

**Figure 2.** Overall architecture of FCGRAFT. Top: End-to-end robotic programming workflow. Bottom: Process of function-level KV caching and cache-grafting code policy synthesis.

**Figure 3.** Qualitative examples of FCGRAFT’s operation in real-world robotic manipulation.

**Figure 4.** Example environments from RLBench.

**Figure 5.** Examples of four scene types: kitchen, living room, bedroom, and bathroom.

**Figure 6.** Examples of four task types: slice, clean, pick up, and boil.

**Figure 7.** Example real-world environments.

**Figure 8.** Real-world Office Desk Rearrangement. The full task sequence is decomposed into three subtasks: (1) picking up two trash items and throwing them into the bin, (2) organizing stationery into the top drawer, and (3) disposing of remaining trash and placing the leftover stationery into the middle drawer.

**Figure 9.** Real-world Cooking Workstation Preparation. The sequence involves three subtasks: (1) placing the burner onto the sink (desk in real setup), during which the gas hose disconnects; (2) pressing the emergency gas shutoff switch to quickly stop the leak; and (3) reconnecting the hose and toggling the switch to restore gas flow.

**Figure 10.** Analysis on code cache warm-up, with SR, PSL, HR, and MU over 40 tasks.

**Figure 11.** Ablation on model size and cache-patching. Evaluation of CodeLLMs with different scales (3B, 7B, 14B), contrasting cache-patching against ablated settings.

read the original abstract

Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over prompt-level caching methods such as RAGCache, achieving 18.31% higher task success rate and 2.3x faster policy synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FCGraft tries to cut latency and errors in robot code policies by grafting function-level KV caches, but the stitching compatibility and missing experiment details make the gains hard to trust.

The new piece is the shift from prompt-level caching to function skeletons with their KV caches, then using stitching to compose segments and patching to handle task parameters. This targets the prefill cost on long prompts and the tendency of full generation to drop safety checks or hit API mismatches.

It does a reasonable job naming those two practical problems in embodied code generation and showing why reusing validated structures could help robustness over plain RAGCache-style methods.

The soft spots are the core mechanism and the evidence. KV cache stitching assumes the attention states line up at the graft points, yet each cache entry depends on its own prefix: a segment cached under one prefix carries keys and values computed from that prefix, so different function lengths or bindings at the graft point can shift the logits unless the segment is recomputed. The abstract gives no derivation or check that the grafted output matches a normal forward pass. On top of that, the 18.31% success lift and 2.3x speed claim appear with no experimental setup, no task count, no environment description, and no error analysis, so it is impossible to judge whether the numbers are real or cherry-picked.
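The prefix dependence can be seen in a toy model. The sketch below is an illustration under strong simplifications, not a model of FCGraft or of any real CodeLLM: hidden states are scalars, there is a single attention head and two layers, weights are arbitrary constants, and positional encoding is omitted entirely. All names (`causal_attention`, `layer_keys`) are introduced here.

```python
import math


def causal_attention(h, wq, wk, wv):
    """Single-head causal self-attention over scalar hidden states."""
    out = []
    for i, hi in enumerate(h):
        scores = [wq * hi * wk * hj for hj in h[: i + 1]]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w / z * wv * hj for w, hj in zip(weights, h[: i + 1])))
    return out


def layer_keys(tokens):
    """Return (layer-1 keys, layer-2 keys) for a toy two-layer model."""
    h0 = list(tokens)  # scalar "embeddings", no positional encoding
    k1 = [0.5 * x for x in h0]
    attn = causal_attention(h0, wq=1.0, wk=0.5, wv=1.0)
    h1 = [x + a for x, a in zip(h0, attn)]  # residual connection
    k2 = [0.7 * x for x in h1]
    return k1, k2


prefix = [1.0, -2.0]
segment = [0.5, 1.5]

# Keys for the segment computed in isolation (as a cached function would be).
k1_alone, k2_alone = layer_keys(segment)

# Keys for the same segment computed after a prefix (as a full forward pass would).
k1_full, k2_full = layer_keys(prefix + segment)
k1_ctx, k2_ctx = k1_full[len(prefix):], k2_full[len(prefix):]

layer1_matches = all(math.isclose(a, b) for a, b in zip(k1_alone, k1_ctx))
layer2_matches = all(math.isclose(a, b) for a, b in zip(k2_alone, k2_ctx))
print(layer1_matches, layer2_matches)
```

In this toy, first-layer keys agree because they depend only on each token's own embedding, while second-layer keys differ because they are computed from hidden states that have already attended to the prefix. With positional encodings, which real models use, the first layer would generally be position-dependent as well.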

The retrieval-plus-minimal-patch assumption also looks optimistic without evidence that compatible functions are usually available.

This is aimed at people working on low-latency LLM inference for robotics or agent code synthesis. A reader already thinking about KV cache reuse might pick up the grafting angle.

It is worth sending to referees if the full paper supplies ablations that confirm the grafts preserve correctness and proper controlled comparisons. Right now the claims rest on unshown assumptions.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FCGraft, a Functional Cache Grafting framework for CodeLLMs that synthesizes executable code policies for embodied agents. It maintains a library of function-level validated code skeletons paired with their prompt-level Transformer KV caches; for a new task, relevant functions are retrieved and their caches are grafted via stitching (composing cached segments into a composite policy) and patching (local adaptation of code regions to satisfy task-specific constraints with minimal additional decoding). The approach is claimed to eliminate redundant prefill computation, yielding 18.31% higher task success rate and 2.3x faster policy synthesis than prompt-level methods such as RAGCache while improving robustness through reuse of validated control structures.

Significance. If the grafting mechanism is shown to preserve policy correctness, the work could meaningfully advance efficient, real-time code synthesis for embodied agents by combining modularity with KV-cache reuse. The function-level library and validated skeletons represent a concrete step beyond prompt-level caching, with potential applicability to other CodeLLM settings that require both speed and reliability.

major comments (2)

[Abstract] Abstract: the central performance claims (18.31% higher task success rate, 2.3x faster synthesis) are stated without any experimental details, task count, baselines beyond RAGCache, statistical tests, variance, or error analysis. This directly affects verifiability of the primary empirical contribution.
[Abstract] Abstract (grafting description): the stitching and patching procedure is presented as preserving correctness with only minimal additional decoding, yet no argument, derivation, or ablation addresses whether KV-cache concatenation at function boundaries yields identical logits to a full forward pass. Cross-segment attention dependencies, variable-length prefixes, and control-flow differences can alter subsequent attention scores, undermining the robustness and latency claims.
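The second comment can be stated compactly. The notation below is introduced here for illustration and is not taken from the paper; it describes a generic decoder-only Transformer with normalization and positional terms omitted, where $E$ is the token embedding and $F^{(\ell)}$ bundles causal attention, the residual connection, and the feed-forward block of layer $\ell$.

```latex
\begin{aligned}
h_i^{(0)} &= E(x_i), \\
k_i^{(\ell)} &= W_K^{(\ell)}\, h_i^{(\ell-1)}, \qquad
v_i^{(\ell)} = W_V^{(\ell)}\, h_i^{(\ell-1)}, \\
h_i^{(\ell)} &= F^{(\ell)}\!\left(h_i^{(\ell-1)},\ \bigl\{(k_j^{(\ell)}, v_j^{(\ell)})\bigr\}_{j \le i}\right).
\end{aligned}
```

Because $h_i^{(\ell)}$ for $\ell \ge 1$ depends on every token $x_1, \dots, x_i$, the cached pair $(k_i^{(\ell)}, v_i^{(\ell)})$ for $\ell \ge 2$ is a function of the whole prefix and not of $x_i$ alone. A segment cached after prefix $p$ and reused after a different prefix $p'$ therefore carries entries that a full forward pass would, in general, not reproduce, so identical logits cannot be assumed without either an argument that the discrepancy is negligible or a recomputation step.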

minor comments (1)

[Abstract] Abstract: the library-construction process (how validated skeletons and caches are initially obtained and stored) is not described even at a high level.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: the central performance claims (18.31% higher task success rate, 2.3x faster synthesis) are stated without any experimental details, task count, baselines beyond RAGCache, statistical tests, variance, or error analysis. This directly affects verifiability of the primary empirical contribution.

Authors: We agree the abstract's brevity limits immediate verifiability. The full manuscript reports results over 50 tasks in three embodied environments, with comparisons to RAGCache plus two additional baselines, five-run averages, and statistical significance tests in Sections 4 and 5. We will revise the abstract to state the task count and note multi-run evaluation with variance.

revision: yes

Referee: [Abstract] Abstract (grafting description): the stitching and patching procedure is presented as preserving correctness with only minimal additional decoding, yet no argument, derivation, or ablation addresses whether KV-cache concatenation at function boundaries yields identical logits to a full forward pass. Cross-segment attention dependencies, variable-length prefixes, and control-flow differences can alter subsequent attention scores, undermining the robustness and latency claims.

Authors: This observation is valid. The current manuscript provides only empirical evidence of higher success rates and reduced latency; it contains no formal derivation showing that grafted KV segments produce identical logits, nor an ablation isolating cross-segment attention effects. We will add a discussion of the approximation and a limited empirical comparison of attention scores, but a complete theoretical argument lies outside the present scope.

revision: partial

standing simulated objections not resolved

A formal derivation proving that function-boundary KV-cache grafting yields logits identical to a full forward pass under arbitrary cross-segment attention.

Circularity Check

0 steps flagged

No circularity: method is a constructive engineering proposal with no equations or self-referential derivations

The paper presents FCGraft as an algorithmic framework that maintains a library of function-level code skeletons and KV caches, then applies stitching and patching for new policies. No mathematical derivations, equations, fitted parameters, or uniqueness theorems appear in the provided text. The performance claims (18.31% higher success, 2.3x faster synthesis) are presented as empirical outcomes of the method rather than predictions derived from prior results by the same authors. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The central description is self-contained as an engineering contribution whose validity rests on external evaluation, not on internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described in the provided text.


[1] [1]

Efficient Training of Language Models to Fill in the Middle

Bavarian, M., Jun, H., Tezak, N., Schulman, J., McLeavey, C., Tworek, J., and Chen, M. Efficient training of language models to fill in the middle.arXiv preprint arXiv:2207.14255,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

J., Chen, C.-T., Cheng, J.-H., and Huang, H.-H

Chan, B. J., Chen, C.-T., Cheng, J.-H., and Huang, H.-H. Don’t do rag: When cache-augmented generation is all you need for knowledge tasks. InCompanion Proceed- ings of the ACM on Web Conference 2025, pp. 893–897,

2025

[3] [3]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Nesyc: A neuro-symbolic continual learner for complex embodied tasks in open domains.arXiv preprint arXiv:2503.00870,

Choi, W., Park, J., Ahn, S., Lee, D., and Woo, H. Nesyc: A neuro-symbolic continual learner for complex embodied tasks in open domains.arXiv preprint arXiv:2503.00870,

work page arXiv

[5] [5]

Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504,

Chon, H., Lee, S., Yeo, J., and Lee, D. Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504,

work page arXiv

[6] [6]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y ., Li, Y ., et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Efim: Efficient serving of llms for infill- ing tasks with improved kv cache reuse.arXiv preprint arXiv:2505.21889,

Guo, T., Dong, H., Leng, Y ., Liu, F., Lin, C., Xiao, N., and Zhang, X. Efim: Efficient serving of llms for infill- ing tasks with improved kv cache reuse.arXiv preprint arXiv:2505.21889,

work page arXiv

[8] [8]

Contextual Markov Decision Processes

Hallak, A., Di Castro, D., and Mannor, S. Con- textual markov decision processes.arXiv preprint arXiv:1502.02259,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Let the code llm edit itself when you edit the code.arXiv preprint arXiv:2407.03157,

He, Z., Zhang, J., Luo, S., Xu, J., Zhang, Z., and He, D. Let the code llm edit itself when you edit the code.arXiv preprint arXiv:2407.03157,

work page arXiv

[10]

Hu, J., Huang, W., Wang, W., Wang, H., Hu, T., Zhang, Q., Feng, H., Chen, X., Shan, Y., and Xie, T. Epic: Efficient position-independent caching for serving large language models. arXiv preprint arXiv:2410.15332.

[11]

Huang, S., Jiang, Z., Dong, H., Qiao, Y., Gao, P., and Li, H. Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint arXiv:2305.11176, 2023a.

Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models ...

[12]

Jin, C., Zhang, Z., Jiang, X., Liu, F., Liu, X., Liu, X., and Jin, X. Ragcache: Efficient knowledge caching for retrieval-augmented generation. arXiv preprint arXiv:2404.12457.

[13]

Kagaya, T., Yuan, T. J., Lou, Y., Karlekar, J., Pranata, S., Kinose, A., Oguri, K., Wick, F., and You, Y. Rap: Retrieval-augmented planning with contextual memory for multimodal llm agents. arXiv preprint arXiv:2402.03610.

[14]

Kang, J., Laroche, R., Yuan, X., Trischler, A., Liu, X., and Fu, J. Think before you act: Decision transformers with working memory. arXiv preprint arXiv:2305.16338.

[15]

Li, J., Li, G., Li, Y., and Jin, Z. Structured chain-of-thought prompting for code generation. ACM Transactions on Software Engineering and Methodology, 34(2): 1–23, 2025a.

Li, Y., Wang, L., Piao, S., Yang, B.-H., Li, Z., Zeng, W., and Tsung, F. Mccoder: streamlining motion control with llm-assisted code generation and rigorous verification. arXiv pre...

[16]

Li, Z., Xie, Y., Shao, R., Chen, G., Jiang, D., and Nie, L. Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy. arXiv preprint arXiv:2502.19902, 2025b.

Liang, C., Feng, Z., Liu, Z., Jiang, W., Xu, J., Chen, Y., and Wang, Y. Textualized agent-style reasoning for complex tasks by multiple round llm generation. arXiv prep...

[17]

Liu, Q., Hong, Z., Li, P., Chen, F., and Guo, S. Mell: Memory-efficient large language model serving via multi-gpu kv cache management. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications, pp. 1–

[18]

Lu, S., Wang, H., Rong, Y., Chen, Z., and Tang, Y. Turborag: Accelerating retrieval-augmented generation with precomputed kv caches for chunked text. arXiv preprint arXiv:2410.07590.

[19]

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.

[20]

Padmakumar, A., Thomason, J., Shrivastava, A., Lange, P., Narayan-Chen, A., Gella, S., Piramuthu, R., Tur, G., and Hakkani-Tur, D. Teach: Task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2017–2025.

[21]

Rana, K., Xu, M., Tidd, B., Milford, M., and Sünderhauf, N. Residual skill policies: Learning an adaptable skill-based action space for reinforcement learning for robotics. In Conference on Robot Learning, pp. 2095–2104. PMLR.

[22]

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

[23]

Sarch, G., Wu, Y., Tarr, M. J., and Fragkiadaki, K. Open-ended instructable embodied agents with memory-augmented large language models. arXiv preprint arXiv:2310.15127.

[24]

Sarch, G., Somani, S., Kapoor, R., Tarr, M. J., and Fragkiadaki, K. Helper-x: A unified instructable embodied agent to tackle four interactive vision-language domains with memory-augmented language models. arXiv preprint arXiv:2404.19065.

[25]

Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302.

[26]

Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., and Alvarez, J. M. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533.

[27]

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

[28]

Xu, W., Mei, K., Gao, H., Tan, J., Liang, Z., and Zhang, Y. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.

[29]

Zhao, S., Hu, J., Huang, R., Zheng, J., and Chen, G. Mpic: Position-independent multimodal context caching system for efficient mllm serving. arXiv preprint arXiv:2502.01960.

[30]

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

[31]

Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., Xu, R., Wu, Y., Li, Y., Gao, H., Ma, S., et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931.

[32]

Move a plunger to the cabinet

This diversity makes ALFRED a particularly suitable benchmark for evaluating agents on hierarchical reasoning, multi-step planning, and generalization across varied contexts. Environment. To better emulate open-domain deployment, we design three long-horizon evaluation scenarios on top of ALFRED. (1) In open-composition, tasks are organized into a curriculu...

[33]

Water the plant

spans 109 unique scenes across all 30 kitchens and most of the 30 living rooms, bedrooms, and bathrooms in AI2-THOR, comprising 3215 successful human-human gameplay sessions with rich conversational data (~45k utterances averaging 13.67 per session) and long action trajectories (averaging 131.8 Follower actions per session). The dataset covers 12 task fam...

[34]

reasoning

introduces position-independent caching with the LegoLink algorithm to enable modular KV reuse beyond prefix matching, significantly improving LLM serving efficiency without sacrificing accuracy. Unless otherwise noted, all baselines use the same CodeLLM configuration (max new tokens = 2048, temperature = 0.0, i.e., ...

[35]

Table 12. Hyperparameters (decoding and framework-level).

Model generation hyperparameters:
max new tokens: 2048
temperature: 0.0 (greedy)
top-k, top-p: N/A (greedy; not used)

Framework-level hyperparameters:
Eviction threshold (perplexity-based): τ = 15.0
Locality weights (α: temporal, β: spatial, γ: semantic): α = 0.4, β = 0.3, γ = 0.3
Execution trace length: 20
Temporal de...

[36]

It compares FCGRAFT across different CodeLLM sizes and cache-patching configurations. We evaluate the default setting (CoT-guided cache-patching) against two ablated variants: without CoT and with expert guidance replaced. As model size increases from 3B to 14B, FCGRAFT consistently achieves higher SR (26.57%, 44.50%, 53.75%) and GC (38.36%, 58.14%, 66.54%), ...