Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

Bojie Li

arxiv: 2606.17107 · v1 · pith:GTZMOAMTnew · submitted 2026-06-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 03:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords KV cache · prefill · prefix caching · editable cache · composable cache · chain-of-thought · attention mechanisms · language model inference


The pith

KV caches record field-conditioned conclusions at prefill, so the original field's vectors drive under 1% of later decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that during prefill a model writes the implications of each input field into later positions of the KV cache, turning those later entries into the primary carriers of the decision. The field's own key-value pair contributes negligibly once the notes are written. Treating the cache as a notebook of memoized conclusions makes two operations possible: an appended erratum can amend the relevant notes to correct the output, and pre-written notes can be repositioned and spliced into new contexts. Both operations cost far less than recomputing the full sequence while producing nearly identical logits.

Core claim

At prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, the KV cache is therefore editable by amending the notes and composable by RoPE-repositioning and splicing the notes into new contexts.
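The "RoPE-repositioning" step can be illustrated with a minimal sketch. It assumes the textbook rotary embedding with adjacent-pair rotation and base 10000; the function names `rope_rotate` and `reposition` are hypothetical, and real model implementations may pair dimensions differently. The point it shows is only that rotations compose additively, so a key cached at one position can be moved to another by rotating it through the offset, without access to the un-rotated key.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d).

    Assumption: textbook RoPE with adjacent-pair layout; `vec` has even length.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reposition(cached_key, old_pos, new_pos, base=10000.0):
    """Move an already-rotated key from old_pos to new_pos.

    Rotating by the offset (new_pos - old_pos) is enough because
    R(a) applied after R(b) equals R(a + b).
    """
    return rope_rotate(cached_key, new_pos - old_pos, base)

# Hypothetical 8-dimensional key, cached at position 5, spliced in at position 1200.
raw_key = [0.3, -1.2, 0.7, 0.05, -0.4, 0.9, 1.1, -0.6]
cached = rope_rotate(raw_key, pos=5)
moved = reposition(cached, old_pos=5, new_pos=1200)
direct = rope_rotate(raw_key, pos=1200)
max_err = max(abs(a - b) for a, b in zip(moved, direct))
print(f"max abs difference: {max_err:.2e}")
```

This covers only the positional part of a key. Whether the content of the repositioned notes still behaves like a full recompute in the new context is the empirical question the paper measures.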

What carries the argument

Downstream KV-cache entries that hold the conclusions memoized during prefill; these entries, not the original field's vectors, determine the model's output.

If this is right

With chain-of-thought, editing the field alone recovers the decision (accuracy 1.00 at 8B) at roughly 1 percent of the compute; without chain-of-thought the same edit is ignored.
Precompiled notes can be spliced into arbitrary contexts at O(L) cost and produce logits whose cosine similarity to full recompute lies between 0.90 and 0.999 across twelve models.
A single edit-plus-compose agent stays decision-identical to recompute while reducing time-to-first-token by up to 14.9 times.
Because the erratum is append-only, it keeps a 98.5 percent prefix-cache hit rate in an online vLLM benchmark and cuts p90 time-to-first-token by 53-398 times.
The same edit-and-compose behavior holds under quantization, Mixture-of-Experts routing, multimodal inputs, and several attention variants after small adapters.
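The O(L) versus O(L^2) contrast in the list above can be made concrete with a back-of-the-envelope count. This is a sketch under simplifying assumptions: it counts causal query-key pairs for one head in one layer, treats a splice as one re-rotation per cached key, and ignores everything else a real serving stack does. The function names are hypothetical, and the ratios are operation counts, not the measured speedups the paper reports.

```python
def full_prefill_pairs(context_len, skill_len):
    """Query-key pairs scored under causal attention when prefilling from scratch."""
    n = context_len + skill_len
    return n * (n + 1) // 2  # token i attends to i tokens (itself and all earlier ones)

def splice_ops(skill_len):
    """Work to paste a precompiled skill: one re-rotation per cached key."""
    return skill_len

for skill_len in (1_000, 8_000, 32_000):
    ratio = full_prefill_pairs(0, skill_len) / splice_ops(skill_len)
    print(f"L={skill_len}: pair-count ratio {ratio:.1f}")
```

The ratio grows linearly with L, which is the sense in which the gap between splicing and reprefilling widens on longer skills.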

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems could maintain libraries of reusable prefilled modules that are spliced on demand rather than recomputed.
Error correction and fact updating could be performed by appending short errata instead of regenerating entire contexts.
The same notebook view may apply to other per-token state caches once the causal isolation experiments are repeated for those architectures.

Load-bearing premise

The causal interventions across four model families correctly isolate that the downstream cache entries carry the decision and that RoPE-repositioned splices remain indistinguishable from full recompute.

What would settle it

A controlled edit of only the downstream notes that leaves the model output unchanged, or a RoPE-repositioned splice whose logits show cosine similarity below 0.90 to the corresponding full recompute.
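The falsification test above turns on a logit-cosine threshold, which is cheap to check. The sketch below assumes two next-token logit vectors are already available as plain lists; the vectors and the names `logit_cosine` and `same_decision` are hypothetical, not taken from the paper. It also checks the argmax, because a high cosine does not by itself guarantee the same decision.

```python
import math

def logit_cosine(a, b):
    """Cosine similarity between two logit vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def same_decision(a, b):
    """True when both logit vectors pick the same top token."""
    top_a = max(range(len(a)), key=a.__getitem__)
    top_b = max(range(len(b)), key=b.__getitem__)
    return top_a == top_b

# Hypothetical logits: full recompute vs. a spliced cache.
full = [2.0, -1.0, 0.5, 4.0]
spliced = [1.9, -0.8, 0.6, 3.7]
cos = logit_cosine(full, spliced)
print(f"cosine={cos:.4f} passes_0.90={cos >= 0.90} same_decision={same_decision(full, spliced)}")
```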

Figures

Figures reproduced from arXiv: 2606.17107 by Bojie Li.

**Figure 1.** Models take notes at prefill. (a) At prefill the model memoizes the field-conditioned conclusion onto downstream aggregator tokens (orange); at decode the decision reads those notes (blue). (b) Consequently, surgically editing the field's own KV is ignored (without a reasoning chain), but the decision is recovered cheaply: recompute the affected downstream suffix, or (cheaper and robust) append a salient e…

**Figure 2.** Editing and composing, previewed (Qwen3-8B unless noted; detail in Sections 3, 4 and 6). (a) KV editing landscape: naive edits (stale, field-only without CoT, CacheBlend) fail, while the append-only erratum/field+erratum reach full-reprefill correctness cheaply and robustly (hoisting also works but needs prompt surgery). (b) Recompute the affected notes: recovery climbs as more of the post-field affected …

**Figure 3.** Four causal probes for memoized inference. (a) Refreshing the field's own KV recovers ≈ 0 of the decision; recomputing the downstream recovers it fully. (b) Recovery is suffix-concentrated, accruing only as many post-field tokens are patched. (c) The decision reads almost entirely from downstream notes, not the field token. (d) Once the value is present, override wording is redundant and "re-evaluate" phr…

**Figure 4.** Two cache operations. (a) Editable, two ways to amend the stale notes: field+selective recomputes the field plus the top-K highest-effect downstream notes (O(K); recomputed cells outlined in red, the rest reused stale), while the erratum appends one salient correction (O(1), the whole prefix is reused). (b) Composable: precompute a skill's KV in isolation, RoPE-reposition it to the target positions, and s…

**Figure 5.** Editing is model-dependent (the editing landscape and the chain-of-thought split are previewed in …

**Figure 6.** Composing the cache. (a) Pasting a precompiled skill is O(L) vs. full reprefill's O(L^2); TTFT speedup reaches 13.9× at 32k tokens. (b) The transplanted skill matches full recompute in next-token logits (cosine 0.90–0.999) across the model family.

… family (twelve models; full roster in Section A): Qwen3-1.7B through 32B (including FP8 and the 30B-A3B Mixture-of-Experts), Gemma-2/3, Mistral-7B, Llama-3.1-8B and…

**Figure 7.** Edit and compose can be combined. (a) Editing inside a transplant: a field edited inside a transplanted skill reproduces the editing mechanism, and the composed cache matches the recomputed cache for every method (points on the diagonal). (b) A unified edit+compose agent over thirteen models (10 domains × 10 instances, 300 decisions; 120 on the two Gemma models): unified-vs-full agreement (bars) and cumula…

**Figure 8.** E1: placement and the pre-digestion cost. Decision accuracy vs. integration depth (n_facts) for memory read early (solid) vs. late (dashed), across all evaluated models, under (a) direct answering and (b) chain-of-thought. Direct answering is at chance for the reasoning-native Qwen3 models (so CoT is the operative regime); under CoT, late placement forgoes prefill-time pre-digestion and carries a small ac…

**Figure 9.** E2: memory transplant is faithful, to 70B. (a) Decision agreement with full recompute under late placement; a single seam-repair token closes the start-of-chunk gap across ten models. (b) The re-rotation is necessary: the naive no-rotate control collapses, while rotated+1 seam token matches full recompute.

**Figure 10.** E3: memory is editable mid-session. Reusing stale memory recovers a toggled fact essentially never; every real edit recovers it. Consistent with Section 4, the near-free in-place edit suffices under chain-of-thought and strengthens with scale, with the append-only erratum as the robust fallback.

… cosine 0.997, and late placement beats early (0.93 vs. 0.83) exactly as the mechanism predicts (the decode rea…

**Figure 11.** LoCoMo: external validity on real conversations. Over all 1,540 answerable questions per model, transplanting the multi-session dialogue memory is statistically equivalent to full recompute in QA accuracy (TOST) on the Qwen3 models and within 2.7 points on Llama-3.1-8B.

Editing inside transplanted memory reproduces the mechanism. Editing a field inside a transplanted memory chunk in direct mode (Llama-3…

**Figure 12.** End-to-end agent. (a) Per-decision median TTFT for the proposed compose+edit agent vs. oracle, front, and end-placement baselines. (b) Cumulative TTFT speedup of 2.3–4.3× over reprefill-every-turn, growing with model size, at faithful next-token decisions.

**Figure 13.** Applicability to multimodal and new attention mechanisms. (a) The operations work on any per-token attention KV representation; we map each attention variant as free / adapter / config fix / partial / open / out-of-scope. (b) Image-KV transplant is near-lossless across vision-language models; images are position-portable too.

… Adapter (implemented, validated): representation changes (diagrammed in …

**Figure 14.** Systems payoff. (a) Online vLLM serving (V1 engine, CUDA graphs, continuous batching, APC, Poisson arrivals): the append-only erratum keeps the prefix cache-aligned (98.5% vs. 1% APC hit-rate), so its throughput advantage grows with offered load, up to 14.5× at saturation, while p90 TTFT is 53–398× lower. (b) Reusing a cached image (skipping the vision tower and image-token prefill) accelerates time-to-fir…

**Figure 15.** A component-level circuit for memoized inference (deep dives Llama-3.1-8B; head, write, and scrubbing panels show all four families). (a) Cumulative decision recovery from jointly patching the top-k named read heads (concentrated, ∼0.78) vs. write heads (distributed). (b) A leave-scenario-out difference-of-means conclusion direction causally transfers the decision far above a random 1-D direction. (c) At …

**Figure 17.** field+selective@K across models

**Figure 18.** Erratum recovery under reasoning across attention, sliding-window, hybrid, and pure …

**Figure 20.** MLA adapter fidelity (a) and transplant TTFT speedup across models (b).

**Figure 21.** Image-KV transplant by task category (perception / reasoning / agentic), full re-encode …

**Figure 22.** The attention-variant adapters. (a) MLA caches a position-free latent c_t plus a small decoupled-RoPE sub-vector k_pe; repositioning re-rotates only k_pe and copies the latent as-is. (b) M-RoPE factors the rotary channels into temporal/height/width axes, sectioned (Qwen2.5-VL) or interleaved (Qwen3-VL); moving an image within a trajectory re-rotates only the temporal channels, since the spatial axes are int…

**Figure 23.** No compounding error over a long trajectory. A gated field toggles every turn for 28 turns; one evolving leave-stale+erratum cache vs. full reprefill of the identical text. (a) Per-turn decision agreement stays high with boundary noise (no downward trend). (b) The decision-logit cosine stays flat at 0.99+, with no drift over trajectory length.

… recompute accuracy 0.85 (n=80, balanced); splitting drops agreement w…

**Figure 24.** E4: granularity is a free knob (Section 7). Splitting memory into S independently precompiled blocks makes a localized edit S× cheaper and stays decision-lossless to S=16; only genuinely cross-referential facts must be kept in one block (cross-referential test above).

… as a Poisson process at offered rates {2,4,8,16} req/s and unthrottled (saturation). We timestamp first-token and completion per request …

Original abstract

Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
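Why an append-only erratum stays compatible with prefix caching, while an in-place field edit does not, can be seen in a toy model. The sketch below is an assumption-laden illustration, not vLLM's implementation: it hashes fixed-size token blocks with a chained SHA-256, so each block's hash depends on everything before it. The block size, token strings, and function names are all hypothetical.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical; real engines use larger blocks)

def block_hashes(tokens):
    """Chained hashes over full blocks: a block's hash depends on all earlier tokens."""
    hashes, prev = [], b""
    usable = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for start in range(0, usable, BLOCK):
        chunk = " ".join(tokens[start:start + BLOCK]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

def hit_rate(cached_tokens, new_tokens):
    """Fraction of the new request's full blocks found in the cache as a prefix."""
    cached = set(block_hashes(cached_tokens))
    new = block_hashes(new_tokens)
    hits = 0
    for h in new:
        if h not in cached:
            break  # chained hashing: every later block misses as well
        hits += 1
    return hits / len(new) if new else 0.0

original = "sys a b c tier = gold d e f g h i j k l".split()

# In-place edit: change the field value inside the prompt.
edited = list(original)
edited[6] = "silver"

# Append-only erratum: leave the prompt untouched and add a correction at the end.
appended = original + ["erratum:", "tier", "=", "silver"]

in_place_rate = hit_rate(original, edited)
erratum_rate = hit_rate(original, appended)
print(in_place_rate, erratum_rate)
```

In this toy setup the in-place edit invalidates the block containing the field and every block after it, while the appended erratum reuses every cached block and only adds new ones.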

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core claim is that KV caches memoize field-conditioned conclusions at prefill, enabling direct edits and RoPE-based composition with big latency wins, but the causal isolation still needs more controls.


The main thing to know is that this work treats the KV cache as a notebook of already-computed conclusions rather than raw token representations. At prefill the model writes the field's effect into downstream positions, so the field's own KV drives under 1% of the decision: overwriting it alone leaves the model acting on the old value, while amending the downstream notes recovers the decision. They show this holds across four model families, and that RoPE-repositioned splices match full recompute on logit cosine (0.90-0.999, twelve models) at O(L) cost instead of O(L^2). In the online vLLM benchmark the append-only erratum keeps the prefix-cache hit rate at 98.5% while cutting p90 TTFT by 53-398x; separately, the unified edit+compose agent runs at up to 14.9x lower latency.

What the paper does well is demonstrate practical compatibility with existing prefix-caching systems and report cross-scale, cross-quantization, and multimodal checks. The edit-plus-compose agent result is a concrete efficiency claim that follows directly from the notebook view.

The soft spots are mostly around experimental transparency. The abstract gives recovery rates and cosine numbers but no error bars, exclusion criteria, or layer-wise activation checks. The stress-test point about possible side effects on attention patterns or intermediate layers is fair; logit cosine alone does not rule out that the interventions alter other parts of the forward pass. If the full paper has those controls and shows the effect is isolated to the downstream notes, the claim strengthens. Without them the causal story remains suggestive rather than airtight.
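To make the request for error bars concrete, here is one common way to attach an interval to a decision-agreement rate: a percentile bootstrap. This is a generic sketch, not the paper's analysis. The outcome list is invented purely for illustration, and `bootstrap_ci` is a hypothetical helper.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# HYPOTHETICAL data: 1 = edited cache agreed with full recompute, 0 = it did not.
outcomes = [1] * 92 + [0] * 8
point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"agreement {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting an interval like this next to each recovery or agreement figure would let a reader judge whether differences between methods exceed sampling noise.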

This is for inference engineers and researchers focused on long-context or agentic LLM serving who already use prefix caching. It deserves a serious referee because the latency numbers are large enough to matter if the mechanism holds, and the work engages directly with deployed systems rather than staying purely theoretical.

Referee Report

2 major / 2 minor

Summary. The paper claims that during prefill, transformer models write field-conditioned conclusions into downstream KV cache positions such that the original field's own key/value vectors drive under 1% of the final decision. This 'notebook' view enables two capabilities: (1) editability, where overwriting a field's KV and using chain-of-thought recovers the decision at 1.00 (8B scale, ~1% compute) while non-CoT ignores the edit; (2) composability, where RoPE-repositioned splices of precompiled notes are indistinguishable from full recompute (logit cosine 0.90-0.999 across twelve models) at O(L) rather than O(L^2) cost. The approach is validated across scale, quantization, MoE, multimodal caches, attention variants, and an online vLLM benchmark showing 53-398x p90 TTFT reduction at 98.5% prefix-cache hit rate.

Significance. If the causal isolation holds, the result offers a practical route to editable and composable KV caches that composes with existing prefix caching, delivering large latency gains while preserving decision fidelity. The cross-family empirical validation and production benchmark are concrete strengths.

major comments (2)

[Abstract / causal-experiments section] The claim that downstream notes carry the decision (original KV <1%) and that RoPE-repositioned splices remain equivalent to recompute rests on interventions across four model families, yet no error bars, exclusion criteria, or explicit controls for side effects on attention patterns or intermediate activations are reported. Logit cosine alone may not rule out such confounds, and ruling them out is load-bearing for the notebook interpretation.
[Composability results] For the logit-cosine results (0.90-0.999), the manuscript must demonstrate that the splicing procedure leaves attention masks, layer norms, and non-KV activations unchanged; without those controls the equivalence to full recompute cannot be taken as evidence that only the memoized notes matter.

minor comments (2)

[Abstract] Abstract states recovery '1.00 at 8B' and 'under 1%' without defining the exact decision metric or baseline comparison used for the percentage.
[vLLM benchmark] The vLLM benchmark reports 98.5% hit-rate and 53-398x TTFT reduction; clarify whether these figures include the overhead of the edit/compose agent or are measured end-to-end.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that strengthening the presentation of causal evidence and composability controls will improve the manuscript and will incorporate the requested additions in revision.


Referee: [Abstract / causal-experiments section] The claim that downstream notes carry the decision (original KV <1%) and that RoPE-repositioned splices remain equivalent to recompute rests on interventions across four model families, yet no error bars, exclusion criteria, or explicit controls for side effects on attention patterns or intermediate activations are reported; logit cosine alone may not rule out such confounds, which is load-bearing for the notebook interpretation.

Authors: We agree that error bars, explicit exclusion criteria, and controls for side effects on attention patterns and intermediate activations would strengthen the causal claims. The current results show consistent behavior across four model families and multiple scales, which we view as mitigating model-specific confounds, but this does not substitute for the requested statistical and control analyses. In the revised manuscript we will add error bars from repeated trials where applicable, clarify model and task exclusion criteria, and include direct comparisons of attention patterns and selected intermediate activations before and after interventions.
Revision: yes

Referee: [Composability results] For the logit-cosine results (0.90-0.999), the manuscript must demonstrate that the splicing procedure leaves attention masks, layer norms, and non-KV activations unchanged; without those controls the equivalence to full recompute cannot be taken as evidence that only the memoized notes matter.

Authors: The splicing procedure replaces only the KV entries for the repositioned segments while preserving overall sequence length, token positions for non-spliced content, and the original attention mask structure; layer norms are likewise untouched because they operate on the same hidden states. Non-KV activations are recomputed identically to a full forward pass. We will add explicit verification in the revision by reporting attention mask equality, layer-norm statistics, and selected non-KV activation comparisons between spliced and full-recompute runs to confirm invariance.
Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical causal interventions, not derivations or self-referential fits


The paper presents no derivation chain, equations, or fitted parameters that reduce to inputs by construction. Central claims (downstream notes carrying decisions, editability, composability via RoPE splicing) are justified by reported measurements: causal experiments across four model families showing <1% contribution from the original KV, logit cosine 0.90-0.999 for splices vs. recompute, and latency benchmarks. There is no load-bearing self-citation, no ansatz smuggled in via citation, no uniqueness theorem, and no renaming of known results as new derivations. The approach is checked against external benchmarks (multiple models, quantization, MoE, multimodal) with falsifiable empirical tests. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work is framed as an empirical discovery rather than a derivation resting on new theoretical assumptions.

pith-pipeline@v0.9.1-grok · 5830 in / 1168 out tokens · 43547 ms · 2026-06-27T03:34:26.692150+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 12 canonical work pages · 9 internal anchors

[1]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-bench: Eval- uating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

DeepSeek-V3 Technical Report

DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024. 19 Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

DeepSeek-V4 technical report, 2026.https://huggingface.co/ deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

DeepSeek-AI. DeepSeek-V4 technical report, 2026.https://huggingface.co/ deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

2026
[6]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Prompt cache: Modular attention reuse for low-latency inference

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. InProceedings of Machine Learning and Systems (MLSys), 2024

2024
[8]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InConference on Language Modeling (COLM), 2024

2024
[10]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

2022
[11]

EPIC: Efficient position-independent caching for serving large language models

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. EPIC: Efficient position-independent caching for serving large language models. InProceedings of the 42nd International Con- ference on Machine Learning (ICML), 2025

2025
[12]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, et al. Editing models with task arithmetic. InInternational Conference on Learning Representations (ICLR), 2023

2023
[13]

Mistral 7B

Albert Q. Jiang et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, et al. RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

work page arXiv 2024
[15]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[16]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, et al. SnapKV: LLM knows what you are looking for before generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[17]

On the biology of a large language model.Transformer Circuits Thread, Anthropic, 2025

Jack Lindsey et al. On the biology of a large language model.Transformer Circuits Thread, Anthropic, 2025

2025
[18]

CacheSlide: Unlocking cross position-aware KV cache reuse for accel- erating LLM serving

Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. CacheSlide: Unlocking cross position-aware KV cache reuse for accel- erating LLM serving. InProceedings of the 24th USENIX Conference on File and Storage Technologies (FAST), 2026. 20 Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

2026
[19]

CacheGen: KV cache compression and streaming for fast large language model serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, et al. CacheGen: KV cache compression and streaming for fast large language model serving. InProceedings of ACM SIGCOMM, 2024

2024
[20]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, et al. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[21]

Evaluating very long-term conversational memory of LLM agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InProceed- ings of the Association for Computational Linguistics (ACL), 2024

2024
[22]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022
[23]

Mass- editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass- editing memory in a transformer. InInternational Conference on Learning Representations (ICLR), 2023

2023
[24]

RWKV: Reinventing RNNs for the transformer era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, et al. RWKV: Reinventing RNNs for the transformer era. InFindings of the Association for Computational Linguistics: EMNLP 2023, 2023

2023
[25]

YaRN: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024
[26]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[28]

RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568, 2024

2024
[29]

Quest: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context LLM inference. InInternational Conference on Machine Learning (ICML), 2024

2024
[30]

Function vectors in large language models

Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. InInternational Conference on Learning Representations (ICLR), 2024

2024
[31]

Investigating gender bias in lan- guage models using causal mediation analysis

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, et al. Investigating gender bias in lan- guage models using causal mediation analysis. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020
[32]

Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023. 21 Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

2023
[33]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representa- tions (ICLR), 2024

2024
[34]

KVLink: Accelerating large language models via efficient KV cache reuse

Jingbo Yang et al. KVLink: Accelerating large language models via efficient KV cache reuse. arXiv preprint arXiv:2502.16002, 2025

work page arXiv 2025
[35]

CacheBlend: Fast large language model serving for RAG with cached knowledge fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. InProceedings of the European Conference on Computer Systems (EuroSys), 2025

[36] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, et al. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[37] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[38] Yuan Zeng, Pengfei Zuo, Min Lyu, Xingkun Yang, Huatao Wu, Yinlong Xu, and Zhou Yu. KVCache-centric memory for LLM agents, 2025. Submitted to ICLR 2026; OpenReview, 18 September 2025.

[39] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[40] Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, and Guihai Chen. MPIC: Position-independent multimodal context caching system for efficient MLLM serving. arXiv preprint arXiv:2502.01960, 2025.

Table 2: Model zoo. “role” indicates the experiments a model appears in. model family /...


