Residual-Mass Accounting for Partial-KV Decoding

Daisuke Miyashita; Jun Deguchi; Yasuto Hoshi

arxiv: 2604.05438 · v2 · submitted 2026-04-07 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-10 20:11 UTC · model grok-4.3

keywords partial-KV decoding · residual mass accounting · efficient attention · long-context inference · KV cache optimization · learned feature maps · Top-K retrieval

The pith

A residual-mass accounting rule using learned feature maps improves partial-KV decoding over pure Top-K selection at low exact-support budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a partial-KV decoding setup where exact softmax contributions are computed only for a small retrieved token set plus sink and tail anchors, while the bulk of prefill tokens are handled via a residual estimate. It introduces an accounting method that constructs fixed-size summary states from learned positive feature maps, subtracts the feature contributions of the retrieved tokens to prevent overlap, and merges the residual estimate with the exact branch under a single normalization. The backbone model and its exact KV tensors remain unchanged. At a 1% exact-support budget the method outperforms a selection-only Top-K baseline on RULER and BABILong across 1B and 3B Llama-3.2-Instruct models at every reported context length, with the pattern largely holding in 0.5-4% budget sweeps and mixed but often favorable results on LongBench.

Core claim

In controlled partial-KV decoding, exact unnormalized softmax terms are kept for sink/tail anchors and a retrieved set while the remaining prefill tokens are summarized by a residual estimate. The accounting rule builds fixed-size summary states (S, u) from learned positive feature maps φ, subtracts the retrieved-token feature contributions to keep the exact and residual partitions disjoint, and combines the estimated residual numerator and denominator with the exact branch under one normalization. At a 1% exact-support budget this yields gains over Top-K on RULER and BABILong for frozen 1B and 3B Llama backbones at all reported context lengths; the 0.5-4% sweeps largely preserve the trend, LongBench summarization results are mostly favorable, and multi-document QA is mixed.

What carries the argument

Fixed-size summary states (S, u) constructed from learned positive feature maps φ, with explicit subtraction of retrieved-token feature contributions before merging the residual estimate with the exact branch under unified normalization.
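
The mechanism described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `phi` is a fixed elementwise-exponential stand-in for the learned positive feature map, scores are unscaled dot products, the summary is assumed to be built over a mid region that excludes the sink/tail anchors, and every function and variable name is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # ASSUMPTION: stand-in positive feature map; the paper's phi is learned.
    return [math.exp(t) for t in x]

def build_summary(keys, values, mid_idx):
    """Fixed-size states over the mid region:
    S[j][c] = sum_i phi(k_i)[j] * v_i[c],  u[j] = sum_i phi(k_i)[j]."""
    r, d_v = len(phi(keys[0])), len(values[0])
    S = [[0.0] * d_v for _ in range(r)]
    u = [0.0] * r
    for i in mid_idx:
        f = phi(keys[i])
        for j in range(r):
            u[j] += f[j]
            for c in range(d_v):
                S[j][c] += f[j] * values[i][c]
    return S, u

def merged_attention(q, keys, values, anchor_idx, retrieved_idx, S, u):
    """Exact branch + residual estimate under one normalization."""
    d_v = len(values[0])
    # Exact branch: unnormalized softmax terms for anchors and retrieved tokens.
    num, den = [0.0] * d_v, 0.0
    for i in list(anchor_idx) + list(retrieved_idx):
        w = math.exp(dot(q, keys[i]))
        den += w
        for c in range(d_v):
            num[c] += w * values[i][c]
    # Subtract retrieved-token features so the two partitions do not overlap.
    S_res = [row[:] for row in S]
    u_res = u[:]
    for i in retrieved_idx:
        f = phi(keys[i])
        for j in range(len(u_res)):
            u_res[j] -= f[j]
            for c in range(d_v):
                S_res[j][c] -= f[j] * values[i][c]
    # Residual numerator and denominator estimated from the summary.
    fq = phi(q)
    res_num = [sum(fq[j] * S_res[j][c] for j in range(len(fq))) for c in range(d_v)]
    res_den = dot(fq, u_res)
    return [(num[c] + res_num[c]) / (den + res_den) for c in range(d_v)]

if __name__ == "__main__":
    q = [0.5, -0.3]
    keys = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4], [0.6, 0.1]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [0.5, 0.5]]
    S, u = build_summary(keys, values, mid_idx=[1, 2, 3])
    print(merged_attention(q, keys, values, [0, 4], [2], S, u))
```

When every mid token is retrieved, the subtracted summary is zero up to rounding and the merged output reduces to full softmax attention, which gives a simple sanity check on the bookkeeping.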

If this is right

At 1% exact-support budget the residual-completion method outperforms selection-only Top-K on RULER and BABILong for frozen 1B and 3B Llama-3.2-Instruct models at every reported context length.
The performance trend largely persists across 0.5-4% exact-support budget sweeps.
On LongBench the approach is mostly favorable for summarization tasks and mixed for multi-document QA.
Attention-output diagnostics support retrieved-token subtraction as the partition-consistent accounting rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Improving the fidelity of the φ approximation would directly address the main remaining error source identified in the diagnostics.
The unchanged backbone and exact KV tensors allow the accounting rule to be layered on top of any retrieval selector beyond exhaustive Top-K.
Because the method preserves the original model weights, it can be inserted into existing long-context inference stacks without retraining the language model itself.

Load-bearing premise

The learned positive feature maps φ accurately approximate the unretrieved residual mass after subtraction of retrieved-token contributions.

What would settle it

Recompute the full attention outputs using exact residual mass instead of the learned φ estimate and check whether the accuracy advantage over Top-K disappears.
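
A toy version of this control can be sketched as follows. It is an illustration only: the residual numerator and denominator are computed exactly from the unretrieved tokens instead of from φ, the selection-only baseline is assumed to renormalize over the exact set alone, scores are unscaled dot products, and all names and data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_terms(q, keys, values, idx):
    """Unnormalized softmax numerator (vector) and denominator (scalar) over idx."""
    d_v = len(values[0])
    num, den = [0.0] * d_v, 0.0
    for i in idx:
        w = math.exp(dot(q, keys[i]))
        den += w
        for c in range(d_v):
            num[c] += w * values[i][c]
    return num, den

def attend(q, keys, values, exact_idx, residual=None):
    """Attention over exact_idx, optionally merged with a (num, den) residual."""
    num, den = exact_terms(q, keys, values, exact_idx)
    if residual is not None:
        r_num, r_den = residual
        num = [a + b for a, b in zip(num, r_num)]
        den += r_den
    return [x / den for x in num]

def rel_l1(approx, ref):
    """Relative l1 error of an attention output against a reference."""
    return sum(abs(a - b) for a, b in zip(approx, ref)) / sum(abs(b) for b in ref)

q = [0.5, -0.3]
keys = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4], [0.6, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [0.5, 0.5]]
exact_idx, rest_idx = [0, 2, 4], [1, 3]

full = attend(q, keys, values, range(len(keys)))
selection_only = attend(q, keys, values, exact_idx)
oracle = attend(q, keys, values, exact_idx,
                residual=exact_terms(q, keys, values, rest_idx))

print("selection-only error:", rel_l1(selection_only, full))
print("oracle-residual error:", rel_l1(oracle, full))
```

With an exact residual the merged output matches full attention up to rounding, so any gap that remains when φ is used instead is attributable to the φ approximation rather than to the accounting rule.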

Figures

Figures reproduced from arXiv: 2604.05438 by Daisuke Miyashita, Jun Deguchi, Yasuto Hoshi.

**Figure 1.** Kernel approximation diagnostic (log–log scatter) for four decoding variants. Each point is a sampled query–key pair; the x-axis is the teacher kernel $\kappa_i = \exp(s_i)$ and the y-axis is the kernel value used by each variant. Instantiated with Llama-3.2-1B-Instruct at layer 0 and query head 7. The diagonal ($y = x$) indicates perfect agreement. (a) All-exact lies on the diagonal. (b) Sink/tail+Top-K matches the t…

**Figure 2.** Budget-matched completion gain vs. mid-normalized attention entropy. Gain is $e_{\mathrm{sel}} - e_{\mathrm{hyb}}$, measured by mean relative $\ell_1$ error of the attention output under a fixed token-equivalent KV-read budget (Appendix H.3). Gains concentrate in high-entropy (diffuse) heads where Top-K misses substantial mid mass.

**Figure 3.** Trade-off map of Top-K+$\phi$ vs. Top-K-only in offloaded decode. Heatmaps show $\mathrm{speedup}(\xi, \gamma) = t_{\mathrm{topk}}(\xi, k = \gamma \cdot k_\phi) / t_{\mathrm{topk}\phi}(\xi, k = k_\phi)$ over the I/O slowdown factor $\xi$ (x-axis) and the quality-matched retrieval ratio $\gamma = k^{\star}_{\mathrm{topk}} / k_\phi$ (y-axis), with $k_\phi$ fixed to 1% of the prefill length. For reference, our Llama RULER 16k results indicate regimes with $\gamma \approx 1.8$. Top-K-only timings at intermediate $k$ are obtained by int…

**Figure 4.** M-entropy predicts where completion helps. (a) Head×layer map of M-normalized attention entropy $H_{\mathrm{mid}}$ (mean over recent queries; shown in %). (b) Head-wise scatter of $H_{\mathrm{mid}}$ vs. decode $\phi$ completion mass share $\rho_Z$ (each point is a head in a layer; color indicates layer).

**Figure 5.** Mid-region mass curves $C_M(K)$ for representative heads. Each curve plots the mean fraction of attention mass within the mid region $M$ captured by the Top-K keys selected from $M$ (averaged over the recent 10 queries). Slower growth implies substantial residual mid mass beyond Top-K, motivating completion.

**Figure 6.** Budget-matched errors by mid-entropy quartile. We group (layer, head) pairs into quartiles by mid-region entropy $H_{\mathrm{mid}}$ and sweep the token-equivalent KV-read budget $n$. Panels (a)–(d) correspond to increasing $H_{\mathrm{mid}}$ quartiles (0–3). For both methods, error decreases with larger budgets. Completion helps primarily in high-entropy quartiles: when $H_{\mathrm{mid}}$ is low, Top-K largely determines accuracy and Hybrid matches …

**Figure 7.** Fixed-K breakdown on Llama-3.2-1B-Instruct (same-K, $\phi$ read cost ignored). We plot head-wise scatter against mid-region entropy $H_{\mathrm{mid}}$: (a) selection-only Top-K error, (b) Hybrid error with the same retrieved set, and (c) the resulting gain. Both errors increase with $H_{\mathrm{mid}}$, reflecting that retrieval becomes harder for diffuse mid attention. However, Hybrid consistently achieves lower error in the high-entropy …

**Figure 8.** Fixed-K breakdown on Qwen3-1.7B (same-K, $\phi$ read cost ignored). We plot head-wise scatter against mid-region entropy $H_{\mathrm{mid}}$: (a) selection-only Top-K error, (b) Hybrid error with the same retrieved set, and (c) the resulting gain. Qwen3 exhibits smaller Top-K-only errors even in high-entropy heads, so completion has limited headroom; mismatched $\phi$ can therefore cause a larger fraction of heads with negative ga…
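
The two diagnostics that recur in these captions, mid-region entropy and the Top-K mass curve, can be sketched for a single query as below. The exact definitions live in the paper's appendix and are not reproduced on this page, so this sketch assumes that attention is renormalized within the mid region M, that the entropy is divided by log |M| so it lies in [0, 1], and that C_M(K) is the cumulative mass of the K largest mid-region weights. Function names are hypothetical.

```python
import math

def mid_attention(scores_mid):
    """Softmax restricted to the mid region M (max-shifted for stability)."""
    m = max(scores_mid)
    w = [math.exp(s - m) for s in scores_mid]
    z = sum(w)
    return [x / z for x in w]

def normalized_entropy(p):
    """Shannon entropy divided by log|M|: 0 for one-hot, 1 for uniform.
    ASSUMPTION: this normalization is a guess at the paper's H_mid."""
    if len(p) < 2:
        return 0.0
    h = -sum(x * math.log(x) for x in p if x > 0.0)
    return h / math.log(len(p))

def topk_mass_curve(p):
    """C_M(K) for K = 1..|M|: fraction of mid mass held by the K largest weights."""
    curve, acc = [], 0.0
    for x in sorted(p, reverse=True):
        acc += x
        curve.append(acc)
    return curve

diffuse = mid_attention([0.1, 0.0, -0.1, 0.05, 0.0, -0.05])
peaked = mid_attention([6.0, 0.0, -0.1, 0.05, 0.0, -0.05])
print("diffuse head:", normalized_entropy(diffuse), topk_mass_curve(diffuse)[:2])
print("peaked head: ", normalized_entropy(peaked), topk_mass_curve(peaked)[:2])
```

A diffuse head shows entropy near 1 and a slowly rising curve, which is the regime where the captions report that completion helps; the paper averages these quantities over recent queries, which this single-query sketch omits.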

Original abstract

We study a controlled partial-KV decoding setting in which exact unnormalized softmax contributions are computed for sink/tail anchors and a retrieved token set, while the remaining prefill tokens are represented by a residual estimate. We focus on the accounting rule after the query-dependent exact support has been selected, and use exhaustive Top-K only as an oracle selector, not as a deployable retrieval system. The proposed rule leaves the backbone language model and the exact-branch KV tensors unchanged. It builds fixed-size summary states $(S,u)$ from learned positive feature maps $\phi$, subtracts retrieved-token feature contributions to keep the exact and residual sets non-overlapping, and merges the estimated residual numerator and denominator with the exact branch under one normalization. At a 1% exact-support budget, our residual-completion method improves over the selection-only Top-K baseline on RULER and BABILong across frozen 1B and 3B Llama-3.2-Instruct backbones at all reported context lengths. In the 0.5-4% exact-support budget sweeps, this trend largely persists. On LongBench, summarization results are mostly favorable, while multi-document QA is mixed. Attention-output diagnostics support retrieved-token subtraction as the partition-consistent accounting rule, while indicating that the main remaining error is imperfect learned-$\phi$ approximation of the unretrieved residual mass.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper's core idea is a subtraction-based residual estimator using learned positive feature maps to handle unretrieved tokens in partial KV decoding, but the gains over Top-K may trace more to the learned component than the accounting rule.

The main takeaway is that this work gives a concrete accounting rule for the mass of tokens left out of the exact KV set. It builds a fixed-size summary from learned positive feature maps, subtracts the retrieved tokens' contributions to avoid double-counting, and folds the residual estimate into the exact branch with one normalization step. The backbone stays frozen and the exact KV tensors are left alone.

Referee Report

3 major / 2 minor

Summary. The paper proposes a residual-mass accounting rule for partial-KV decoding: after selecting an exact-support set via oracle Top-K, fixed-size summary states (S, u) are built from learned positive feature maps φ; retrieved-token contributions are subtracted to enforce partition consistency; the resulting residual numerator/denominator estimate is merged with the exact branch under a single softmax normalization. Experiments report consistent gains over the selection-only Top-K baseline on RULER and BABILong (and mixed results on LongBench) for frozen 1B/3B Llama-3.2 backbones at 1% exact-support budgets across context lengths, with attention diagnostics supporting the subtraction step.

Significance. If the residual estimate proves reliable, the approach could enable more accurate low-budget partial decoding than pure selection methods while leaving the backbone and exact KV tensors unchanged. The partition-consistent accounting and diagnostic support for subtraction are positive elements; however, the central gains rest on the learned φ component whose training, validation, and error characterization are not detailed.

major comments (3)

[Abstract] Abstract: the reported gains at 1% exact-support budget are attributed to the residual-completion step after subtraction, yet the abstract itself identifies “imperfect learned-φ approximation of the unretrieved residual mass” as the dominant remaining error without supplying quantitative bounds on that approximation error or an ablation that replaces learned φ with a non-learned surrogate (constant, mean-pool, etc.).
[Evaluation section] Evaluation (RULER/BABILong results): the residual estimate depends on φ fitted to data and the method is evaluated on the same long-context tasks used to tune it; while the frozen backbone and oracle Top-K provide some separation, this leaves open whether observed improvements arise from the accounting rule or from extra capacity in φ.
[Method section] Method description: no details are given on the training procedure, loss, data, or validation protocol for the positive feature maps φ, nor on whether φ is task-specific or shared across the 1B and 3B backbones; this information is load-bearing for assessing reproducibility and the strength of the central claim.
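
One way the non-learned surrogate requested in the first major comment might look is sketched below. It is purely illustrative and not from the paper: the feature map is replaced by a constant, so the summary collapses to a value sum and a token count, and every unretrieved token receives the same kernel weight `kernel_const`. For brevity the summary here covers all tokens and every exactly handled token is subtracted. All names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def constant_surrogate_attention(q, keys, values, exact_idx, kernel_const=1.0):
    """Partial-KV attention with a constant (mean-pool style) residual kernel."""
    d_v = len(values[0])
    exact = sorted(set(exact_idx))
    # Exact branch: unnormalized softmax terms.
    num, den = [0.0] * d_v, 0.0
    for i in exact:
        w = math.exp(dot(q, keys[i]))
        den += w
        for c in range(d_v):
            num[c] += w * values[i][c]
    # Constant-feature summary: S is the sum of values, u is the token count.
    S = [sum(v[c] for v in values) for c in range(d_v)]
    u = float(len(values))
    # Same subtraction rule as the learned variant: remove exact-set tokens.
    for i in exact:
        u -= 1.0
        for c in range(d_v):
            S[c] -= values[i][c]
    # Merge under one normalization.
    return [(num[c] + kernel_const * S[c]) / (den + kernel_const * u)
            for c in range(d_v)]

q = [0.5, -0.3]
keys = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4], [0.6, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [0.5, 0.5]]
print(constant_surrogate_attention(q, keys, values, [0, 2, 4], kernel_const=1.0))
```

Setting `kernel_const=0.0` recovers the selection-only baseline, so sweeping the constant gives a learning-free reference point between pure Top-K and the learned-φ variant.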

minor comments (2)

[Method section] The notation (S, u) for summary states is introduced without an explicit equation showing how they are constructed from φ and the prefill tokens.
[Evaluation section] LongBench results are described as “mostly favorable” for summarization and “mixed” for multi-document QA; a table or per-task breakdown would clarify the pattern.
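
For the first minor comment, one plausible form of the missing construction is written out below. It assumes the standard linear-attention summary and is not taken from the paper itself. The index sets $M$ (mid-region prefill tokens), $R \subseteq M$ (retrieved tokens), and $A$ (sink/tail anchors), the output symbol $o(q)$, and the subscript "res" are notational assumptions; $s_i$ is the attention score of token $i$ for the query $q$.

```latex
\[
\begin{aligned}
S &= \sum_{i \in M} \phi(k_i)\, v_i^{\top}, &
u &= \sum_{i \in M} \phi(k_i), \\
S_{\mathrm{res}} &= S - \sum_{i \in R} \phi(k_i)\, v_i^{\top}, &
u_{\mathrm{res}} &= u - \sum_{i \in R} \phi(k_i), \\
o(q) &= \frac{\sum_{i \in A \cup R} e^{s_i}\, v_i + S_{\mathrm{res}}^{\top}\, \phi(q)}
             {\sum_{i \in A \cup R} e^{s_i} + u_{\mathrm{res}}^{\top}\, \phi(q)}.
\end{aligned}
\]
```

Under this reading, the subtraction in the second row is what keeps a retrieved token from being counted once exactly and once through $\phi$.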

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and commit to revisions that supply the requested details and analyses without altering the core claims of the work.

Referee: [Abstract] Abstract: the reported gains at 1% exact-support budget are attributed to the residual-completion step after subtraction, yet the abstract itself identifies “imperfect learned-φ approximation of the unretrieved residual mass” as the dominant remaining error without supplying quantitative bounds on that approximation error or an ablation that replaces learned φ with a non-learned surrogate (constant, mean-pool, etc.).

Authors: We agree that the abstract would be strengthened by quantitative bounds on the approximation error and by an ablation against a non-learned surrogate. In the revised manuscript we will add a sentence reporting the observed residual-mass approximation error statistics from the attention diagnostics already present in the paper, and we will include a new ablation that substitutes a constant or mean-pool surrogate for the learned φ to isolate the contribution of the accounting rule.

revision: yes

Referee: [Evaluation section] Evaluation (RULER/BABILong results): the residual estimate depends on φ fitted to data and the method is evaluated on the same long-context tasks used to tune it; while the frozen backbone and oracle Top-K provide some separation, this leaves open whether observed improvements arise from the accounting rule or from extra capacity in φ.

Authors: The referee correctly identifies that the current manuscript supplies insufficient information to rule out confounding from φ's capacity. We will revise the evaluation section to document the training and validation protocol used for φ and to discuss how the frozen backbone together with the oracle Top-K selector help isolate the effect of the residual accounting rule. We will also note any task overlap as a limitation and, where feasible, add a cross-task control experiment.

revision: partial

Referee: [Method section] Method description: no details are given on the training procedure, loss, data, or validation protocol for the positive feature maps φ, nor on whether φ is task-specific or shared across the 1B and 3B backbones; this information is load-bearing for assessing reproducibility and the strength of the central claim.

Authors: We acknowledge the omission. The revised method section will contain a complete description of the training procedure, loss function, data sources, and validation protocol for φ, together with an explicit statement on whether φ is shared across the 1B and 3B backbones. These additions will directly address reproducibility and allow readers to evaluate the strength of the central claim.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

The paper proposes an empirical method for residual-mass accounting using learned positive feature maps φ to build summary states, subtract retrieved contributions, and merge under normalization. Claims rest on reported improvements over oracle Top-K baseline on RULER, BABILong, and LongBench with frozen Llama backbones. No derivation chain reduces by construction to inputs; φ is a fitted component of the proposed technique rather than a self-referential loop. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked as load-bearing. The approach is self-contained against external task benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The claim rests on the learned feature maps φ for residual approximation and the domain assumption that subtraction keeps the exact and residual partitions non-overlapping before normalization; no independent verification of φ outside the reported experiments is described.

free parameters (1)

learned positive feature maps φ
Used to construct fixed-size summary states (S,u) that estimate residual mass from prefill tokens

axioms (2)

domain assumption: Retrieved-token feature contributions can be subtracted from the summary to keep exact and residual sets non-overlapping
Invoked to maintain partition consistency in the accounting rule after exact support selection
domain assumption: Estimated residual numerator and denominator can be merged with the exact branch under a single normalization
Central step that produces the final attention output

invented entities (2)

residual estimate (no independent evidence)
purpose: Represent contributions from unretrieved prefill tokens
Approximated via learned φ and summary states (S,u)
summary states (S,u) (no independent evidence)
purpose: Fixed-size representation of residual mass
Built from learned positive feature maps φ

pith-pipeline@v0.9.0 · 5542 in / 1622 out tokens · 47116 ms · 2026-05-10T20:11:10.279256+00:00 · methodology
