
arXiv: 2605.18856 · v2 · pith:A3I55DDTnew · submitted 2026-05-13 · cs.LG · cs.CL · cs.IT · math.IT

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3

keywords: long-context inference, KV cache compression, spherical parameterization, rate-distortion optimization, attention geometry, memory efficiency, HBM bandwidth, paged decoding

The pith

Spherical KV stores each key as a scalar radius plus compact angle codes so attention logits can be computed directly without reconstructing dense vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to cut the memory growth and repeated HBM streaming costs that limit long-context decoding in transformer models. It reframes KV cache allocation as a rate-distortion problem whose solution is driven by the geometry of attention scores. Angle-Domain Attention keeps keys in spherical form and evaluates dot products from the angle codes alone. Rate-Distortion Retention then picks which tokens and heads to retain and at what precision under a fixed budget, yielding pages that stay uniform in tier. If these steps work, the decode loop can stay paged and fused while the resident KV footprint shrinks.

Core claim

Spherical KV treats KV allocation as a rate-distortion problem grounded in attention geometry. Its Angle-Domain Attention component stores keys as a scalar radius together with compact angle codes and computes logits directly from those codes without ever reconstructing the dense key vectors. Its Rate-Distortion Retention component jointly decides keep-or-drop and precision tier for each token and head under a fixed budget, producing tier-homogeneous pages that carry only lightweight metadata and support coalesced reads.

What carries the argument

Angle-Domain Attention, which computes attention logits from spherical radius-plus-angle codes without dense reconstruction, paired with Rate-Distortion Retention that allocates keep/drop and precision tiers under a fixed budget.

If this is right

  • KV residency shrinks while the decode path remains paged, block-local, and fusion-friendly.
  • HBM traffic in realistic serving settings falls because dense key reconstruction is avoided in the hot loop.
  • Tier-homogeneous pages with lightweight metadata enable coalesced reads and simpler memory management.
  • Retention and precision decisions are made jointly per token and head under a single fixed budget.
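
The joint keep/drop and tier decision in the last bullet can be sketched as a greedy budgeted allocation. This is an illustration only, not the paper's learned policy: the tier table, its byte costs and distortion factors, the utility values, and the function name `allocate` are all hypothetical.

```python
# Hypothetical tiers: (name, bytes per token-head entry, relative distortion).
TIERS = [("drop", 0, 1.0), ("low", 16, 0.30), ("high", 32, 0.05)]

def allocate(utilities, budget_bytes):
    """Greedy keep/drop + precision-tier choice under a byte budget.

    Every (token, head) entry starts dropped. Each round applies the
    single-tier upgrade with the largest utility-weighted distortion
    reduction per extra byte that still fits the remaining budget.
    """
    tier = {key: 0 for key in utilities}
    spent = 0
    while True:
        best = None
        for key, utility in utilities.items():
            t = tier[key]
            if t + 1 >= len(TIERS):
                continue  # already at the highest tier
            extra = TIERS[t + 1][1] - TIERS[t][1]
            if spent + extra > budget_bytes:
                continue  # upgrade does not fit
            gain = utility * (TIERS[t][2] - TIERS[t + 1][2]) / extra
            if best is None or gain > best[0]:
                best = (gain, key, extra)
        if best is None:
            break
        _, key, extra = best
        tier[key] += 1
        spent += extra
    return {key: TIERS[t][0] for key, t in tier.items()}, spent

# Made-up utility estimates for three (token, head) entries.
utilities = {("t0", "h0"): 0.9, ("t1", "h0"): 0.5, ("t2", "h0"): 0.05}
plan, spent = allocate(utilities, budget_bytes=48)
print(plan, spent)
# {('t0', 'h0'): 'high', ('t1', 'h0'): 'low', ('t2', 'h0'): 'drop'} 48
```

Grouping the surviving entries by tier before writing them out would then give pages in which every entry shares one precision, which is one plausible reading of "tier-homogeneous".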

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spherical representation might be applied to value vectors or to attention patterns in other architectures without changing the core decode loop.
  • If the utility estimator inside RDR proves stable, the method could be extended to dynamic context lengths that grow or shrink during a single generation.
  • Combining the angle-code storage with existing eviction or offloading schemes could produce additive gains on hardware with very tight memory.

Load-bearing premise

That attention logits computed from the spherical angle codes and radius stay close enough to the original dense-key logits, and that future token utility can be estimated reliably enough to avoid quality loss when tokens are dropped or quantized.
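
To make the first half of this premise concrete, here is a worked bound under stated assumptions; it is not taken from the paper. Assume a key is written as $k = r\,u(\varphi)$ with $u(\varphi)$ a unit vector in standard hyperspherical coordinates, the radius $r$ is stored exactly, and each of the $d-1$ angles is quantized uniformly with step $\Delta$, so the stored angles are $\hat\varphi = \varphi + \delta$ with $|\delta_i| \le \Delta/2$.

```latex
\begin{align*}
\bigl| q^\top k - q^\top \hat{k} \bigr|
  &= r \,\bigl| q^\top \bigl( u(\varphi) - u(\hat\varphi) \bigr) \bigr| \\
  &\le r \,\|q\|_2 \,\bigl\| u(\varphi) - u(\hat\varphi) \bigr\|_2
     && \text{(Cauchy--Schwarz)} \\
  &\le r \,\|q\|_2 \,\|\delta\|_2
     && \text{(chord $\le$ arc; metric coefficients $\le 1$)} \\
  &\le r \,\|q\|_2 \,\sqrt{d-1}\,\frac{\Delta}{2}.
\end{align*}
% With b-bit uniform codes on [0, 2\pi): \Delta = 2\pi / 2^{b}, so
% | q^\top k - q^\top \hat{k} | \le r \,\|q\|_2 \,\sqrt{d-1}\;\pi / 2^{b}.
```

The third line uses the fact that the round metric on the sphere is diagonal in hyperspherical coordinates with coefficients $1, \sin^2\varphi_1, \sin^2\varphi_1 \sin^2\varphi_2, \dots$, all at most 1, so the straight path from $\varphi$ to $\hat\varphi$ in angle space has length at most $\|\delta\|_2$ on the sphere. The bound is worst-case and grows with $\sqrt{d-1}$; whether typical key distributions sit far below it is exactly the kind of question the premise leaves open.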

What would settle it

A side-by-side run on a long-context benchmark in which the model using Spherical KV shows a clear drop in accuracy or coherence compared with an otherwise identical run that keeps the full dense KV cache.

Figures

Figures reproduced from arXiv: 2605.18856 by Aman Chadha, Amitava Das, Amit Dhanda, Anay Chauhan, Arion Das, Gurucharan Marthi Krishna Kumar, Vinija Jain.

Figure 1: Spherical KV Architecture. Dense keys from the incoming layer are mapped to a spherical parameterization (radius r and angles ϕ). A learned rate–distortion retention policy π enforces a strict memory budget B by predicting (i) a keep/drop decision z_i ∈ {0, 1} and (ii) a per-item precision (rate allocation) b_i. Joint quantization and filtering produce a sparse, quantized spherical stream that is stored in…
Figure 2: Angle-Domain Attention in Practice (Paged KV; kernel-realized, no reconstruction). This diagram shows how Angle-Domain Attention fits into a standard paged KV-cache decode loop without paying a hidden densification tax. Top (prefill): for each token, we write a paged/contiguous K-cache page that stores compact spherical K-codes: a scalar r_k (radius) and c^θ_k (packed angle codes), plus lightweight tier/flag…
Figure 3: Rate–Distortion Retention: joint keep/drop + tiering under a strict KV budget. We treat KV residency as a constrained rate–distortion allocation problem: under a fixed byte budget, the controller spends precision on states that are likely to matter later and removes low-utility states first. (A) RD controller. Inputs include token/head/layer context plus a cost model; an RD score (e.g., predicted ∆loss v…
Figure 4: Iso-quality Pareto frontiers for memory-bounded decoding. Each panel plots decode throughput (tok/s; higher is better) versus effective KV budget b_KV (bytes/token; lower is better) under paged/ragged serving, across three models and context lengths L ∈ {8K, 32K, 128K}. Let Q be the quality score (higher is better; defined in § [X]) and let Q⋆_dense denote the best Dense-KV quality in the panel. We enforce…
Figure 5: Ablations A0–A5 at extreme context (L=128K): frontier shift + mechanism evidence under iso-quality (∆ = 0.8). Top row (A0–A3, three LLMs): For each model (columns), we plot decode throughput s (tok/s; higher is better) versus effective resident KV budget b_KV (bytes/token; lower is better) measured on a paged/ragged KV substrate. Let Q⋆_dense be the best Dense KV quality for that model; we retain operating…
Figure 6: Stability phase diagram and bounded gating behavior. (A) Phase diagram (policy map). The x-axis is angular precision b^θ (equivalently, the effective KV budget allocated to angular codes), and the y-axis is a brittleness proxy B_t that increases when attention becomes sensitive to small logit perturbations (e.g., inverse-margin or a normalized query-norm proxy). The plane separates three regimes: safe compr…
Abstract

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Spherical KV, a long-context inference technique that frames KV cache allocation as a rate-distortion problem grounded in attention geometry. It introduces Angle-Domain Attention (ADA), which stores keys via a scalar radius and compact angle codes and computes attention logits directly from these codes without reconstructing dense vectors, and Rate-Distortion Retention (RDR), which jointly selects per-token and per-head keep/drop decisions plus precision tiers under a fixed budget to yield tier-homogeneous pages with lightweight metadata.

Significance. If the spherical parameterization and retention policy can be shown to preserve attention distributions and model quality, the method would directly target HBM traffic and KV residency in paged serving systems, offering a deployment-friendly alternative to eviction, quantization, or offloading approaches that still require dense reconstruction.

major comments (2)
  1. [Abstract / ADA description] Abstract (ADA paragraph): the assertion that logits computed directly from angle codes and radius preserve model quality sufficiently close to full dense keys is load-bearing for both the efficiency and correctness claims, yet the manuscript supplies neither an algebraic identity establishing exact equivalence, a Lipschitz-style bound on the approximation error, nor analysis of how angle-code quantization interacts with the geometry of typical transformer key spaces.
  2. [Abstract / RDR description] Abstract (RDR paragraph): retention decisions rest on future-utility estimates derived from the approximate logits; without any verification that these estimates remain reliable enough to avoid performance degradation, the keep/drop and tier-selection policy lacks grounding and may undermine the claimed quality retention under fixed budgets.
minor comments (1)
  1. The description of 'tier-homogeneous pages with lightweight metadata and coalesced reads' would benefit from explicit quantification of metadata overhead and its impact on HBM traffic in realistic serving configurations.
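
The quantification this comment asks for is simple arithmetic once a layout is fixed. The sketch below is only an example of that calculation: the layout (one radius plus d−1 packed angle codes per token per head, with a fixed metadata header per page) and every number in it are assumptions, not values reported in the paper.

```python
def key_bytes_per_token(head_dim, angle_bits, radius_bytes,
                        page_tokens, page_meta_bytes):
    """Resident key bytes per token per head under a hypothetical layout.

    Assumed layout: one radius, (head_dim - 1) packed angle codes, and a
    page header of `page_meta_bytes` amortized over `page_tokens` tokens.
    """
    angle_bytes = (head_dim - 1) * angle_bits / 8
    meta_bytes = page_meta_bytes / page_tokens
    return radius_bytes + angle_bytes + meta_bytes

# Illustrative numbers only.
total = key_bytes_per_token(head_dim=128, angle_bits=4, radius_bytes=2,
                            page_tokens=16, page_meta_bytes=8)
dense_fp16 = 128 * 2  # dense fp16 key for the same head dimension
print(total, dense_fp16)  # 66.0 256
print((8 / 16) / total)   # metadata share of the compressed key bytes
```

Under these made-up settings the page header accounts for well under one percent of the compressed key bytes; the real overhead depends on page size and on what the tier/flag metadata actually contains.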

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help us improve the clarity and rigor of our presentation. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / ADA description] Abstract (ADA paragraph): the assertion that logits computed directly from angle codes and radius preserve model quality sufficiently close to full dense keys is load-bearing for both the efficiency and correctness claims, yet the manuscript supplies neither an algebraic identity establishing exact equivalence, a Lipschitz-style bound on the approximation error, nor analysis of how angle-code quantization interacts with the geometry of typical transformer key spaces.

    Authors: We acknowledge the absence of a formal algebraic identity or Lipschitz bound in the current version. The spherical parameterization is motivated by the observation that attention logits depend primarily on the angular similarity between queries and keys, allowing us to store and compute using radius and angle codes. While exact equivalence does not hold due to quantization, our empirical results across multiple models demonstrate that the resulting attention distributions closely match those of dense keys, with negligible impact on downstream task performance. In the revised manuscript, we will include a dedicated analysis section deriving an upper bound on the logit error as a function of the angle code precision and discussing its implications for typical key vector distributions in transformers.

    Revision: yes

  2. Referee: [Abstract / RDR description] Abstract (RDR paragraph): retention decisions rest on future-utility estimates derived from the approximate logits; without any verification that these estimates remain reliable enough to avoid performance degradation, the keep/drop and tier-selection policy lacks grounding and may undermine the claimed quality retention under fixed budgets.

    Authors: The referee raises a valid point regarding the grounding of the retention policy. Rate-Distortion Retention (RDR) uses approximate logits to estimate future utility for keep/drop and tier decisions. To verify reliability, the full paper includes experiments showing that models using RDR maintain performance close to baselines under various budgets. We will revise the manuscript to add explicit comparisons of utility estimates computed from approximate versus full logits, including correlation metrics and ablation studies on how approximation errors affect retention decisions. This will provide the necessary verification that the estimates remain reliable.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method presented as independent construction

Full rationale

The provided abstract and summary describe Spherical KV as a new rate-distortion approach using Angle-Domain Attention (spherical parameterization with direct logit computation) and Rate-Distortion Retention (joint keep/drop and precision selection). No equations, fitted parameters, or self-citations are shown that reduce the claimed preservation of attention quality or retention decisions to quantities defined by the method's own inputs or outputs. The derivation chain is self-contained as a proposed algorithmic construction grounded in attention geometry, with no evidence of self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the spherical parameterization and utility estimation are introduced as part of the method but lack further specification.

pith-pipeline@v0.9.0 · 5809 in / 1156 out tokens · 86556 ms · 2026-05-20T20:25:34.663568+00:00 · methodology

