KaVa: Latent Reasoning via Compressed KV-Cache Distillation

Anna Kuzina; Babak Ehteshami Bejnordi; Maciej Pioro; Paul N. Whatmough

arxiv: 2510.02312 · v2 · submitted 2025-10-02 · 💻 cs.LG

Pith reviewed 2026-05-18 10:00 UTC · model grok-4.3

keywords latent reasoning · KV-cache distillation · knowledge distillation · chain-of-thought · efficient inference · self-distillation · compressed representations

The pith

Compressed KV-caches from chain-of-thought teachers supply effective supervision for latent-reasoning students without token-level matches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models rely on explicit chain-of-thought traces for complex reasoning, but these traces add memory and compute costs while introducing stylistic noise. Latent reasoning offers an efficient internal alternative but suffers from weak supervision. The paper argues that the unstructured knowledge stored in a teacher's compressed key-value cache can be distilled directly into a student's continuous latent tokens. Alignment works by matching stepwise trajectories between the teacher's cache states and the student's latent representations. According to the abstract, the resulting students outperform prior latent baselines and degrade less when moving from equation-only traces to natural-language traces.

Core claim

KaVa performs self-distillation by transferring knowledge from the compressed KV-cache trajectories of a CoT-trained teacher into the continuous latent tokens of a student model, using stepwise alignment to supply supervisory signal even though no direct token correspondence exists.

What carries the argument

Stepwise alignment of compressed KV trajectories with the student's continuous latent tokens, which transfers abstract reasoning knowledge without requiring explicit token matches.

If this is right

The distilled students consistently outperform prior latent reasoning baselines on multi-step tasks.
Performance degrades less when moving from equation-only problems to full natural-language reasoning traces.
The approach continues to deliver efficiency gains when scaled to larger model backbones.
Compressed KV-cache distillation combines the accuracy of explicit CoT teachers with the deployability of latent inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same compression-based supervision might apply to other internal states such as attention patterns or activation caches.
Production systems could adopt these latent reasoners to reduce memory footprint during long inference chains.
Testing the method with teacher and student models of mismatched sizes would clarify how broadly the alignment transfers.

Load-bearing premise

Stepwise alignment between compressed KV trajectories and the student's continuous latent tokens is possible and sufficient to transfer useful reasoning supervision without explicit token-level correspondence or additional human labels.

What would settle it

A latent student trained with this KV-cache distillation performs no better than a strong non-distilled latent baseline on a set of natural-language multi-step reasoning tasks when compute budgets are matched.

Original abstract

Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KaVa tries to supervise latent reasoning with compressed KV caches from teachers, but the abstract gives almost no evidence that the alignment step actually transfers coherent multi-step structure.

Colleague, the new piece here is the claim that you can distill directly from a teacher's compressed KV cache into a student's continuous latent tokens as supervision for reasoning, without needing explicit CoT or token-level matches. They frame the KV states as carrying abstract unstructured knowledge that can still guide the student on natural-language traces where prior latent methods degrade more sharply. If the full experiments back that up with clean controls, it would be a practical way to keep some of the accuracy of CoT-trained teachers while dropping the generation cost at inference time. The abstract also says the method scales to larger backbones and shows smaller degradation moving from equation-only to natural language, which is the kind of result people working on efficient inference would notice. That part is worth checking.

The soft spot is obvious from the abstract alone: there are no numbers, no tables, no ablations, and no error bars. Without those it is impossible to tell whether the gains come from the KV signal or from baseline choices and post-hoc tuning. The alignment between compressed KV trajectories and the student's latent tokens is presented as straightforward, yet KV compression usually collapses per-step dependencies. The stress-test concern about granularity mismatch therefore lands directly on the central claim; if the loss is just matching averaged states, it may supervise on collapsed rather than coherent reasoning steps. The paper avoids circularity by introducing a fresh objective, which is a plus, but that does not substitute for visible evidence that the alignment works.

This is for people already following latent-reasoning and KV-cache work who want to see whether a new supervision source can close the gap. A reader who cares about reproducible efficiency gains would get something out of the full version if the experiments are solid.

I would send it to peer review because the supervision idea is distinct enough from prior distillation to merit referee time, even though the current write-up is too thin to judge the result yet.

Referee Report

2 major / 1 minor

Summary. The paper proposes KaVa, a framework for distilling knowledge from the compressed KV-cache of a teacher model to a latent-reasoning student model through self-distillation. It argues that the abstract, unstructured knowledge in the compressed KV-cache, which lacks direct token correspondence, can act as a rich supervisory signal by aligning stepwise KV trajectories with the student's continuous latent tokens. The approach is shown to outperform strong latent baselines, exhibit smaller degradation on natural-language reasoning traces compared to equation-only ones, and scale effectively to larger model backbones while maintaining efficiency.

Significance. If the central claim holds, this work has significant implications for efficient reasoning in LLMs. By providing a supervision mechanism for latent reasoning that does not require explicit chain-of-thought or human annotations, it addresses a key limitation in the field. The use of compressed KV-cache as a distillation source is a novel idea that could lead to more accurate and deployable latent reasoning models, combining the strengths of CoT-trained teachers with the efficiency of latent inference.

major comments (2)

[§3.2] The stepwise alignment of compressed KV trajectories with continuous latent tokens is presented as the core mechanism for transferring reasoning supervision. However, given that KV compression often averages or discards per-token sequential dependencies, the paper should provide more rigorous justification or empirical evidence that this alignment preserves coherent multi-step reasoning rather than supervising on misaligned representations.
[§5] The experimental results claim consistent outperformance and smaller degradation, but without detailed ablations on the alignment loss or error bars, it is difficult to evaluate the robustness of these findings to post-hoc choices or baseline implementations.
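The granularity concern in the first major comment can be illustrated with a toy eviction rule. The sketch below is an assumption for illustration only and is not the paper's or R-KV's actual scoring: the function names `cosine` and `evict`, the weight `lam`, and the score `lam * importance - (1 - lam) * redundancy` are all hypothetical. It shows how a budgeted cache keeps a subset of entries, so the surviving slots no longer map one-to-one onto the teacher's tokens.

```python
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def evict(
    keys: Sequence[Sequence[float]],
    importance: Sequence[float],
    budget: int,
    lam: float = 0.5,  # hypothetical trade-off weight, not taken from the paper
) -> List[int]:
    """Return the sorted indices of the `budget` cache entries that are kept.

    Toy score (an assumption): reward importance, penalise redundancy, where
    redundancy is the highest cosine similarity to any other key.
    """
    n = len(keys)
    scores = []
    for i in range(n):
        redundancy = max(
            (cosine(keys[i], keys[j]) for j in range(n) if j != i),
            default=0.0,
        )
        scores.append(lam * importance[i] - (1.0 - lam) * redundancy)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Placeholder inputs: entries 0 and 1 hold identical keys, entry 2 is distinct.
kept = evict(
    keys=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    importance=[0.9, 0.8, 0.5],
    budget=2,
)
print(kept)
```

With these placeholder inputs the duplicate entry at index 1 is dropped even though its importance is higher than that of index 2, which is the kind of reordering of per-token structure the referee asks the authors to account for.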

minor comments (1)

The abstract would be strengthened by including specific quantitative improvements to support the claims of outperformance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [§3.2] The stepwise alignment of compressed KV trajectories with continuous latent tokens is presented as the core mechanism for transferring reasoning supervision. However, given that KV compression often averages or discards per-token sequential dependencies, the paper should provide more rigorous justification or empirical evidence that this alignment preserves coherent multi-step reasoning rather than supervising on misaligned representations.

Authors: We agree that a more explicit justification for the preservation of multi-step coherence under KV compression is valuable. The original §3.2 describes the alignment objective and its motivation, but we acknowledge it could be strengthened. In the revision we will expand this section with an information-theoretic argument showing that the compression operator retains the dominant reasoning-relevant directions across steps, together with new probing experiments that measure step-wise coherence on held-out reasoning traces. These additions will directly address the concern that supervision might occur on misaligned representations. revision: yes
Referee: [§5] The experimental results claim consistent outperformance and smaller degradation, but without detailed ablations on the alignment loss or error bars, it is difficult to evaluate the robustness of these findings to post-hoc choices or baseline implementations.

Authors: We concur that additional ablations and statistical reporting would improve confidence in the results. In the revised manuscript we will add a dedicated ablation study on the alignment loss (varying segment weighting, loss coefficients, and trajectory length) and will report all primary metrics with standard error bars computed over at least three random seeds. These changes will allow readers to assess sensitivity to implementation details and confirm the stability of the reported gains. revision: yes
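The reporting promised in the second response, a mean with standard error over at least three seeds, amounts to a short calculation. The sketch below is a minimal illustration: `mean_and_stderr` is a hypothetical helper name, and the input values are placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-seed metric values.

    Uses the sample standard deviation (n - 1 in the denominator),
    so at least two seeds are required.
    """
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    return mean(values), stdev(values) / sqrt(len(values))


# Placeholder per-seed scores (NOT results from the paper).
per_seed = [1.0, 2.0, 3.0]
avg, sem = mean_and_stderr(per_seed)
print(f"{avg:.3f} +/- {sem:.3f}")
```

With only three seeds the standard error is itself a noisy estimate, so overlapping error bars between two methods would not by themselves settle which one is better.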

Circularity Check

0 steps flagged

New distillation framework with self-contained empirical validation

The paper proposes KaVa as a novel self-distillation framework that aligns compressed KV-cache trajectories with continuous latent tokens in a student model. No equations or claims reduce the reported performance gains or supervisory signal to quantities defined by the authors' own prior fits, self-citations, or ansatzes. The central results are empirical comparisons against latent baselines, with the method's novelty lying in the distillation objective itself rather than any derivation that loops back to its inputs by construction. The derivation chain consists of the framework definition followed by external validation on reasoning tasks, remaining independent of the circular patterns enumerated.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that KV-cache trajectories contain transferable reasoning information that can be aligned to latent tokens. No free parameters are explicitly named in the abstract; the method appears to introduce no new invented entities beyond standard latent tokens and KV caches.

axioms (1)

Domain assumption: Compressed KV-cache trajectories from a CoT teacher contain rich, alignable supervisory signal for latent reasoning without requiring direct token correspondence.
This premise is invoked to justify using the cache as the distillation target.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories"

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $\mathcal{L}_{KV} = \frac{1}{2M}\left(\lVert \mathrm{sg}[\tilde{K}_t] - K_s \rVert_p^p + \lVert \mathrm{sg}[\tilde{V}_t] - V_s \rVert_p^p\right)$ with R-KV redundancy-importance eviction
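
The KV-matching loss quoted in the passage above can be made concrete with a small sketch. Several readings here are assumptions rather than statements from the paper: the teacher terms are taken to be the compressed teacher keys and values, `sg` is taken to be a stop-gradient (so the teacher side is treated as a constant), `M` is taken to be the number of cache entries kept after compression, and the names `lp_pow` and `kv_loss` and the default `p = 2` are hypothetical. Plain Python has no autograd, so the stop-gradient appears only as a comment.

```python
from typing import List

Matrix = List[List[float]]  # M rows (cache entries) x d columns (feature dim)


def lp_pow(a: Matrix, b: Matrix, p: float) -> float:
    """Sum of |a_ij - b_ij|^p over all entries, i.e. ||a - b||_p^p."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("teacher and student caches must have the same shape")
    return sum(abs(x - y) ** p for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def kv_loss(
    k_teacher: Matrix,
    v_teacher: Matrix,
    k_student: Matrix,
    v_student: Matrix,
    p: float = 2.0,  # assumed default; the quoted formula leaves p open
) -> float:
    """Toy version of the quoted loss for one layer and one head.

    The teacher matrices stand for the compressed teacher cache and are
    treated as constants (the role played by sg[.] in the formula).
    """
    m = len(k_student)  # assumed meaning of M: number of retained entries
    if m == 0:
        raise ValueError("caches must contain at least one entry")
    key_term = lp_pow(k_teacher, k_student, p)
    value_term = lp_pow(v_teacher, v_student, p)
    return (key_term + value_term) / (2 * m)


# Placeholder caches with M = 2 entries of dimension 2.
k_t = [[1.0, 0.0], [0.0, 1.0]]
v_t = [[0.5, 0.5], [0.5, 0.5]]
k_s = [[3.0, 0.0], [0.0, 1.0]]
v_s = [[0.5, 0.5], [0.5, 0.5]]
print(kv_loss(k_t, v_t, k_s, v_s))         # p = 2
print(kv_loss(k_t, v_t, k_s, v_s, p=1.0))  # p = 1
```

A real implementation would batch this over layers and heads in a tensor library; the sketch only shows that the loss compares caches slot by slot, which requires the compressed teacher cache and the student's latent cache to have the same number of entries.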

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.