YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference
Pith reviewed 2026-05-10 13:54 UTC · model grok-4.3
The pith
Adding weighted KV residual connections to YOCO restores capacity lost in cross-layer sharing, yielding SOTA results at 50% KV cache compression and beating the standard Transformer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YOCO++ augments YOCO by inserting a weighted residual connection from the KV states of each bottom-half layer to the KV states of the bottom layer. The original YOCO already shares the middle-layer KVs with the top-half layers; the new residuals increase representational power without altering the compression schedule or computational graph. Consequently, at a fixed 50% KV cache compression rate, YOCO++ records state-of-the-art accuracy among cross-layer KV methods and exceeds the performance of the full, uncompressed Transformer.
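To make the mechanism concrete, here is a minimal PyTorch-style sketch of a weighted KV residual between two layers' key/value states. The module name, the single learned scalar per connection, and the choice of which layer's KV feeds which are illustrative assumptions; the paper defines the exact wiring and weighting scheme.

```python
import torch
import torch.nn as nn

class WeightedKVResidual(nn.Module):
    """Sketch: mix a layer's own KV states with a scalar-weighted copy of another
    layer's KV states. Shapes match, so the KV cache layout and the computation
    graph are unchanged; only the scalar weight is added."""

    def __init__(self, init_weight: float = 0.0):
        super().__init__()
        # One learned scalar per connection (assumed; could equally be per-head).
        self.alpha = nn.Parameter(torch.tensor(init_weight))

    def forward(self, own_kv: torch.Tensor, residual_kv: torch.Tensor) -> torch.Tensor:
        # own_kv, residual_kv: (batch, n_heads, seq_len, head_dim) key or value states.
        return own_kv + self.alpha * residual_kv
```

Because the residual is a same-shape addition, nothing extra is cached: the 50% compression schedule inherited from YOCO is untouched.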
What carries the argument
Weighted residual connections that route each bottom-half layer's KV representation back to the bottom layer's KV, layered on top of YOCO's middle-to-top-half KV sharing.
If this is right
- 50% KV cache compression becomes practical without sacrificing accuracy relative to the full model.
- Cross-layer KV reuse can be strengthened by lightweight residuals rather than by redesigning attention layers.
- Training and inference pipelines for compressed models remain identical to the baseline YOCO implementation.
- Memory savings scale directly with context length while preserving or improving quality (a back-of-envelope sizing sketch follows this list).
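As a rough illustration of how the memory claim scales (the numbers below are hypothetical, not taken from the paper), the per-sequence KV cache grows linearly with context length, and halving the number of layers whose KVs are cached halves it at every length:

```python
def kv_cache_bytes(n_cached_layers: int, seq_len: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache size: keys + values for every cached layer."""
    return 2 * n_cached_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model, 8 KV heads of dim 128, fp16, 32k-token context.
full = kv_cache_bytes(32, 32_768, 8, 128)   # standard Transformer caches every layer
half = kv_cache_bytes(16, 32_768, 8, 128)   # 50% cross-layer sharing caches half
print(f"full: {full / 2**30:.1f} GiB, compressed: {half / 2**30:.1f} GiB")  # 4.0 vs 2.0 GiB
```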
Where Pith is reading between the lines
- Similar residual links might be added to other layer-sharing or token-compression schemes to recover capacity at higher ratios.
- The technique implies that KV cache design can be made layer-specific rather than uniform across the stack.
- Integration with quantization or pruning could compound the memory reductions beyond the 50% mark shown here.
Load-bearing premise
The added residual links increase model capacity without introducing measurable training or inference overhead, and without architectural changes beyond the residual weights themselves.
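The "no measurable overhead" part is easy to sanity-check with back-of-envelope arithmetic: assuming one learned scalar per bottom-half layer (an assumption about the parameterization), a hypothetical 32-layer, 7B-parameter model gains 16 parameters, a relative increase of roughly 2e-9.

```python
n_layers = 32
extra_params = n_layers // 2      # one assumed scalar weight per bottom-half layer
base_params = 7_000_000_000       # illustrative 7B baseline, not from the paper
print(f"added parameters: {extra_params}, "
      f"relative overhead: {extra_params / base_params:.1e}")
```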
What would settle it
An evaluation on a previously untested model size or downstream task where YOCO++ at 50% compression falls below the standard Transformer's accuracy would falsify the performance claim.
Original abstract
Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes YOCO++, an enhancement to the YOCO cross-layer KV compression technique for efficient LLM inference. YOCO++ adds weighted residual connections from the KV states of each bottom-half layer to the bottom layer's KV, increasing effective model capacity while preserving the original 50% KV cache compression ratio, per-token computation graph, and training/inference efficiency. Experiments demonstrate that YOCO++ achieves state-of-the-art results among cross-layer KV compression methods at 50% compression and outperforms the uncompressed standard Transformer baseline across reported benchmarks.
Significance. If the empirical results hold, the work offers a lightweight, parameter-efficient modification that mitigates performance degradation common to KV compression methods without sacrificing the core efficiency gains. The preservation of the original KV cache size and computation graph while adding capacity is a notable practical contribution for deploying large models under memory constraints.
minor comments (3)
- §4 (Experimental Setup): provide the exact model sizes, layer counts, and training datasets used for the main results, to allow direct replication of the SOTA claim at 50% compression.
- Table 2: report standard deviations or error bars across multiple runs for the perplexity and downstream-task metrics; single-run numbers weaken the outperformance claim over the Transformer baseline.
- §3.2: state explicitly whether the residual connection weights are learned per-layer or shared, and include an ablation removing the weighting to quantify its contribution (the candidate parameterizations are sketched below).
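As a sketch of what the §3.2 comment asks to disambiguate (names, defaults, and modes here are illustrative, not from the paper), the residual weighting could be parameterized per layer, shared across layers, or removed entirely for the ablation:

```python
import torch
import torch.nn as nn

class KVResidualWeights(nn.Module):
    """Three variants of the residual weighting the comment asks about:
    'per_layer' learns one scalar per bottom-half layer, 'shared' ties a single
    scalar across them, and 'none' fixes the weight to 1 as the ablation baseline."""

    def __init__(self, n_bottom_layers: int, mode: str = "per_layer"):
        super().__init__()
        self.mode = mode
        if mode == "per_layer":
            self.alpha = nn.Parameter(torch.zeros(n_bottom_layers))
        elif mode == "shared":
            self.alpha = nn.Parameter(torch.zeros(1))
        else:  # "none": unweighted residual used as the ablation
            self.register_buffer("alpha", torch.ones(1))

    def forward(self, layer_idx: int, own_kv: torch.Tensor,
                residual_kv: torch.Tensor) -> torch.Tensor:
        w = self.alpha[layer_idx] if self.mode == "per_layer" else self.alpha
        return own_kv + w * residual_kv
```

An ablation would simply compare the three modes at the same 50% compression point.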
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The referee's summary accurately captures the core contribution of YOCO++ in improving cross-layer KV compression while preserving efficiency.
Circularity Check
No significant circularity identified
full rationale
The paper describes YOCO++ as an empirical architectural enhancement to YOCO, adding weighted residual KV connections to increase capacity while preserving the original KV cache size, training, and inference efficiency. No equations, derivations, predictions, or first-principles results are presented that reduce to inputs by construction. Performance claims rest on experimental benchmarks at 50% compression, not on self-referential definitions or fitted parameters renamed as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing support. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- residual connection weights
axioms (2)
- domain assumption: Cross-layer KV sharing as defined in YOCO preserves sufficient information for downstream tasks
- standard math: Standard transformer attention and KV cache mechanics
Reference graph
Works this paper leans on
- [1] Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. The language model e... URL https://openreview.net/forum?id=mZn2Xyh9Ec.
- [2] Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Yu, K., et al. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880, 2025. URL https://openreview.net/forum?id=qkhgzNiEdj.
- Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proc...