pith · machine review for the scientific record

arxiv: 2604.13556 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords KV cache compression · cross-layer KV sharing · LLM efficient inference · residual connections · YOCO · transformer memory optimization

The pith

Adding weighted KV residual connections to YOCO restores capacity lost in cross-layer sharing, yielding SOTA results at 50% KV cache compression and beating the standard Transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models devote substantial memory to key-value (KV) caches during inference. Cross-layer compression techniques reduce this footprint by reusing KV states across layers, yet they typically lower model quality. The work starts from YOCO, which shares the middle layer's KVs with all top-half layers, and augments it with simple weighted residual links between each bottom-half layer's KV and the bottom layer's KV. These additions raise effective capacity while leaving training and inference costs unchanged. Experiments at a 50% compression ratio show the resulting YOCO++ model surpasses both prior cross-layer methods and the uncompressed Transformer baseline.

Core claim

YOCO++ augments YOCO by inserting a weighted residual connection between the KV states of each bottom-half layer and those of the bottom layer, and caching the combined KVs. The original YOCO already shares the middle-layer KVs with the top-half layers; the new residuals increase representational power without altering the compression schedule or computational graph. Consequently, at a fixed 50% KV cache compression rate, YOCO++ records state-of-the-art accuracy among cross-layer KV methods and exceeds the performance of the full, uncompressed Transformer.
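To make the mechanism concrete, here is a minimal sketch of how the combined cache entry could be formed. The abstract only says the residual sits "between the KVs of each bottom-half layer and the bottom layer", so the direction of the mix, the weight shape (a single scalar per layer is assumed here), and the scaling factor hinted at in Figure 2 are assumptions, not the paper's stated implementation.

```python
import torch

def combined_kv(k_i, v_i, k_1, v_1, alpha_i):
    """Hypothetical YOCO++ cache entry for bottom-half layer i.

    k_i, v_i : layer i's own key/value states, (batch, heads, seq, head_dim)
    k_1, v_1 : the bottom layer's key/value states, same shape
    alpha_i  : learned residual weight for layer i (assumed to be a scalar)
    """
    k_cached = k_i + alpha_i * k_1  # weighted residual mixes in the bottom layer's KV
    v_cached = v_i + alpha_i * v_1
    return k_cached, v_cached
```

The set of cached tensors stays exactly as in YOCO: one combined entry per bottom-half layer, with the middle layer's entry shared by every top-half layer, so the 50% compression ratio and the attention computation itself are untouched.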

What carries the argument

Weighted residual connections that combine each bottom-half layer's KV representation with the bottom layer's KV, layered on top of YOCO's middle-to-top-half KV sharing.

If this is right

  • 50% KV cache compression becomes practical without sacrificing accuracy relative to the full model.
  • Cross-layer KV reuse can be strengthened by lightweight residuals rather than by redesigning attention layers.
  • Training and inference pipelines for compressed models remain identical to the baseline YOCO implementation.
  • Memory savings scale directly with context length while preserving or improving quality (a rough accounting sketch follows this list).
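A rough accounting sketch of the memory claim. The layer count below matches the 22-layer model mentioned in Figure 2, but the head count, head dimension, batch size, and sequence length are illustrative assumptions.

```python
def kv_cache_bytes(cached_layers, batch, heads, seq_len, head_dim, dtype_bytes=2):
    # Keys and values (factor 2), stored in 16-bit precision (2 bytes per element).
    return 2 * cached_layers * batch * heads * seq_len * head_dim * dtype_bytes

L, heads, head_dim = 22, 16, 128          # layer count from Figure 2; the rest assumed
batch, seq_len = 1, 32_768

full = kv_cache_bytes(L, batch, heads, seq_len, head_dim)       # standard Transformer: all layers cached
half = kv_cache_bytes(L // 2, batch, heads, seq_len, head_dim)  # YOCO / YOCO++: bottom-half layers only;
                                                                # the middle layer's entry doubles as the
                                                                # shared KV for the top half

print(f"full: {full / 2**30:.2f} GiB  compressed: {half / 2**30:.2f} GiB  "
      f"ratio: {half / full:.0%}")   # ratio stays 50% at any sequence length
```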

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar residual links might be added to other layer-sharing or token-compression schemes to recover capacity at higher ratios.
  • The technique implies that KV cache design can be made layer-specific rather than uniform across the stack.
  • Integration with quantization or pruning could compound the memory reductions beyond the 50% mark shown here.

Load-bearing premise

The added residual links increase model capacity without introducing measurable overhead or requiring any retraining or architectural changes beyond the residual weights themselves.
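A minimal PyTorch-style sketch of why the overhead is plausibly negligible if the free parameters amount to one learned scalar per bottom-half layer. The actual parameterization (per-layer, per-head, or shared) is not stated in the material above, so the class name, shapes, and initialization here are assumptions.

```python
import torch
import torch.nn as nn

class KVResidualWeights(nn.Module):
    """Hypothetical residual weights for YOCO++: one scalar per bottom-half layer.

    For a 22-layer model this adds 11 parameters, negligible next to the model's
    weight matrices, and the mixing itself is a single multiply-add over tensors
    that are computed anyway.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # Initialized at zero so the untrained model starts out as plain YOCO.
        self.alpha = nn.Parameter(torch.zeros(num_layers // 2))

    def mix(self, i: int, kv_i: torch.Tensor, kv_1: torch.Tensor) -> torch.Tensor:
        # Combined KV to cache for bottom-half layer i (0-indexed within the bottom half).
        return kv_i + self.alpha[i] * kv_1
```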

What would settle it

An evaluation on a previously untested model size or downstream task where YOCO++ at 50% compression falls below the standard Transformer's accuracy would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2604.13556 by Bo Zheng, Chengting Yu, Haoyi Wu, Kewei Tu, Wenbo Su, Yizhen Zhang, You Wu, Yuchi Xu, Ziheng Chen.

Figure 1
Figure 1: Illustration of YOCO and YOCO++. KVs that need to be cached are indicated by red boxes. (a) YOCO caches the KVs of the bottom-half layers and shares the KVs of the middle layer with the top-half layers. (b) Based on YOCO, YOCO++ introduces a weighted residual connection between the KVs of each bottom-half layer and the bottom layer and caches the combined KVs. KV cache is reduced by 50% because we only nee…
Figure 2
Figure 2: Learned residual weights of a 22-layer YOCO++ model without (left) and with (right) the scaling factor.
Figure 3
Figure 3: Prefilling latency (left) and decoding throughput (right) on an NVIDIA H20 (96GB) GPU at different sequence lengths.

Downstream task accuracy (YOCO++ row truncated at source):

Model         HellaSwag  OBQA  WinoGrande  ARC-c  ARC-e  PIQA   Avg
Transformer   51.25      33.0  56.04       27.56  52.1   70.29  48.37
YOCO          51.21      32.0  53.59       29.01  51.43  70.62  47.98
FusedKV-Lite  52.28      32.8  54.93       28.24  52.19  70.13  48.43
FusedKV       52.04      32.0  55.48       28.67  51.98  70.35  48.42
YOCO++        52.30      34.0  55.48       27.90  53.11  71…    …
Figure 4
Figure 4: Training loss curves on a 100B subset of the SlimPajama dataset.
read the original abstract

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes YOCO++, an enhancement to the YOCO cross-layer KV compression technique for efficient LLM inference. YOCO++ adds weighted residual connections from the KV states of each bottom-half layer to the bottom layer's KV, increasing effective model capacity while preserving the original 50% KV cache compression ratio, per-token computation graph, and training/inference efficiency. Experiments demonstrate that YOCO++ achieves state-of-the-art results among cross-layer KV compression methods at 50% compression and outperforms the uncompressed standard Transformer baseline across reported benchmarks.

Significance. If the empirical results hold, the work offers a lightweight, parameter-efficient modification that mitigates performance degradation common to KV compression methods without sacrificing the core efficiency gains. The preservation of the original KV cache size and computation graph while adding capacity is a notable practical contribution for deploying large models under memory constraints.

minor comments (3)
  1. [§4] §4 (Experimental Setup): provide the exact model sizes, number of layers, and training datasets used for the main results to allow direct replication of the SOTA claim at 50% compression.
  2. [Table 2] Table 2: report standard deviations or error bars across multiple runs for the perplexity and downstream task metrics, as single-run numbers weaken the outperformance claim over the Transformer baseline.
  3. [§3.2] §3.2: explicitly state whether the residual connection weights are learned per-layer or shared, and include an ablation removing the weighting to quantify its contribution (one possible ablation grid is sketched after this list).
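One way the requested ablation could be laid out; the variant names and the reading that "removing the weighting" means fixing the residual coefficient at 1 are assumptions, since §3.2 itself is not reproduced above.

```python
# Hypothetical ablation grid for minor comment 3: how the residual weight is parameterized.
ABLATIONS = {
    "yoco_baseline":    dict(kv_residual=False),                       # no KV residual at all
    "unweighted":       dict(kv_residual=True, weight="fixed_at_one"), # K_i + K_1, nothing learned
    "shared_scalar":    dict(kv_residual=True, weight="shared"),       # one alpha for all bottom-half layers
    "per_layer_scalar": dict(kv_residual=True, weight="per_layer"),    # one alpha per layer (assumed default)
}
```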

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The referee's summary accurately captures the core contribution of YOCO++ in improving cross-layer KV compression while preserving efficiency.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes YOCO++ as an empirical architectural enhancement to YOCO, adding weighted residual KV connections to increase capacity while preserving the original KV cache size, training, and inference efficiency. No equations, derivations, predictions, or first-principles results are presented that reduce to inputs by construction. Performance claims rest on experimental benchmarks at 50% compression, not on self-referential definitions or fitted parameters renamed as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing support. The approach is evaluated against external benchmarks rather than against constructions internal to the method.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into exact parameters; the method introduces learned weights for residuals and assumes standard transformer KV mechanics plus the effectiveness of YOCO sharing.

free parameters (1)
  • residual connection weights
    Weighted residuals are added between bottom-half layers and the bottom layer; these scalars or vectors are almost certainly learned during training to balance capacity and compression.
axioms (2)
  • domain assumption Cross-layer KV sharing as defined in YOCO preserves sufficient information for downstream tasks
    YOCO++ directly builds on YOCO's middle-layer sharing for the top half; this premise is inherited without re-derivation.
  • standard math Standard transformer attention and KV cache mechanics
    The paper operates inside the conventional decoder-only transformer architecture and its memory layout (a minimal decode-step sketch follows this list).
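For readers less familiar with these mechanics, here is a minimal single-head sketch of the cached decode step that every method above builds on; it is standard material, not specific to this paper.

```python
import torch
import torch.nn.functional as F

def decode_step(q, new_k, new_v, k_cache, v_cache):
    """One decoding step of standard KV-cached attention (single head; no mask is
    needed because only the newest query attends, and it may see all prior tokens).

    q            : (batch, 1, head_dim)   query for the new token
    new_k, new_v : (batch, 1, head_dim)   this token's key and value
    k_cache      : (batch, t, head_dim)   keys of all previous tokens
    v_cache      : (batch, t, head_dim)   values of all previous tokens
    """
    k_cache = torch.cat([k_cache, new_k], dim=1)   # the cache grows with context length
    v_cache = torch.cat([v_cache, new_v], dim=1)
    scores = q @ k_cache.transpose(-2, -1) / k_cache.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v_cache
    return out, k_cache, v_cache
```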

pith-pipeline@v0.9.0 · 5465 in / 1279 out tokens · 38313 ms · 2026-05-10T13:54:43.949241+00:00 · methodology

