LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen · 2024 · arXiv 2410.03111

4 Pith papers cite this work. Polarity classification is still being indexed.


representative citing papers

Compute Where it Counts: Self Optimizing Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

cs.NI · 2026-04-23 · unverdicted · novelty 6.0

SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

KVCapsule compresses KV cache in VLMs by 60% to deliver up to 2x higher tokens-per-second and 2.4x memory reduction with negligible accuracy loss.

HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

HCInfer recovers up to 5.2% accuracy over compressed LLMs and delivers 10.4x speedup versus full-precision models by offloading compensation parameters to CPU with async execution on resource-limited hardware.

citing papers explorer


Compute Where it Counts: Self Optimizing Language Models · cs.LG · 2026-05-11 · unverdicted · none · ref 24
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference · cs.NI · 2026-04-23 · unverdicted · none · ref 53
KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy · cs.CV · 2026-05-14 · unverdicted · none · ref 32
HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices · cs.LG · 2026-05-07 · unverdicted · none · ref 26
