Effectively Compress KV Heads for LLM

Yu, H · arXiv 2406.07056

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

cs.CL · 2024-10-17 · unverdicted · novelty 6.0

LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.

WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

cs.CV · 2026-04-02 · unverdicted · novelty 5.0

WSVD delivers over 1.8x faster VLM decoding via weighted low-rank approximation at fine granularity plus quantization, without accuracy loss.

citing papers explorer

Showing 2 of 2 citing papers.

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation · cs.CL · 2024-10-17 · unverdicted · none · ref 36
WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models · cs.CV · 2026-04-02 · unverdicted · none · ref 17