Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Quest speeds up long-context LLM self-attention by up to 7.03x via query-aware selection of the top-K most critical KV cache pages, reducing end-to-end inference latency by 2.23x with negligible accuracy loss.
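
To make the mechanism concrete, here is a minimal PyTorch sketch of query-aware page selection in the spirit of Quest: each KV cache page keeps element-wise min/max summaries of its keys, the query is used to upper-bound each page's attainable attention score, and only the top-K pages participate in attention. The helper names, shapes, and plain-PyTorch style are illustrative assumptions, not the paper's released kernels.

```python
import torch

def page_criticality(query, key_min, key_max):
    """Upper-bound the attention score each KV page can reach for this query.

    query:   (d,)        decode-step query vector
    key_min: (pages, d)  element-wise min of keys within each page
    key_max: (pages, d)  element-wise max of keys within each page
    """
    # Per channel, the largest possible q_i * k_i over a page is
    # max(q_i * min_i, q_i * max_i); summing channels bounds q . k.
    return torch.maximum(query * key_min, query * key_max).sum(dim=-1)

def select_topk_pages(query, key_min, key_max, k):
    # Pick the k pages whose score upper bound is highest for this query.
    scores = page_criticality(query, key_min, key_max)
    return torch.topk(scores, k=min(k, scores.numel())).indices

# Toy usage: attend only over the keys/values of the selected pages.
d, pages, page_size, k = 64, 128, 16, 8
keys = torch.randn(pages, page_size, d)
values = torch.randn(pages, page_size, d)
q = torch.randn(d)

idx = select_topk_pages(q, keys.amin(dim=1), keys.amax(dim=1), k)
sel_k = keys[idx].reshape(-1, d)   # (k * page_size, d)
sel_v = values[idx].reshape(-1, d)
attn = torch.softmax((sel_k @ q) / d**0.5, dim=0)
out = attn @ sel_v                 # sparse-attention output, shape (d,)
```

Because the per-page score is an optimistic upper bound rather than an average, pages containing even a single highly relevant token tend to survive the top-K cut, which is why the selection can be aggressive with little accuracy loss.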