
S3: Increasing GPU utilization during generative inference for higher throughput

3 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary: background 1

citation-polarity summary: background 1

fields: cs.CL 2, cs.AR 1

years: 2026 1, 2024 2

representative citing papers

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

cs.CL · 2024-02-05 · conditional · novelty 6.0

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer


  • NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
    cs.AR · 2026-04-28 · unverdicted · none · ref 13

    NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
    cs.CL · 2024-02-05 · conditional · none · ref 9

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  • A Survey on Efficient Inference for Large Language Models
    cs.CL · 2024-04-22 · accept · none · ref 284

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.