S3: Increasing gpu utilization during genera- tive inference for higher throughput

· 2023 · arXiv 2306.06000

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

cs.CL · 2024-02-05 · conditional · novelty 6.0

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 3 of 3 citing papers.

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference cs.AR · 2026-04-28 · unverdicted · none · ref 13
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache cs.CL · 2024-02-05 · conditional · none · ref 9
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 284
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

S3: Increasing gpu utilization during genera- tive inference for higher throughput

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer