QUICK: Quantization-aware interleaving and conflict-free kernel for efficient LLM inference

Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim · 2024 · arXiv 2402.10076

2 Pith papers cite this work.

representative citing papers

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

cs.DC · 2025-08-21 · unverdicted · novelty 5.0 · ref 36

TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.

A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits

cs.AI · 2026-04-16 · unverdicted · novelty 4.0 · ref 65

Combining pruning, quantization, and early exits in CNNs reduces inference latency and memory on real edge devices with minimal accuracy loss.