QUICK: Quantization-aware Interleaving and Conflict-free Kernel for Efficient LLM Inference
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 2 unverdicted
citing papers explorer
- LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
  TurboMind delivers up to 61% lower latency and 156% higher throughput for mixed-precision LLM inference across 16 models and 4 GPU architectures via optimized weight packing, adaptive alignment, instruction parallelism, and KV memory pipelines.
- A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits
  Combining pruning, quantization, and early exits in CNNs reduces inference latency and memory on real edge devices with minimal accuracy loss.