Forecasting LLM inference performance via hardware-agnostic analytical modeling
3 Pith papers cite this work. Polarity classification is still being indexed.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers
WattGPU ML models predict LLM inference power and latency on unseen GPUs with median errors of 3.4-13.5% using public data and show better performance than baselines.
Recover-LoRA with synthetic-data distillation recovers 80-95% accuracy on most benchmarks after selective 2-bit quantization of MLP gate/up layers while delivering 7.5-23.3% throughput improvement.
citing papers explorer
- Latency Prediction for LLM Inference on NPU Systems
LENS predicts NPU LLM inference latency with 2.15% mean error by profiling each bucket with two E2E measurements and composing results to capture bucketing non-linearity.
- WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs
WattGPU ML models predict LLM inference power and latency on unseen GPUs with median errors of 3.4-13.5% using public data and show better performance than baselines.