Large language model inference acceleration: A comprehensive hardware perspective
7 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (2)
Citation polarities: background (2)
citing papers explorer
- The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
  Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.
- Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
  SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
- A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
  Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
- FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
  FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
- Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling
  A new end-to-end modeling approach for latency-sensitive many-core architectures with globally shared L1 SPM tracks RTL golden models within 7% error while running up to 115x faster and supports profiling for design optimization.
- Secure eFPGA-Enabled Edge LLM Inference: Architectural and Hardware Countermeasures
  A hybrid ASIC+eFPGA architecture is proposed to add adaptive security mechanisms to edge LLM inference while retaining ASIC efficiency.
- Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
  The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.