Large language model inference acceleration: A comprehensive hardware perspective

Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, et al. · 2024 · arXiv 2410.04466

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

Empirical study shows LLM inference backends can shift benchmark scores by up to 16.6 percentage points and cause output disagreements due to optimizations like prefix caching and custom kernels.

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.

A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

cs.CV · 2025-09-26 · conditional · novelty 6.0

FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.

Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

cs.AR · 2026-05-08 · unverdicted · novelty 5.0

A new end-to-end modeling approach for latency-sensitive many-core architectures with globally shared L1 SPM tracks RTL golden models within 7% error while running up to 115x faster and supports profiling for design optimization.

Secure eFPGA-Enabled Edge LLM Inference: Architectural and Hardware Countermeasures

cs.CR · 2026-04-24 · unverdicted · novelty 5.0

A hybrid ASIC+eFPGA architecture is proposed to add adaptive security mechanisms to edge LLM inference while retaining ASIC efficiency.

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

cs.LG · 2026-04-23 · unverdicted · novelty 2.0

The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.

citing papers explorer

Showing 7 of 7 citing papers.

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility · cs.LG · 2026-05-19 · unverdicted · none · ref 12 · 2 links
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration · cs.LG · 2026-05-11 · unverdicted · none · ref 29 · 2 links
A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need · cs.LG · 2026-04-09 · unverdicted · none · ref 31
FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing · cs.CV · 2025-09-26 · conditional · none · ref 17
Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling · cs.AR · 2026-05-08 · unverdicted · none · ref 1
Secure eFPGA-Enabled Edge LLM Inference: Architectural and Hardware Countermeasures · cs.CR · 2026-04-24 · unverdicted · none · ref 10
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models · cs.LG · 2026-04-23 · unverdicted · none · ref 47
