Title resolution pending

Zhongzhi Yu, Zheng Wang, Yuhan Li, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reddy Bommu, Yang Zhao, Yingyan Lin · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

cs.DC · 2026-01-15 · unverdicted · novelty 6.0

WISP suppresses wasted drafting time and verification interference in edge-cloud speculative LLM serving through dynamic drafting and SLO-aware batching, delivering up to 2.1x capacity and 1.94x goodput gains over centralized and prior baselines.

ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge--Cloud Speculative LLM Serving

cs.DC · 2026-04-08 · unverdicted · novelty 5.0

ConfigSpec shows that optimal configurations for speculative LLM inference conflict across goodput (favoring smallest drafters at device-specific K=2-10), cost (favoring largest drafters at K=2), and energy (favoring smallest drafters at K=2), requiring profiling-based selection instead of fixed or

citing papers explorer

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding cs.AR · 2026-04-27 · unverdicted · none · ref 65
WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching cs.DC · 2026-01-15 · unverdicted · none · ref 37
ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge--Cloud Speculative LLM Serving cs.DC · 2026-04-08 · unverdicted · none · ref 17
