Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding
3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 3
Roles: background 1
Polarities: background 1
citing papers explorer
- FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
  FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, pruning rejects early, and overlapping draft and verification phases via frontiers.
- LLM-assisted Agentic Edge Intelligence Framework
  The LEI framework uses a cloud LLM to dynamically create and update tailored lightweight programs for heterogeneous edge devices; on four sensor datasets it is shown to maintain low CPU and memory use while adapting to changes.
- LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
  LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both the next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.