FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
arXiv preprint arXiv:2403.02181 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
method 1polarities
use method 1representative citing papers
Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.
ART replaces uniform attention in shallow LLM layers with local attention patterns to reduce hallucinations across multiple model architectures.
citing papers explorer
-
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
-
Two-dimensional early exit optimisation of LLM inference
Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.
-
Uncovering the Latent Potential of Deep Intermediate Representations
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
-
The Generalization Ridge: Information Flow in Natural Language Generation
InfoRidge reveals a non-monotonic pattern in which predictive mutual information between hidden states and outputs peaks in intermediate layers before declining in final layers.
-
ART: Attention Replacement Technique to Improve Factuality in LLMs
ART replaces uniform attention in shallow LLM layers with local attention patterns to reduce hallucinations across multiple model architectures.