On 6000 Qwen3-8B AIME traces, late-clustered moderate-to-severe backtracks are more common in incorrect outputs, enabling prefix-causal burst-aware filtering that outperforms fixed-length cutoffs at shallow and intermediate depths.
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.
CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to
DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.
CascadeFormer tapers Transformer width with depth based on gradient fan-in asymmetry to match uniform baselines in perplexity while cutting latency 8.6%.
DABS is a single-pass framework that builds a depth-ordered substrate from one Transformer encoding and performs lightweight aspect-conditioned readout, cutting computation by up to 60% on multi-aspect ATSA benchmarks while matching prior accuracy.
ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
Looped-MoE models scale better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
AdaPLD adaptively mixes lexical and semantic retrieval with branched reuse to improve model-free speculative decoding and reports up to 3.10x speedup across benchmarks.
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
citing papers explorer
- All is Not Lost: LLM Recovery without Checkpoints