LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A · 2024 · DOI 10.18653/v1/2024.acl-long.681

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

On 6000 Qwen3-8B AIME traces, late-clustered moderate-to-severe backtracks are more common in incorrect outputs, enabling prefix-causal burst-aware filtering that outperforms fixed-length cutoffs at shallow and intermediate depths.

Two-dimensional early exit optimisation of LLM inference

cs.CL · 2026-03-27 · unverdicted · novelty 7.0

Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.

All is Not Lost: LLM Recovery without Checkpoints

cs.DC · 2025-06-18 · conditional · novelty 7.0

CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to

Depth Exploration for LLM Decoding

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.

CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

CascadeFormer tapers Transformer width with depth based on gradient fan-in asymmetry to match uniform baselines in perplexity while cutting latency 8.6%.

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

DABS is a single-pass framework that builds a depth-ordered substrate from one Transformer encoding and performs lightweight aspect-conditioned readout, cutting computation by up to 60% on multi-aspect ATSA benchmarks while matching prior accuracy.

ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

Sparse Layers are Critical to Scaling Looped Language Models

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Looped-MoE models scale better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

cs.CL · 2026-06-04 · unverdicted · novelty 4.0

AdaPLD adaptively mixes lexical and semantic retrieval with branched reuse to improve model-free speculative decoding and reports up to 3.10x speedup across benchmarks.

Token-Operations-Oriented Inference Optimization Techniques for Large Models

cs.SE · 2026-06-18 · unverdicted · novelty 3.0

The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
