Set block decoding is a language model inference accelerator

Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman · 2025 · arXiv 2509.04185

3 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

FBS: Modeling Native Parallel Reading inside a Transformer

cs.AI · 2026-01-29 · unverdicted · novelty 6.0

FBS introduces a causal trainable loop via PAW, CH, and SG modules to model native parallel reading in Transformers, yielding a better quality-efficiency trade-off on benchmarks, with complementary ablations.

citing papers explorer

Showing 3 of 3 citing papers.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting · cs.CL · 2026-05-08 · unverdicted · none · ref 26 · 2 links
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion · cs.LG · 2026-05-12 · unverdicted · none · ref 10 · 2 links
FBS: Modeling Native Parallel Reading inside a Transformer · cs.AI · 2026-01-29 · unverdicted · none · ref 1
