Better & Faster Large Language Models via Multi-token Prediction

· 2024 · cs.CL · arXiv 2404.19737

Mixed citation behavior. Most common role is background (43%).

25 Pith papers citing it

Background 43% of classified citations

abstract

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
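
The training setup described in the abstract (n output heads on a shared trunk, each predicting one of the following n tokens) can be illustrated with a small sketch. This is not the authors' implementation: the function names `multi_token_targets` and `multi_token_loss` are hypothetical, the heads are stood in for by plain probability dictionaries, positions without n future tokens are simply skipped, and the per-position loss is assumed to be the sum of the per-head cross-entropies.

```python
import math


def multi_token_targets(tokens, n):
    """For each position t, return the n tokens that follow it.

    Assumption of this sketch: positions too close to the end of the
    sequence to have n future tokens are skipped.
    """
    return [tokens[t + 1 : t + 1 + n] for t in range(len(tokens) - n)]


def multi_token_loss(head_probs, targets):
    """Sum of per-head cross-entropies at a single position.

    head_probs[i] is a dict mapping token -> probability, standing in for
    the output of (hypothetical) head i, which predicts the token i + 1
    steps ahead. targets[i] is the true token at that offset.
    """
    return -sum(math.log(head_probs[i][tok]) for i, tok in enumerate(targets))


# Toy corpus and n = 2: every position gets two future tokens as targets.
tokens = ["a", "b", "c", "d", "e"]
targets = multi_token_targets(tokens, n=2)

# Made-up head outputs for position 0 (true targets: "b", then "c").
head_probs = [{"b": 0.5, "x": 0.5}, {"c": 0.25, "x": 0.75}]
loss = multi_token_loss(head_probs, targets[0])
```

With n = 1 the same code reduces to ordinary next-token prediction, which is the baseline the abstract compares against.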

citation-role summary

method: 4 · background: 3

citation-polarity summary

background: 3 · use method: 3 · unclear: 1

representative citing papers

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

cs.CL · 2026-05-10 · conditional · novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

cs.LG · 2026-05-01 · conditional · novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

A Markov Categorical Framework for Language Modeling

cs.LG · 2025-07-25 · unverdicted · novelty 7.0

A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.

FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

FastTab combines a Tiny Recursive Module and axial 1D Transformer encoders to predict table grids, headers, and cell spans directly, achieving competitive accuracy on four benchmarks with low-latency inference.

Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach

cs.IT · 2026-05-15 · unverdicted · novelty 6.0

Characterizes the exact capacity of multi-bit covert LLM watermarking via Gelfand-Pinsker and channel synthesis, then gives a polar-code algorithm achieving 0.375 bits/token at under 10% BER with negligible perplexity impact.

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

cs.CR · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

cs.CL · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

FusionCIM is a fusion-driven CIM accelerator for LLM inference that maps QKT to IP-CIM and PV to OP-CIM, uses QO-stationary dataflow, and applies pattern-aware online softmax, delivering up to 3.86x energy savings and 1.98x speedup on LLaMA-3 at 29.4 TOPS/W.

Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

cs.AI · 2026-03-05 · unverdicted · novelty 6.0

Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.

Proxy Compression for Language Modeling

cs.CL · 2026-02-04 · conditional · novelty 6.0

Proxy compression trains language models on both raw bytes and compressed sequences to enable efficient training with raw-byte inference at test time.

Mirai: Autoregressive Visual Generation Needs Foresight

cs.CV · 2026-01-21 · conditional · novelty 6.0

Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

cs.LG · 2025-08-22 · unverdicted · novelty 6.0

In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

cs.CV · 2026-04-29 · unverdicted · novelty 5.0 · 3 refs

GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

cs.CL · 2025-07-02 · unverdicted · novelty 5.0

LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.

DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

cs.LG · 2026-05-14 · unverdicted · novelty 4.0

DT-Transformer predicts next disease events with median age- and sex-stratified AUC 0.871 across 896 categories on held-out and prospective data from a 1.7M-patient multi-hospital EHR dataset.

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

cs.CL · 2025-08-08 · unverdicted · novelty 4.0

GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

citing papers explorer

Showing 25 of 25 citing papers.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models cs.CL · 2026-05-10 · conditional · none · ref 31 · internal anchor
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts cs.LG · 2026-05-01 · conditional · none · ref 35 · internal anchor
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 17 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
Training Agents Inside of Scalable World Models cs.AI · 2025-09-29 · conditional · none · ref 27 · internal anchor
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
A Markov Categorical Framework for Language Modeling cs.LG · 2025-07-25 · unverdicted · none · ref 13 · internal anchor
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers cs.CV · 2026-05-21 · unverdicted · none · ref 30 · internal anchor
FastTab combines a Tiny Recursive Module and axial 1D Transformer encoders to predict table grids, headers, and cell spans directly, achieving competitive accuracy on four benchmarks with low-latency inference.
Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach cs.IT · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
Characterizes the exact capacity of multi-bit covert LLM watermarking via Gelfand-Pinsker and channel synthesis, then gives a polar-code algorithm achieving 0.375 bits/token at under 10% BER with negligible perplexity impact.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 114 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs cs.LG · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection cs.CR · 2026-05-12 · unverdicted · none · ref 6 · 2 links · internal anchor
TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion cs.CL · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? cs.CL · 2026-04-29 · unverdicted · none · ref 5 · 2 links · internal anchor
KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.
FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture cs.AR · 2026-04-28 · unverdicted · none · ref 17 · internal anchor
FusionCIM is a fusion-driven CIM accelerator for LLM inference that maps QKT to IP-CIM and PV to OP-CIM, uses QO-stationary dataflow, and applies pattern-aware online softmax, delivering up to 3.86x energy savings and 1.98x speedup on LLaMA-3 at 29.4 TOPS/W.
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling cs.AI · 2026-03-05 · unverdicted · none · ref 19 · internal anchor
Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.
Proxy Compression for Language Modeling cs.CL · 2026-02-04 · conditional · none · ref 6 · internal anchor
Proxy compression trains language models on both raw bytes and compressed sequences to enable efficient training with raw-byte inference at test time.
Mirai: Autoregressive Visual Generation Needs Foresight cs.CV · 2026-01-21 · conditional · none · ref 12 · internal anchor
Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling cs.LG · 2025-08-22 · unverdicted · none · ref 21 · internal anchor
In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents cs.CV · 2026-04-29 · unverdicted · none · ref 11 · 3 links · internal anchor
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
GLM-5: from Vibe Coding to Agentic Engineering cs.LG · 2026-02-17 · unverdicted · none · ref 13 · internal anchor
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
MiMo-V2-Flash Technical Report cs.CL · 2026-01-06 · unverdicted · none · ref 19 · internal anchor
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation cs.CL · 2025-07-02 · unverdicted · none · ref 24 · internal anchor
LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.
DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System cs.LG · 2026-05-14 · unverdicted · none · ref 21 · internal anchor
DT-Transformer predicts next disease events with median age- and sex-stratified AUC 0.871 across 896 categories on held-out and prospective data from a 1.7M-patient multi-hospital EHR dataset.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models cs.CL · 2025-08-08 · unverdicted · none · ref 12 · internal anchor
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models cs.IR · 2026-01-30 · unreviewed · ref 19 · internal anchor
