Canonical reference

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Aaron Lou, Chenlin Meng, Stefano Ermon · 2023 · stat.ML · arXiv 2310.16834

Canonical reference. 80% of citing Pith papers cite this work as background.

80 Pith papers citing it

Background 80% of classified citations

abstract

Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left-to-right prompting).
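
The abstract names the score entropy loss but does not state it. As a rough illustration only, the sketch below evaluates a score-entropy-style objective on a tiny three-state distribution. It assumes the form $\sum_{y \neq x} w_{xy}\,\big(s(x)_y - r_{xy}\log s(x)_y + K(r_{xy})\big)$ with $r_{xy} = p(y)/p(x)$ and $K(a) = a(\log a - 1)$, averaged over $x \sim p$. That form is recalled from the paper rather than taken from this page, and the function name, weights, and toy distribution are hypothetical choices, not the authors' implementation.

```python
import math

def score_entropy(p, s, w=None):
    """Score-entropy-style loss for ratio estimates s[x][y] ~ p[y] / p[x].

    Assumes every entry of p and s is strictly positive.
    w[x][y] are optional non-negative weights (default 1).
    """
    n = len(p)
    total = 0.0
    for x in range(n):
        for y in range(n):
            if y == x:
                continue  # the sum runs over y != x only
            ratio = p[y] / p[x]
            weight = 1.0 if w is None else w[x][y]
            # K(a) = a (log a - 1) makes each term zero at s = ratio.
            k = ratio * (math.log(ratio) - 1.0)
            term = s[x][y] - ratio * math.log(s[x][y]) + k
            total += p[x] * weight * term
    return total

# Hypothetical toy distribution over three states.
p = [0.5, 0.3, 0.2]
true_ratios = [[p[y] / p[x] for y in range(3)] for x in range(3)]
perturbed = [[1.5 * r for r in row] for row in true_ratios]

loss_true = score_entropy(p, true_ratios)
loss_perturbed = score_entropy(p, perturbed)
print(loss_true, loss_perturbed)
```

Under these assumptions the loss vanishes (up to floating-point error) when the estimates equal the true ratios and is strictly positive for the perturbed estimates, which is the property that lets the objective act as a discrete analogue of score matching.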

citation-role summary

background: 8 · method: 2

citation-polarity summary

background: 8 · use method: 2

representative citing papers

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

cs.AI · 2026-06-28 · conditional · novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

Masked Language Flow Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

Prefilling-dLLM partitions prefixes into chunks, caches KV representations, and applies sparse top-K selection during decoding to cut dLLM inference complexity to quadratic in decode length only.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

AsyncLane decouples refinement from advancement in DLM decoding via lane forking at delimiters plus efficiency optimizations, yielding up to 3x throughput gains on math and code benchmarks without retraining.

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

cs.CR · 2026-06-01 · unverdicted · novelty 7.0

MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.

Variational Learning for Insertion-based Generation

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Introduces the Insertion Process model for variable-length non-monotonic sequence generation via a bijective permutation mapping and permutation-based variational inference.

Free energy Estimation on Any State Space

stat.ML · 2026-05-29 · unverdicted · novelty 7.0

Generalizes neural transport methods for free energy estimation to any state space with added algebraic and group-theoretic results on time reversal and h-transforms.

Machine Unlearning for Masked Diffusion Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

stat.ML · 2026-05-18 · unverdicted · novelty 7.0

FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.

Support Before Frequency in Discrete Diffusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

GD4: Graph-based Discrete Denoising Diffusion for MIMO Detection

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

GD4 is a graph-based discrete denoising diffusion method for MIMO detection that yields higher-quality suboptimal solutions than prior diffusion detectors and classical baselines under similar compute budgets in both under- and over-determined settings.

StyleShield: Exposing the Fragility of AIGC Detectors through Continuous Controllable Style Transfer

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

StyleShield uses flow matching in continuous token embeddings with a DiT backbone to achieve 94.6% evasion on trained detectors and over 99% on unseen ones in Chinese benchmarks, with 0.928 semantic similarity, plus a RateAudit method to arbitrarily control detection rates.

Simple Self-Conditioning Adaptation for Masked Diffusion Models

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

SCMDM is a post-training self-conditioning adaptation for masked diffusion models that reduces generative perplexity by nearly 50% on OWT and improves performance on images, molecules, and genomics.

citing papers explorer

Showing 24 of 24 citing papers after filters.

Masked Language Flow Models cs.CL · 2026-06-26 · unverdicted · none · ref 29 · internal anchor
MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.
Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models cs.CL · 2026-06-09 · unverdicted · none · ref 57 · internal anchor
Prefilling-dLLM partitions prefixes into chunks, caches KV representations, and applies sparse top-K selection during decoding to cut dLLM inference complexity to quadratic in decode length only.
Unified Energy for Invariant and Independent Decoding in Diffusion Language Models cs.CL · 2026-06-08 · unverdicted · none · ref 28 · internal anchor
The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.
AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding cs.CL · 2026-06-07 · unverdicted · none · ref 9 · internal anchor
AsyncLane decouples refinement from advancement in DLM decoding via lane forking at delimiters plus efficiency optimizations, yielding up to 3x throughput gains on math and code benchmarks without retraining.
Machine Unlearning for Masked Diffusion Language Models cs.CL · 2026-05-18 · unverdicted · none · ref 17 · internal anchor
MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 31 · 2 links · internal anchor
TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 25 · internal anchor
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling cs.CL · 2026-04-13 · unverdicted · none · ref 13 · internal anchor
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 16 · internal anchor
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
MemDLM: Memory-Enhanced DLM Training cs.CL · 2026-03-23 · unverdicted · none · ref 3 · internal anchor
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs cs.CL · 2026-03-08 · unverdicted · none · ref 8 · internal anchor
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination cs.CL · 2026-06-16 · unverdicted · none · ref 30 · internal anchor
VoidPadding decouples padding from termination in MDLMs via a new [VOID] token, delivering +17.84 average benchmark points and 55.7% fewer decoding steps on Dream-7B-Instruct.
dMoE: dLLMs with Learnable Block Experts cs.CL · 2026-05-29 · unverdicted · none · ref 49 · internal anchor
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation cs.CL · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
Introduces TSPD with a trajectory-feature controller and training-free CE to reduce denoising steps in dLLMs while aiming to preserve quality.
When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models cs.CL · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
The paper proposes Suffix-Anchored Confidence Modulation, a training-free technique that mitigates misleading confidence signals from EOT tokens and anchor proximity to improve fully non-AR decoding in diffusion language models.
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space cs.CL · 2026-05-14 · unverdicted · none · ref 28 · 3 links · internal anchor
Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion cs.CL · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
DiffuMask: Diffusion Language Model for Token-level Prompt Pruning cs.CL · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
DiffuMask uses a diffusion language model for parallel token-level prompt pruning, achieving up to 80% length reduction with maintained or improved accuracy in reasoning tasks.
Differences in Text Generated by Diffusion and Autoregressive Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 21 · internal anchor
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising cs.CL · 2026-02-18 · conditional · none · ref 29 · 2 links · internal anchor
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens cs.CL · 2026-06-15 · unverdicted · none · ref 6 · internal anchor
ASRD is a training-free revocable decoding framework for diffusion LLMs that decouples context into trusted anchor tokens and uncertain candidates to improve accuracy by up to 6.4% and speed by up to 7.2x on math and coding benchmarks.
Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks cs.CL · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
Chained rewrites by open-weight LLMs reduce watermark detection on diffusion LM outputs from 87.9% to 4.86% after five steps across multiple styles and models.
Attention-Based Sampler for Diffusion Language Models cs.CL · 2026-03-18 · unreviewed · ref 8 · internal anchor
