Dream 7B: Diffusion Large Language Models
Pith reviewed 2026-05-11 16:20 UTC · model grok-4.3
The pith
Dream 7B shows a 7B diffusion language model can outperform prior diffusion models on language, math, and coding tasks while supporting parallel iterative generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising, unlike autoregressive models that generate tokens sequentially. It consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling.
What carries the argument
Discrete diffusion modeling that iteratively denoises an entire token sequence in parallel, supported by initialization from an autoregressive LLM and per-token adaptive noise rescheduling during training.
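As a rough illustration of iterative parallel denoising, the toy sketch below starts from a fully masked sequence and commits the most confident proposals at each step. Everything in it is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in that looks up a known target and attaches random confidences, and the even per-step commit schedule is not taken from the paper.

```python
import math
import random

MASK = "<mask>"


def toy_denoiser(seq, target, rng):
    """Hypothetical stand-in for a trained denoiser.

    For every masked position it proposes a token (here: the known target
    token) together with a random confidence score.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(seq)
        if tok == MASK
    }


def denoise(target, num_steps, seed=0):
    """Refine a fully masked sequence over at most num_steps steps (num_steps >= 1)."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    order = []  # positions in the order they were committed
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # One pass scores all masked positions at once.
        proposals = toy_denoiser(seq, target, rng)
        # Commit an even share of the remaining masked positions.
        k = math.ceil(len(masked) / (num_steps - step))
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
            order.append(i)
    return seq, order


if __name__ == "__main__":
    tokens, commit_order = denoise("the cat sat on the mat".split(), num_steps=3)
    print(tokens)
    print(commit_order)
```

Because positions are committed by confidence rather than by index, the recorded commit order is generally not left to right.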
If this is right
- Diffusion-based language models can reach competitive accuracy on math and coding problems without relying on left-to-right token prediction.
- A single trained model can produce valid output when tokens are generated in any chosen order or when sections of text are missing.
- Users can control the speed versus quality trade-off at inference time by selecting how many denoising steps to run.
- Releasing both a base model and an instruction-tuned version makes these flexible generation modes available for further experimentation.
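To make the infilling and step-count points above more concrete, here is a minimal sketch under stated assumptions. `infill_template` and `unmask_schedule` are hypothetical helpers, not part of any released Dream API, and splitting the masked positions evenly across steps is only one simple choice.

```python
MASK = "<mask>"


def infill_template(prefix, suffix, gap):
    """Build an infilling input: known prefix and suffix around `gap` masked slots."""
    return list(prefix) + [MASK] * gap + list(suffix)


def unmask_schedule(num_masked, num_steps):
    """Split num_masked positions as evenly as possible over the denoising steps.

    Returns how many positions are committed at each step. Using more steps
    than masked positions brings no benefit, so the step count is capped.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_masked == 0:
        return []
    steps = min(num_steps, num_masked)
    base, extra = divmod(num_masked, steps)
    return [base + 1 if s < extra else base for s in range(steps)]


if __name__ == "__main__":
    template = infill_template(["def", "add", "("], [")", ":"], gap=4)
    print(template)
    for steps in (1, 2, 4):
        print(steps, unmask_schedule(template.count(MASK), steps))
```

Fewer steps means more positions committed per forward pass, which is the fast end of the trade-off. With as many steps as masked positions, one position is committed per pass.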
Where Pith is reading between the lines
- The parallel refinement process may eventually allow diffusion models to handle long-range planning tasks with less accumulation of early errors than sequential models.
- If the adaptive noise technique generalizes, similar rescheduling could improve training stability for diffusion models in other domains such as images or audio.
- The ability to infill and reorder tokens suggests diffusion language models could serve as a natural fit for interactive editing interfaces where users revise parts of a draft.
- Further scaling of this approach might reveal whether diffusion models can close the remaining gap with autoregressive models on broad knowledge benchmarks.
Load-bearing premise
The combination of autoregressive model initialization and context-adaptive token-level noise rescheduling is sufficient to produce the reported performance gains and new generation capabilities at 7B scale.
What would settle it
Train a 7B-scale discrete diffusion language model using the same data and architecture but without autoregressive initialization or context-adaptive noise rescheduling, then compare its scores on the same general, math, and coding benchmarks to those reported for Dream 7B.
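One way to lay out such a comparison is a 2x2 factorial grid over the two techniques, which goes slightly beyond the single control run described above. The sketch below only enumerates the runs. The field names and the `ablation_grid` helper are hypothetical and do not correspond to any configuration format used by the authors.

```python
from itertools import product


def ablation_grid(scale="7B"):
    """Enumerate every on/off combination of the two training techniques."""
    configs = []
    for ar_init, adaptive_noise in product([True, False], repeat=2):
        configs.append(
            {
                "name": f"ar_init={ar_init},adaptive_noise={adaptive_noise}",
                "ar_init": ar_init,  # initialize from an autoregressive LLM
                "adaptive_noise": adaptive_noise,  # context-adaptive noise rescheduling
                "scale": scale,  # same data and architecture in every run
            }
        )
    return configs


if __name__ == "__main__":
    for cfg in ablation_grid():
        print(cfg["name"])
```

The run with both flags enabled corresponds to the reported model. The other three isolate each technique and their interaction.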
Original abstract
We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dream 7B, a 7B-parameter discrete diffusion language model that generates via iterative parallel denoising rather than sequential autoregressive prediction. It claims consistent outperformance over prior diffusion LLMs on general, mathematical, and coding benchmarks, plus new inference capabilities (arbitrary-order generation, infilling, tunable quality-speed trade-offs) obtained through AR-based LLM initialization and context-adaptive token-level noise rescheduling. Dream-Base and Dream-Instruct variants are released.
Significance. If the performance and capability claims are substantiated with rigorous controls, the work would represent a meaningful step toward practical non-autoregressive LLMs at scale, demonstrating that diffusion models can achieve competitive results on reasoning-heavy tasks while offering inference flexibility unavailable to standard AR models. The release of the models would further enable community exploration of diffusion-based language modeling.
major comments (2)
- [Experiments] Experiments section: the manuscript presents overall benchmark results for Dream 7B but provides no controlled ablations that isolate the contribution of AR-based LLM initialization or context-adaptive token-level noise rescheduling at the 7B scale (e.g., training otherwise identical 7B diffusion models with these components disabled). Without such ablations, the central attribution of the reported gains and new capabilities to these specific techniques remains unsupported.
- [Results] Results tables (general/math/coding benchmarks): while aggregate outperformance is asserted, the paper does not report per-task breakdowns, statistical significance tests, or comparisons against strong AR baselines of comparable size and training compute, making it difficult to assess whether the diffusion approach truly closes the gap or merely matches prior diffusion models.
minor comments (2)
- [Abstract] Abstract: quantitative results, benchmark names, and exact metrics are omitted, forcing readers to consult the full text for any concrete evidence of the claimed outperformance.
- [Method] Method description: the precise formulation of the context-adaptive token-level noise rescheduling (e.g., the functional form of the schedule and how context length modulates it) should be given explicitly, ideally with pseudocode or an equation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions or experimental scope.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the manuscript presents overall benchmark results for Dream 7B but provides no controlled ablations that isolate the contribution of AR-based LLM initialization or context-adaptive token-level noise rescheduling at the 7B scale (e.g., training otherwise identical 7B diffusion models with these components disabled). Without such ablations, the central attribution of the reported gains and new capabilities to these specific techniques remains unsupported.
Authors: We agree that controlled ablations at the full 7B scale would offer the strongest isolation of each technique's contribution. Training multiple independent 7B diffusion models from scratch exceeds our available compute budget. However, we conducted systematic ablations at the 1B scale (reported in the appendix) that isolate the effects of AR initialization and context-adaptive noise scheduling, showing consistent gains that align with the 7B results. In the revised manuscript we will (i) move these 1B ablations into the main text, (ii) add a dedicated limitations paragraph discussing the computational constraints on 7B-scale ablations, and (iii) reference prior smaller-scale studies that motivated the design choices. We believe the combination of smaller-scale evidence, scaling behavior, and public model release still supports the attribution while transparently noting the limitation.
Revision: partial
-
Referee: [Results] Results tables (general/math/coding benchmarks): while aggregate outperformance is asserted, the paper does not report per-task breakdowns, statistical significance tests, or comparisons against strong AR baselines of comparable size and training compute, making it difficult to assess whether the diffusion approach truly closes the gap or merely matches prior diffusion models.
Authors: We accept that the current presentation can be improved. In the revision we will add (a) per-task score tables in an expanded appendix, (b) bootstrap-based statistical significance tests with 95% confidence intervals for all reported averages, and (c) a new subsection comparing Dream 7B against publicly documented 7B-scale AR models (e.g., Llama-2-7B, Mistral-7B) on the identical benchmark suites, while explicitly stating differences in training data and objective. Our primary claim remains outperformance over prior diffusion LLMs; we do not assert superiority over state-of-the-art AR models. These additions will allow readers to evaluate the gap-closing question directly.
Revision: yes
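The bootstrap-based 95% confidence intervals mentioned in response (b) could be computed along the following lines. This is a sketch of the standard percentile bootstrap, assuming per-item 0/1 correctness scores for a single benchmark. The function name, resample count, and seed are illustrative choices, not details from the paper or the rebuttal.

```python
import random
import statistics


def bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return statistics.fmean(scores), lower, upper


if __name__ == "__main__":
    # Illustrative data only: 70 correct answers out of 100 items.
    per_item = [1] * 70 + [0] * 30
    mean, lower, upper = bootstrap_ci(per_item)
    print(f"accuracy={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling at the item level captures evaluation-set sampling noise only. It does not account for variance across training runs or decoding seeds.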
Circularity Check
No circularity; empirical model introduction with no derivations or self-referential reductions
Full rationale
The manuscript presents Dream 7B as an empirical contribution, with performance claims resting on benchmark evaluations after applying AR-based LLM initialization and context-adaptive token-level noise rescheduling. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central attribution of gains to the listed training techniques is not shown to reduce by construction to prior fitted quantities or self-citations; it remains an empirical assertion open to external verification via ablations or reproduction. No steps match any of the enumerated circularity patterns.
Forward citations
Cited by 60 Pith papers
-
Backdooring Masked Diffusion Language Models
SHADOWMASK backdoors MDLMs by modifying the forward corruption process with a trigger-mask mixture, achieving near-100% attack success while preserving clean utility on DiT-based and LLaDA models.
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
-
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
-
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
-
NPU Design for Diffusion Language Model Inference
Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
-
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
-
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...
-
Learnability-Informed Fine-Tuning of Diffusion Language Models
LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
-
Drifting Objectives for Refining Discrete Diffusion Language Models
TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.
-
Machine Unlearning for Masked Diffusion Language Models
MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.
-
Constrained Code Generation with Discrete Diffusion
Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to stee...
-
Dynamic Chunking for Diffusion Language Models
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
-
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
-
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
-
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
DiHAL uses geometry proxies to pick where to replace the lower layers of a pretrained transformer with a diffusion bridge for hidden-state reconstruction, improving over token-level diffusion baselines on 8B models.
-
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...
-
Support Before Frequency in Discrete Diffusion
Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
-
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
-
Multi-Token Residual Prediction
MRP predicts logit residuals from hidden states to support dependency-aware multi-token denoising in a single forward pass for diffusion language models, yielding up to 1.42× lossless speedup on SDAR models.
-
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...
-
DiffScore: Text Evaluation Beyond Autoregressive Likelihood
DiffScore is a bidirectional masked-diffusion evaluation framework that measures text recoverability across masking rates and outperforms autoregressive baselines on ten benchmarks.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
-
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
-
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
-
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
-
Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.
-
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
-
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models
Diffusion LLMs hallucinate more than autoregressive models and display distinct failure modes including premature termination, incomplete denoising, and context intrusion.
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
-
MARS: Enabling Autoregressive Models Multi-Token Generation
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
MemDLM: Memory-Enhanced DLM Training
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
-
Attention-Based Sampler for Diffusion Language Models
Attn-Sampler decodes diffusion language models by selecting tokens in descending order of attention column sums, yielding higher quality and more parallel generation than token-level greedy baselines.
-
Improving Sampling for Masked Diffusion Models via Information Gain
Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.
-
TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration
TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.
-
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
-
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, ...
-
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
PartDiffuser is a semi-autoregressive discrete diffusion framework that generates high-fidelity 3D meshes from point clouds by combining inter-part autoregression with intra-part parallel diffusion using a part-aware ...
-
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
CreditDecoding accelerates parallel decoding in diffusion LLMs by fusing accumulated Trace Credit with current logits to accept early-correct tokens sooner, yielding up to 5.48x speedup and accuracy gains.
-
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
-
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
DiLaDiff augments masked diffusion LMs with latent space modeling and consistency distillation to improve token correlation capture and inference speed.
-
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
-
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
-
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.