DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models

Jiangtao Feng; Lingpeng Kong; Mukai Li; Shansan Gong; Zhiyong Wu

arxiv: 2210.08933 · v3 · pith:VA65K7WXnew · submitted 2022-10-17 · 💻 cs.CL · cs.LG

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong

Pith reviewed 2026-05-20 06:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords diffusion models, sequence-to-sequence generation, text generation, conditional generation, generative models, natural language processing, output diversity

The pith

DiffuSeq adapts diffusion processes to discrete text tokens for conditional sequence generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion models, successful in continuous domains, can be extended to sequence-to-sequence text generation by treating token sequences through a noise-adding and denoising process. It shows through broad evaluations that this yields results on par with or better than autoregressive and non-autoregressive baselines, including pre-trained models, while producing notably more diverse outputs. A sympathetic reader cares because dominant text generation methods often trade off diversity for quality, and a working diffusion alternative could open new paths for controllable and varied language output. The work supports its claims with both empirical benchmarks across multiple tasks and a theoretical analysis that links DiffuSeq to existing model families.

Core claim

DiffuSeq is a diffusion model for sequence-to-sequence text generation that adapts continuous diffusion to discrete tokens. Extensive tests on a wide range of Seq2Seq tasks show it matches or exceeds six baselines, including state-of-the-art pre-trained language models, while generating outputs with high diversity. Theoretical analysis further reveals connections between this approach and both autoregressive and non-autoregressive models.

What carries the argument

DiffuSeq, a diffusion model that performs conditional generation by reversing a gradual noising process applied to text token sequences.
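The noising half of that process can be sketched as below. This is a minimal illustration, not the authors' code: it assumes a standard Gaussian forward process on token embeddings with a linear beta schedule, and it assumes that only the target positions of a concatenated source-plus-target sequence are noised while the source stays clean as the condition. The names `make_alphas_cumprod`, `forward_noise`, `target_mask`, and the toy shapes are hypothetical.

```python
import numpy as np

def make_alphas_cumprod(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(z0, target_mask, t, alphas_cumprod, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(a_bar_t) * z_0, (1 - a_bar_t) * I).

    z0:          (seq_len, dim) embeddings of the concatenated source and target tokens
    target_mask: (seq_len,) bool array, True where the position belongs to the target
    Only target positions are noised; source positions are copied through unchanged
    (an assumption of this sketch).
    """
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * noise
    return np.where(target_mask[:, None], zt, z0)

# Toy example: 3 source positions followed by 3 target positions, embedding dim 4.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((6, 4))
target_mask = np.array([False, False, False, True, True, True])
alphas_cumprod = make_alphas_cumprod(1000)
zt = forward_noise(z0, target_mask, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
```

At the last step the cumulative signal coefficient is close to zero, so the target half of `zt` is nearly pure noise while the source half still equals `z0`. Generation would run a learned denoiser in the opposite direction, which this sketch does not attempt to model.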

If this is right

Diffusion models become a practical option for conditional text tasks that value output variety alongside accuracy.
Sequence generation can proceed without the sequential decoding constraints typical of autoregressive models.
Theoretical links to autoregressive and non-autoregressive families enable hybrid modeling strategies.
High diversity in generations offers a direct advantage for applications such as dialogue or paraphrasing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same discretization technique might transfer to other structured discrete domains like source code or molecular sequences.
Greater output diversity could reduce the need for post-hoc sampling techniques in creative text applications.
Combining the diffusion backbone with large-scale pre-training might further close any remaining quality gaps.

Load-bearing premise

The continuous diffusion process can be adapted to discrete text tokens in a way that avoids artifacts and preserves strong performance on real conditional generation benchmarks.

What would settle it

If DiffuSeq produces markedly lower quality or less diverse outputs than the autoregressive baselines on standard benchmarks such as machine translation or summarization datasets, the central performance claim would be refuted.
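To make the diversity half of that test concrete, here is a small sketch of distinct-n, one commonly used diversity proxy (the ratio of unique n-grams to total n-grams across a set of generations). This page does not list the metrics the paper actually reports, so treating distinct-n as the yardstick is an assumption, and the function name `distinct_n` and the toy samples are hypothetical.

```python
from collections import Counter

def distinct_n(samples, n=2):
    """Ratio of unique n-grams to total n-grams over a list of tokenized samples."""
    ngrams = Counter()
    for tokens in samples:
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # An empty sample set has no n-grams; report 0.0 rather than dividing by zero.
    return len(ngrams) / total if total else 0.0

# Three generations for the same input: all identical versus all different.
identical = [["how", "are", "you"]] * 3
varied = [["how", "are", "you"], ["how", "do", "you", "do"], ["what", "is", "up"]]

low_diversity = distinct_n(identical, n=2)
high_diversity = distinct_n(varied, n=2)
```

A higher value means fewer repeated n-grams across samples. A full comparison would pair such a score with a quality metric computed against references, since diversity alone can be inflated by low-quality output.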

Original abstract

Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks. Code is available at https://github.com/Shark-NLP/DiffuSeq

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiffuSeq adapts diffusion to conditional Seq2Seq text generation and reports competitive results plus higher diversity, but the continuous embedding to discrete token mapping needs closer checks.

DiffuSeq shows that you can take diffusion models and get them to handle conditional sequence-to-sequence text generation at a level that competes with standard baselines, while also giving higher output diversity. The new part is applying this to the conditional Seq2Seq setting, which the abstract says hasn't been explored much in the cited work. They run experiments across a wide range of tasks and compare against six baselines, one of which is a strong pre-trained model. The results claim comparable or better performance, and they point out the diversity as an advantage for many Seq2Seq applications. There's also a theoretical section that connects this diffusion approach to autoregressive and non-autoregressive models, which gives some framing for how it fits into existing ideas.

What they do well is the broad evaluation and the release of code, which lets others check the details. The theoretical linkage is a plus for understanding.

The softer area is the way they handle the discrete nature of text. They appear to perform the diffusion in continuous embedding space and then round or take argmax to get back to tokens during the process. This could introduce artifacts, especially in how well the model conditions on the input for generation quality. The stress-test concern about whether this discretization degrades the reverse process more than acknowledged seems plausible, and the moderate soundness score reflects that we need the full methods to verify the evaluation protocol and data handling.

This paper would interest researchers looking at new ways to do text generation outside the usual autoregressive transformer setup, particularly those focused on diversity or non-autoregressive sampling. A reader in that area could get useful ideas from the empirical results and the theoretical connections.

It should go to peer review because the application is fresh, the claims are specific, and the work is grounded enough to benefit from referee feedback on the implementation choices.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiffuSeq, a diffusion model for conditional sequence-to-sequence text generation that performs the diffusion process in continuous embedding space before mapping back to discrete tokens via rounding or argmax. It reports empirical results across multiple Seq2Seq tasks showing performance comparable to or better than six baselines (including a pre-trained LM-based SOTA), highlights high generation diversity as an advantage, and provides a theoretical analysis connecting the approach to autoregressive and non-autoregressive models. Code is released publicly.

Significance. If the central empirical claims hold after addressing discretization effects, the work would establish diffusion models as a viable alternative paradigm for conditional text generation, with strengths in diversity and a reproducible implementation that enables further exploration. The theoretical connection to AR/NAR models is a useful framing, though its independence from the empirical results would benefit from clearer separation.
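For orientation, the two factorizations that the theoretical analysis contrasts can be written as follows. The fully-NAR form follows Eq. (19) as it appears in the extracted appendix fragments in the reference section of this page; the AR form is the standard chain-rule factorization. The notation, with $w^x$ for the source sequence and $w^y_{1:n}$ for the target sequence, is inferred from that fragment and should be read as an assumption about the paper's conventions.

```latex
\begin{align}
p_{\text{AR}}(w^y_{1:n} \mid w^x) &= \prod_{i=1}^{n} p(w^y_i \mid w^y_{1:i-1}, w^x), \\
p_{\text{fully-NAR}}(w^y_{1:n} \mid w^x) &= \prod_{i=1}^{n} p(w^y_i \mid w^x).
\end{align}
```

The only difference is the conditioning on previously generated target tokens $w^y_{1:i-1}$, which the fully-NAR form drops in exchange for parallel decoding.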

major comments (2)

[§3.2 and Algorithm 1] The reverse diffusion process maps continuous embeddings back to discrete tokens via argmax/rounding at each step; without reported analysis of embedding-to-token distances or ablations showing that this mapping does not systematically degrade conditional generation quality, it is unclear whether the performance gains over AR/NAR baselines are attributable to diffusion or to artifacts of the discretization procedure.
[§4, experimental results] The claim of comparable or superior performance to a pre-trained LM baseline requires explicit confirmation that data splits, hyperparameter tuning, and evaluation protocols match those of the baselines exactly; any post-hoc adjustments could affect the central performance comparison.
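The embedding-to-token distance analysis requested in the first major comment could start from something like the sketch below. It assumes that rounding means picking the nearest token embedding under L2 distance, which may differ from the mapping the paper actually uses; `round_to_tokens`, `embedding_table`, and the toy data are hypothetical.

```python
import numpy as np

def round_to_tokens(z, embedding_table):
    """Map each continuous vector to its nearest token embedding under L2 distance.

    z:               (seq_len, dim) denoised vectors
    embedding_table: (vocab_size, dim) token embeddings
    Returns (token_ids, distances), where distances[i] is the L2 distance
    from z[i] to the embedding of the token chosen for position i.
    """
    diffs = z[:, None, :] - embedding_table[None, :, :]   # (seq_len, vocab_size, dim)
    dists = np.linalg.norm(diffs, axis=-1)                # (seq_len, vocab_size)
    token_ids = dists.argmin(axis=-1)
    return token_ids, dists[np.arange(len(z)), token_ids]

# Toy check: lightly perturbed embeddings should round back to their own tokens.
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((50, 8))
true_ids = np.array([3, 17, 42, 8])
z = embedding_table[true_ids] + 0.05 * rng.standard_normal((4, 8))
token_ids, distances = round_to_tokens(z, embedding_table)
```

Logging `distances` at each reverse step, and comparing runs with and without intermediate rounding, would give the kind of ablation the comment asks for.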

minor comments (2)

[§3] Notation for the forward and reverse processes could be aligned more consistently with standard diffusion literature to aid readability.
[Figures in §4] Figure captions and axis labels in the diversity and quality plots should explicitly state the metrics and number of samples used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

Referee: [§3.2 and Algorithm 1] The reverse diffusion process maps continuous embeddings back to discrete tokens via argmax/rounding at each step; without reported analysis of embedding-to-token distances or ablations showing that this mapping does not systematically degrade conditional generation quality, it is unclear whether the performance gains over AR/NAR baselines are attributable to diffusion or to artifacts of the discretization procedure.

Authors: We appreciate the referee's emphasis on the discretization step. Section 3.2 and Algorithm 1 describe the forward diffusion in continuous embedding space followed by rounding or argmax in the reverse process. The manuscript reports strong empirical results across tasks, but we agree that explicit analysis of embedding-to-token distances and ablations isolating the discretization effect would better attribute performance to the diffusion process. We will add these analyses and ablations to the revised manuscript. Revision: yes.

Referee: [§4, experimental results] The claim of comparable or superior performance to a pre-trained LM baseline requires explicit confirmation that data splits, hyperparameter tuning, and evaluation protocols match those of the baselines exactly; any post-hoc adjustments could affect the central performance comparison.

Authors: We confirm that the experiments used identical data splits, hyperparameter tuning ranges, and evaluation metrics as the original baseline papers, including the pre-trained LM system. To eliminate any ambiguity, we will expand Section 4 with an explicit subsection documenting these matching protocols and citing the exact settings from each baseline. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity: empirical results and theoretical connections remain independent of fitted inputs.

The paper grounds its claims in external Seq2Seq benchmark evaluations against six baselines (including pre-trained models) and presents a separate theoretical analysis connecting DiffuSeq to AR/NAR models. No derivation, equation, or self-citation reduces the reported performance or diversity properties to quantities defined by the same fitted parameters or by construction. The continuous-to-discrete embedding adaptation is a methodological choice whose effects are assessed empirically rather than tautologically assumed. This is the common case of a self-contained paper whose central results do not collapse into input definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit list of fitted parameters or new axioms; the central claim rests on the unstated assumption that a continuous diffusion process can be mapped to discrete token sequences without loss of conditional modeling power.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Modeling with Flux Matching
cs.LG 2026-05 unverdicted novelty 8.0

Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices be...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Masked Diffusion Decoding as $x$-Prediction Flow
cs.CL 2026-06 unverdicted novelty 7.0

Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.
Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models
cs.CL 2026-06 unverdicted novelty 7.0

Prefilling-dLLM partitions prefixes into chunks, caches KV representations, and applies sparse top-K selection during decoding to cut dLLM inference complexity to quadratic in decode length only.
Continuous Language Diffusion as a Decoder-Interface Problem
cs.CL 2026-06 unverdicted novelty 7.0

Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated a...
Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster
stat.ML 2026-05 unverdicted novelty 7.0

FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.
AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
cs.AI 2026-05 unverdicted novelty 7.0

AnchorDiff is a topology-aware masked diffusion framework with RadGraph anchors and confidence-based rewriting that claims state-of-the-art results on MIMIC-CXR and MIMIC-RG4 for radiology report generation.
Fusing Urban Structure and Semantics: A Conditional Diffusion Model for Cross-City OD Matrix Generation
cs.LG 2026-05 unverdicted novelty 7.0

SEDAN fuses graph-based urban semantics and spatial structure inside a conditional diffusion model to generate behaviorally plausible and geographically coherent OD matrices, reporting a 7.38% RMSE gain over the WEDAN...
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0

LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 unverdicted novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
cs.LG 2026-02 unverdicted novelty 7.0

Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
cs.LG 2025-06 conditional novelty 7.0

BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
A PDE-Based Framework for Generative Modeling Beyond Classical Score-Based Diffusion
math.NA 2026-07 unverdicted novelty 6.0

Nonlinear modification of Ornstein-Uhlenbeck dynamics produces condensation in a Fokker-Planck equation and yields a stabilized reverse PDE that reconstructs the initial distribution for generative modeling.
SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing
cs.CL 2026-06 unverdicted novelty 6.0

SLIM-RL matches or exceeds TraceRL performance on MATH500, GSM8K, MBPP and HumanEval for diffusion LLMs by risk-budgeted random-masking RL without trajectory slicing.
Posterior Refinement: Fast Language Generation via Any-Order Flow Maps
cs.CL 2026-06 unverdicted novelty 6.0

FMLM+ with Posterior Refinement bridges masked diffusion and flow map models to match discrete baseline quality in language generation using 32x fewer neural function evaluations via posterior scoring and refinement.
Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
cs.CL 2026-06 unverdicted novelty 6.0

AGDO improves dLLM reasoning performance by determining denoising order and emphasizing tokens based on attention-derived dependencies rather than random masking.
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
cs.LG 2026-06 unverdicted novelty 6.0

FAIR-Calib is a frontier-aware instability-reweighted calibration framework for PTQ of dLLMs that minimizes reweighted hidden-state MSE to reduce frontier decision flips.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
Mixing Times of Glauber Dynamics on Masked Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Analysis of Glauber dynamics on masked language models shows O(n log n) mixing under bounded cross-token influence and metastability with exponential escape times at low temperatures, plus empirical phase transitions.
Coupling Models for One-Step Discrete Generation
cs.LG 2026-05 unverdicted novelty 6.0

Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG 2026-04 unverdicted novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 conditional novelty 6.0

Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL 2025-12 unverdicted novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
Algebraic Language Models for Inverse Design of Metamaterials via Diffusion Transformers
cs.CE 2025-07 unverdicted novelty 6.0

DiffuMeta uses diffusion transformers and algebraic language representations to generate diverse 3D shell metamaterials with targeted stress-strain responses under large deformations including buckling and contact.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance
cs.LG 2026-06 unverdicted novelty 5.0

GCD uses diffusion model priors to guide suffix search, achieving higher attack success rates with better semantic adherence and lower detection than GCG-style methods.
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and vid...
Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
cs.SE 2026-05 unverdicted novelty 4.0

Static checking rewards and moderate AST-based hints improve diffusion RL performance for code generation, with effectiveness varying by task difficulty across HumanEval, MBPP, and LiveCodeBench.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 31 Pith papers · 5 internal anchors

[1]

Diffusiondet: Diffusion model for object detection

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202,

work page arXiv
[2]

Quasar: Datasets for Question Answering by Search and Reading

Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Mask-predict: Parallel decoding of conditional masked language models

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6112–6121. Association for Computa-...

work page 2019
[4]

Classifier-Free Diffusion Guidance

Published as a conference paper at ICLR 2023. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Statistical significance tests for machine translation evaluation

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 388–395,

work page 2004
[6]

Hashimoto

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217,

work page arXiv
[7]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022a. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mah...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962, 2019

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962,

work page arXiv 1908
[10]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Learning discourse-level diversity for neural dialog models using conditional variational autoencoders

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,

work page 2023
[12]

Following Ho et al

A OBJECTIVE DERIVATIONS OF DIFFUSEQ. The diffusion model is well-known as its ability to achieve the trade-off between flexibility and tractability of the models’ probability distributions, compared with GAN, VAE and Flow-based models. Following Ho et al. (2020); Nichol & Dhariwal (2021); Song et al. (2020)...

work page 2023
[13]

To make a better analogy to AR and NAR models, we use a lossless way to formulate iterative NAR models (Gu et al., 2019; Ghazvininejad et al.,

learn the conditional probability given independent assumption for fast inference: $p_{\text{fully-NAR}}(w^y_{1:n} \mid w^x) = \prod_{i=1,\ldots,n} p(w^y_i \mid w^x)$ (19). To make a better analogy to AR and NAR models, we use a lossless way to formulate iterative NAR models (Gu et al., 2019; Ghazvininejad et al.,

work page 2019
[14]

This gap is mainly responsible for the performance drop from AR to NAR models

shows that there is a gap called conditional total correlation between AR and fully-NAR learning paradigms, because of the lossy decomposition of NAR models. This gap is mainly responsible for the performance drop from AR to NAR models. However, when comparing iter-NAR, Eq. (20), with AR models, they both can be factorized into an initial prediction ter...

work page 2022
[15]

From DiffuSeq to iterative NAR and diffusion models

C FROM DIFFUSEQ TO ITERATIVE NAR AND DIFFUSION MODELS. From DiffuSeq to Iterative NAR: We show how to derive DiffuSeq to iterative non-autoregressive model on discrete space. [Figure residue: panels labeled AR, Fully-NAR, Iter-NAR, DiffuSeq] ...

work page 2023
[16]

For other baselines, there is no explicit factor to control the diversity generation, so we leave them as single points in the figure

For DiffuSeq, we choose trained models at different training steps to achieve different trade-off points. For other baselines, there is no explicit factor to control the diversity generation, so we leave them as single points in the figure. Footnote 4: Including top-p sampling, temperature, diversity beam search (DBS) and etc. Implement using HuggingFace Transform...

work page 2023
