Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation
Pith reviewed 2026-05-22 06:40 UTC · model grok-4.3
The pith
Uniform diffusion models are optimized by a leave-one-out posterior rather than the direct denoising posterior, and an absorbing-state version matches masked diffusion performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In uniform diffusion models the standard plug-in bridge is optimized by a leave-one-out posterior that predicts each clean token without using its own noisy version. Exact conversions exist between the denoiser output, this leave-one-out posterior, and the score function. These conversions allow an informed predictor-corrector sampler and improved temperature sampling at inference time with no retraining. An absorbing-state reformulation preserves the original UDM joint law while reducing sampling to masked-diffusion-like operations that have simpler posteriors, natural carry-over unmasking, and a built-in remasking mechanism.
What carries the argument
The leave-one-out posterior, which predicts each clean token from all other noisy observations while ignoring its own.
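To make the distinction concrete, here is a minimal sketch (not taken from the paper's code release) on a hypothetical two-token toy model with uniform noise. The names `p0`, `Q`, `denoiser`, and `loo`, and the values of `K` and `alpha`, are illustrative assumptions. The sketch computes both posteriors by brute-force Bayes and checks that the denoising posterior is the leave-one-out posterior reweighted by the likelihood of the token's own noisy observation.

```python
import numpy as np

K, alpha = 3, 0.6                        # toy vocabulary size and signal level (assumed values)
rng = np.random.default_rng(0)

p0 = rng.random((K, K))
p0 /= p0.sum()                           # toy clean joint law p0[x0_1, x0_2]
Q = alpha * np.eye(K) + (1 - alpha) / K  # uniform kernel, Q[x0, xt] = q(xt | x0)

# Full joint over (x0_1, x0_2, xt_1, xt_2); each token is noised independently.
joint = p0[:, :, None, None] * Q[:, None, :, None] * Q[None, :, None, :]

def denoiser(xt1, xt2):
    """Denoising posterior of token 1 given BOTH noisy tokens."""
    w = joint[:, :, xt1, xt2].sum(axis=1)
    return w / w.sum()

def loo(xt2):
    """Leave-one-out posterior of token 1 given only the OTHER noisy token."""
    w = joint[:, :, :, xt2].sum(axis=(1, 2))
    return w / w.sum()

xt1, xt2 = 0, 1
reweighted = Q[:, xt1] * loo(xt2)        # own-token likelihood times leave-one-out posterior
reweighted /= reweighted.sum()
print(np.allclose(denoiser(xt1, xt2), reweighted))
```

In this toy setting the two posteriors differ whenever `alpha > 0`, because the denoiser additionally sees the noisy copy of the token it is predicting.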
If this is right
- Leave-one-out parameterizations improve generation quality for uniform diffusion on language modeling tasks.
- The absorbing-state construction achieves performance that matches or exceeds masked diffusion models.
- An informed predictor-corrector sampler and temperature sampling based on the leave-one-out predictor improve inference without any retraining.
- The conversions between denoiser, leave-one-out posterior, and score disentangle parameterization choices from the training objective.
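The score conversion named in the last bullet can be sketched in a few lines. This is an illustration only, assuming the uniform kernel q(y | x0) = α·1{y = x0} + (1 − α)/K and its affine extension to probability vectors, as quoted in the appendix excerpts on this page; the function name `score_ratios_from_loo` is hypothetical.

```python
import numpy as np

def score_ratios_from_loo(loo_probs, xt_token, alpha):
    """Ratios p_t(y) / p_t(x_t) for every y that differs from x_t only at one position.

    loo_probs: leave-one-out prediction for that position, shape (K,)
    xt_token:  index of the current noisy token at that position
    alpha:     signal level alpha_t of the uniform forward kernel
    """
    K = loo_probs.shape[0]
    # Affine extension of the kernel: q(y | nu) = alpha * nu[y] + (1 - alpha) / K
    q = alpha * loo_probs + (1 - alpha) / K
    return q / q[xt_token]

ratios = score_ratios_from_loo(np.array([0.7, 0.2, 0.1]), xt_token=1, alpha=0.5)
print(ratios)
```

The entry at `xt_token` is always 1, since it compares the current sequence with itself.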
Where Pith is reading between the lines
- The same mismatch between plug-in parameterization and true denoising posterior may appear in other discrete diffusion settings beyond language.
- The absorbing reformulation could reduce implementation complexity when porting uniform diffusion code to new data types.
- Improved sampling from the leave-one-out predictor might generalize to continuous diffusion models that use analogous bridge constructions.
Load-bearing premise
The absorbing-state reformulation keeps exactly the same joint probability law over sequences as the original uniform diffusion process.
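A quick sanity check of a necessary condition for this premise can be sketched as follows. It assumes, for illustration only, a linear schedule α_t = 1 − t and a per-token latent time τ drawn uniformly on (0, 1), so that a token is still clean at time t with probability α_t and is otherwise replaced by a uniform draw. The function names are hypothetical. The check covers a single token at a single time; it does not establish equality of the full joint law.

```python
import numpy as np

def udm_marginal(x0, alpha, K):
    """Uniform-diffusion marginal: alpha * onehot(x0) + (1 - alpha) / K."""
    probs = np.full(K, (1 - alpha) / K)
    probs[x0] += alpha
    return probs

def lifted_sample(x0, t, K, n, rng):
    """Sample x_t through a latent corruption time tau (assumed tau ~ U(0, 1))."""
    tau = rng.random(n)                   # P(tau > t) = 1 - t under this assumption
    noise = rng.integers(0, K, size=n)    # uniform replacement token
    return np.where(tau > t, x0, noise)   # keep x0 while tau > t, else use noise

rng = np.random.default_rng(0)
K, x0, t = 5, 2, 0.3
samples = lifted_sample(x0, t, K, 200_000, rng)
empirical = np.bincount(samples, minlength=K) / samples.size
exact = udm_marginal(x0, alpha=1 - t, K=K)
max_gap = np.abs(empirical - exact).max()
print(max_gap)
```

Agreement here only rules out the simplest failure mode; the referee's concern about finite-time transitions and remasking needs the generator-level argument discussed in the rebuttal.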
What would settle it
Train a standard UDM and a leave-one-out version on the same language dataset, then measure whether the leave-one-out version produces lower perplexity or higher-quality generated text under the same sampling budget.
Original abstract
Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.
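For reference, the conversions mentioned in the abstract can be written out for a single position ℓ. The display below is a reconstruction from the appendix excerpts reproduced further down this page (equations 24, 27 and 28), assuming the uniform forward kernel with vocabulary size K; it is a reading aid, not a quotation of the paper.

```latex
% Assumed uniform forward kernel at position \ell:
q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0)
  = \alpha_t \,\mathbb{1}\{x^{\ell}_t = x^{\ell}_0\} + \frac{1-\alpha_t}{K}

% Denoising posterior from the leave-one-out posterior
% (Bayes' rule on the token's own noisy observation):
p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t)
  \;\propto\; q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0)\,
              p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)

% Inverse direction, available when the kernel is strictly positive
% (every t > 0 in UDM):
p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)
  \;\propto\; \frac{p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t)}
                   {q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0)}
```

Both proportionality signs hide a normalization over the K values of the clean token.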
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in Uniform Diffusion Models (UDM) the standard plug-in bridge parameterization is optimized by a leave-one-out posterior rather than the denoising posterior, identifying a mismatch with the cross-entropy objective. It derives exact conversions between the denoiser, leave-one-out posterior, and score. It further introduces an absorbing-state reformulation that preserves the original UDM joint law while enabling masked-diffusion-like sampling operations with simpler posteriors, carry-over unmasking, and natural remasking. Empirical results on language modeling show consistent improvements from leave-one-out parameterizations and that the absorbing construction matches or surpasses masked diffusion.
Significance. If the derivations hold and the reformulation exactly preserves the joint law, the work clarifies why masked and uniform diffusion differ in practice, attributing gaps more to parameterization and sampling than to marginal choices. The leave-one-out predictor, informed predictor-corrector sampler, and temperature sampling offer training-free inference gains. Code and model release aids reproducibility.
major comments (2)
- Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.
- Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.
minor comments (2)
- Notation for the leave-one-out posterior should be introduced with an explicit equation early in the main text rather than only in the abstract.
- The experimental section would benefit from reporting variance across multiple runs or statistical tests for the reported generation improvements.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review of our manuscript. We address each major comment in detail below. Where the comments identify opportunities for greater clarity or additional supporting material, we have revised the paper accordingly.
-
Referee: Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.
Authors: We thank the referee for this important observation. Section 4 of the manuscript constructs the absorbing-state reformulation by re-expressing the uniform forward process as a mixture of an absorbing state and a masked diffusion process, with the reverse process using carry-over unmasking and a natural remasking step. The construction is designed so that the joint law over clean and noisy sequences remains identical to the original UDM at every finite time. To make the equivalence fully rigorous in the continuous-time setting, we will add a new lemma (Lemma 4.1) that explicitly equates the infinitesimal generators of the two processes and proves that the finite-time marginal distributions coincide for all t, including when the remasking probability is strictly positive. The proof proceeds by showing that the transition rates match exactly and that the resulting Kolmogorov forward equations yield identical solutions. We agree this lemma strengthens the presentation and removes any ambiguity about O(dt) discrepancies.
revision: yes
-
Referee: Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.
Authors: We appreciate the request for an explicit non-tautological check. The model is trained by minimizing cross-entropy against the denoising posterior. The leave-one-out posterior, however, is obtained by analytically removing the contribution of the token’s own noisy observation from the denoiser output (see Equations 3–5). This adjustment is not part of the training loss and therefore yields a distinct predictor. In the revised manuscript we will insert a short remark and a small numerical illustration in Section 3.3: we apply the conversion formulas to a trained denoiser on a validation batch and show that the resulting leave-one-out probabilities differ measurably from the raw denoising probabilities; we further demonstrate that substituting the leave-one-out predictor into the informed predictor-corrector sampler produces the reported generation improvements. These results confirm that the conversions extract usable information beyond what is directly optimized by the training objective.
revision: yes
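The kind of numerical illustration described in the response can be sketched on a single probability vector. This is a hedged example, not the authors' experiment: it implements the two closed-form UDM conversions as they appear in the appendix excerpts on this page (equations 27 and 28), with hypothetical function names and made-up input values.

```python
import numpy as np

def loo_to_denoiser(loo, xt, alpha):
    """Leave-one-out posterior -> denoising posterior (cf. eq. 27 in the excerpt)."""
    K = loo.shape[0]
    out = (1 - alpha) * loo
    out[xt] += K * alpha * loo[xt]       # extra mass on the observed noisy token
    return out / out.sum()

def denoiser_to_loo(den, xt, alpha):
    """Denoising posterior -> leave-one-out posterior (cf. eq. 28 in the excerpt)."""
    K = den.shape[0]
    out = (1 + (K - 1) * alpha) * den
    out[xt] -= K * alpha * den[xt]       # remove the observed token's own evidence
    return out / out.sum()

loo = np.array([0.5, 0.3, 0.2])          # made-up leave-one-out prediction
xt, alpha = 0, 0.5                       # observed noisy token and signal level (assumed)

den = loo_to_denoiser(loo, xt, alpha)
recovered = denoiser_to_loo(den, xt, alpha)
gap = np.abs(den - loo).max()            # how far the two predictors are apart
print(den, recovered, gap)
```

On these inputs the denoiser puts visibly more mass on the observed token than the leave-one-out predictor does, and the round trip recovers the original vector, which is the behaviour the proposed remark in Section 3.3 would need to show on real model outputs.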
Circularity Check
No significant circularity in derivation chain
The paper derives explicit conversions between the denoiser, leave-one-out posterior, and score, then introduces an absorbing-state reformulation asserted to preserve the original UDM joint law while enabling masked-diffusion-style operations. These steps are presented as independent technical results with stated characterizations and empirical outcomes on language modeling, rather than tautological reductions to fitted inputs, self-definitions, or load-bearing self-citations. No equations or claims in the abstract reduce a prediction or central result to its own construction by definition; the derivations appear self-contained against external benchmarks and falsifiable via the reported generation improvements.
Axiom & Free-Parameter Ledger
invented entities (1)
-
absorbing-state reformulation: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior... We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: the absorbing-state reformulation... preserves the UDM joint law
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021
-
[2]
Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022
-
[3]
Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024. URL https://arxiv.org/abs/2402.04997
-
[4]
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014. URL https://arxiv.org/abs/1312.3005
-
[5]
Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. Fast sampling via discrete non-markov diffusion models with predetermined transition time. Advances in Neural Information Processing Systems, 37:106870–106905, 2024
-
[6]
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents, 2018. URL https://arxiv.org/abs/1804.05685
-
[7]
Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo. The diffusion duality, chapter II: Ψ-samplers and efficient curriculum, 2026. URL https://arxiv.org/abs/2602.21185
-
[8]
Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019
-
[9]
Google DeepMind. Gemini Diffusion: Google DeepMind’s experimental research model. https://blog.google/technology/google-deepmind/gemini-diffusion/, May 2025. Accessed: 2026-05-06
-
[10]
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis, 2022. URL https://arxiv.org/abs/2111.14822
-
[11]
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
-
[12]
Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows: Learning categorical distributions with normalizing flows. In Third Symposium on Advances in Approximate Bayesian Inference, 2021
-
[13]
Matthew J Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel gaussian gibbs sampling. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/b51a15f38...
-
[14]
Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. Mercury: Ultra-fast language models based on diffusion,
-
[15]
URL https://arxiv.org/abs/2506.17298
-
[16]
Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, and Rafael Gómez-Bombarelli. Think while you generate: Discrete diffusion with planned denoising. arXiv preprint arXiv:2410.06264, 2024
-
[17]
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, pages 32819–32848. PMLR, 2024
-
[18]
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330,
-
[19]
URL https://aclanthology.org/J93-2004/
-
[20]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843
-
[21]
Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WNvvwK0tut
-
[22]
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025. URL https://arxiv.org/abs/2502.09992
-
[23]
Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024
-
[24]
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031
-
[25]
William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748
-
[26]
Le-Tuyet-Nhi Pham, Dario Shariatian, Antonio Ocello, Giovanni Conforti, and Alain Oliviero Durmus. Discrete markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=biJiSMLGOV
-
[27]
Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models, 2026. URL https://arxiv.org/abs/2604.02718
-
[28]
Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
-
[29]
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892, 2025
-
[30]
Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. arXiv preprint arXiv:2412.10193, 2024
-
[31]
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024
-
[32]
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP
-
[33]
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021
-
[34]
Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=BYWWwSY2G5s
-
[35]
Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, and Antonio Orvieto. Scaling behavior of discrete diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GDYaNzxt9T
-
[36]
Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion, 2025. URL https://arxiv.org/abs/2503.04482
-
[37]
Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling, 2026. URL https://arxiv.org/abs/2503.00307
-
[38]
Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning, 2025. URL https://arxiv.org/abs/2410.14157
-
[39]
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626
-
[40]
Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models, 2025. URL https://arxiv.org/abs/2407.21243
-
[41]
Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024
-
[42]
Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation, 2024. URL https://arxiv.org/abs/2302.05737.
Appendix Outline
The appendix is organized as follows. Appendix A proves the leave-one-out optimality result, gives the conversion formulas between the leave-one-out denoiser, denoiser and score, ...
-
[43]
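Outside this support, the conditional densities may be defined arbitrarily.
Proposition 5. For any $x_t$ such that $p_t(x_t) > 0$,
$$p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t) = \frac{p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)}{p_t(x_t)}. \tag{24}$$
Conversely, suppose that $q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0) > 0$ for any $x_0$ and $x_t$. Then for any $x^{-\ell}_t$ with $p_t(x^{-\ell}_t) > 0$, $p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t) = p_t(x_t)\, p^{\ell}_{0|t}(x^{\ell}$...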
-
[44]
(26)
Therefore, the conversion from the denoiser to the leave-one-out posterior is available exactly when the forward has full support. In UDM this condition is satisfied for every $t > 0$, so the two representations can always be converted into one another. In MDM, by contrast, the condition fails for unmasked positions. If $x^{\ell}_t = m$, the likelihood is const...
-
[45]
For MDM, the conversion is explicit only on the support of the forward process
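$= \alpha_t \langle x^{\ell}_t, x^{\ell}_0 \rangle + \frac{1-\alpha_t}{K}$, the first identity above yields
$$p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}\!\left( \frac{(1-\alpha_t)\, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} + K\alpha_t \langle x^{\ell}_t, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle\, x^{\ell}_t}{1-\alpha_t + K\alpha_t \langle x^{\ell}_t, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle} \right). \tag{27}$$
Conversely, the inverse relation can be written as
$$\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} = \frac{(1 + (K-1)\alpha_t)\, \hat{x}_{0|t}(x_t)^{\ell} - K\alpha_t \langle x^{\ell}_t, \hat{x}_{0|t}(x_t)^{\ell} \rangle\, x^{\ell}_t}{1 + (K-1)\alpha_t - K\alpha_t \langle x^{\ell}_t, \hat{x}_{0|t}(x_t)^{\ell} \rangle}. \tag{28}$$
Where $\hat{x}_{0|t}(x_t)$...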
-
[46]
$= 1-\alpha_t$ for every $x^{\ell}_0 \in V$, so $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})$. If instead $x^{\ell}_t \neq m$, then $q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}$
-
[47]
$= \alpha_t \mathbb{1}\{x^{\ell}_0 = x^{\ell}_t\}$, hence $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(x^{\ell}_t)$. Therefore, on unmasked positions, the denoiser no longer contains enough information to reconstruct the leave-one-out posterior, so the inverse formula is not available.
Remark 2. When these conversion formulas hold in both directions, as they do for UDM, the uniqueness of the denoiser as a minimizer of...
-
[48]
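$p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t)$. (29)
This already shows that the score can be parameterized from the denoiser. The same quantity can also be expressed directly in terms of the leave-one-out denoiser:
$$\langle y^{\ell}, s_t(x_t)^{\ell} \rangle = \frac{q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}{q^{\ell}_{t|0}(x^{\ell}_t \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}, \tag{30}$$
for every $y \in X$ such that $y^{-\ell} = x^{-\ell}_t$.
Proof. By definition of the score, for every $y \in X$ such that $y^{-\ell} =$...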
-
[49]
$= p_t(x^{-\ell}_t)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)$. Therefore,
$$p_t(y) = p_t(x^{-\ell}_t) \sum_{x^{\ell}_0 \in V} q^{\ell}_{t|0}(y^{\ell} \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t).$$
Since the map $\nu \mapsto q^{\ell}_{t|0}(y^{\ell} \mid \nu)$ is affine in its first argument, this becomes
$$p_t(y) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell}).$$
Taking $y = x_t$ yields in the same way
$$p_t(x_t) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(x^{\ell}_t \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell}).$$
Dividing the two equations gives $\langle$...
-
[50]
also considers a LOO denoiser parameterization which follows from Proposition 6. Therefore, if $\hat{x}^{\theta}_0(x_t, t)$ is a model for the leave-one-out denoiser, one may use the parameterization
$$p^{\theta}(x_t, t) = \alpha_t\, \hat{x}^{\theta}_0(x_t, t) + (1-\alpha_t)\, \pi^{\ell}. \tag{35}$$
The same restriction is still needed after this reparameterization. If $\hat{x}^{\theta}_0(x_t, t)^{\ell}$ may depend on $x^{\ell}_t$, the minimizer of (34) ma...
-
[51]
$= p_t(x^{-\ell}_t)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)$. Therefore
$$p_t(y) = p_t(x^{-\ell}_t) \sum_{x^{\ell}_0 \in V} q^{\ell}_{t|0}(y^{\ell} \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t).$$
Figure 5: Comparison between the leave-one-out posterior $\hat{x}^{\mathrm{loo}}_{0|t}(x_t)$ and the denoising posterior $\hat{x}_{0|t}(x_t)$.
Using the affine extension of the forward kernel in its first argument, this becomes $p_t(y) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)$...
-
[52]
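$\mathrm{Cat}(x^{\ell}_{t_{i+1}}; \mathbf{1}/K) + \mathbb{1}_{\tau^{\ell} > t_{i+1}} \sum_{x^{\ell}_{t_{i+1}}} \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_{t_{i+1}})\, \mathrm{Cat}(x^{\ell}_{t_{i+1}}; x^{\ell}_0) = \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathrm{Cat}(x^{\ell}_{t_i}; \mathbf{1}/K) + \mathbb{1}_{t_i < \tau^{\ell} \le t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}$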
-
[53]
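$+ \mathbb{1}_{\tau^{\ell} > t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_0) = \mathrm{Cat}\!\left(x^{\ell}_{t_i};\ \mathbb{1}_{\tau^{\ell} > t_i}\, x^{\ell}_0 + \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathbf{1}/K\right)$. This proves the induction step. Hence (50) holds for every time of the grid, and therefore for every $t$.
Lemma 3. If
$$\tilde{x}_t(\tau)^{\ell} := \begin{cases} x^{\ell}_t & \text{if } \tau^{\ell} > t, \\ m & \text{if } \tau^{\ell} \le t, \end{cases}$$
then $p_{0|t}(x_0 \mid x_t, \tau) = p^{\mathrm{mask}}_{0|t}(x_0 \mid \tilde{x}_t(\tau))$.
Proof. By Bayes’ rule and (50), $p_{0|t}(x_0 \mid x_t, \tau) = p_0(x_0)\, q_{t|0}(x_t \mid x_0, \tau) \,/\, \sum_{\tilde{x}_0} p_0(\tilde{x}_0)\, q_{t|0}(x_t \mid \tilde{x}$...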
-
[54]
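The law $j_{s|0}$ exactly reweights these two possibilities. The lifted reverse transition is then
$$\bar{p}_{s|t}(x_s, \tau_s \mid x_t, \tau_t) := \sum_{x_0} j_{s|0}(\tau_s \mid x_0, x_s)\, q_{s|0,t}(x_s \mid x_0, x_t)\, p_{0|t}(x_0 \mid x_t, \tau_t). \tag{53}$$
For a grid $0 = t_0 < \cdots < t_n = 1$, let $\bar{p}_{0:n}$ denote the corresponding path law, with initialization $\bar{p}_{t_n}(x_{t_n}, \tau_{t_n}) := p_{t_n}(x_{t_n})\, j_{t_n}(\tau_{t_n} \mid x_{t_n})$. If $\alpha_{t_n} = 0$, this reduces to $X_{t_n} \sim \upsilon^{\otimes L}$ a...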
-
[55]
and later used by [38]. The idea is to combine two autoregressive streams, one left-to-right and one right-to-left, and to offset the representations so that the output at position $\ell$ never attends to the input token at the same position, while still depending on all the other positions. Equivalently, in the continuous relaxation used to describe the archite...
-
[56]
at fixed checkpoint, and plotting generative perplexity against the resulting entropy. For top-$p$ frontiers we use $p \in \{0.80, 0.85, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00\}$ and $\mathrm{NFE} \in \{8, 16, 32, 64, 128, 256, 512, 1024\}$. For temperature frontiers we use $T \in \{0.80, 0.82, \ldots, 1.10\}$ over the same NFE grid. The predictor-corrector experiments are run on OWT with the confidence-based ...