Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

Grigory Bartosh; Javier Zazo; Sushrut Karmalkar; Teodora Pandeva

arxiv: 2605.18204 · v1 · pith:LZKXYAL5new · submitted 2026-05-18 · 📊 stat.ML · cs.LG

Grigory Bartosh, Teodora Pandeva, Sushrut Karmalkar, Javier Zazo

Pith reviewed 2026-05-20 00:21 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords discrete diffusion · learnable forward process · non-Markovian diffusion · few-step generation · factorized reverse process · variational training · generative modeling

The pith

By learning the forward noising process as non-Markovian, discrete diffusion keeps its reverse process factorized yet matches the target in fewer steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard discrete diffusion fixes a Markovian forward chain, which forces the factorized reverse process to need many steps to approximate the data distribution well. This paper instead makes the forward process itself learnable in a non-Markovian formulation, with trainable marginal and posterior distributions. The reverse process can therefore stay factorized while still matching the target that the learned noising defines. End-to-end training under the usual variational objective then produces higher-quality samples for any fixed number of sampling steps. A sympathetic reader would care because the change promises to cut the expensive long sampling runs that currently limit discrete diffusion on text, images, and other discrete data.

Core claim

Forward-Learned Discrete Diffusion replaces the fixed Markovian forward chain with a learnable non-Markovian noising process whose marginal and posterior distributions are also optimized. This construction keeps the generative reverse process factorized while allowing it to match the target distribution induced by the learned forward process. All parameters are trained jointly under the standard variational bound, and experiments across benchmarks show that the resulting model yields higher-quality samples than conventional discrete diffusion when both use the same reverse parameterization and the same small number of sampling steps.
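
The few-step limitation of a factorized reverse process can be seen on a toy case. The following is a minimal sketch, not taken from the paper: it assumes two binary tokens that always agree and a single reverse step from an uninformative start (for example, a fully masked state), so the model has to emit each token independently. The names `target`, `token_marginals`, `factorized_joint`, and `total_variation` are illustrative only.

```python
from itertools import product

# Toy target (an assumption for illustration): two binary tokens that always
# agree, i.e. sequences 00 and 11 each with probability 1/2.
target = {(0, 0): 0.5, (1, 1): 0.5}


def token_marginals(joint, num_tokens=2, vocab=(0, 1)):
    """Per-position marginals of a joint distribution over token tuples."""
    marginals = [{v: 0.0 for v in vocab} for _ in range(num_tokens)]
    for seq, prob in joint.items():
        for pos, tok in enumerate(seq):
            marginals[pos][tok] += prob
    return marginals


def factorized_joint(marginals, vocab=(0, 1)):
    """Joint distribution implied by sampling every position independently."""
    joint = {}
    for seq in product(vocab, repeat=len(marginals)):
        prob = 1.0
        for pos, tok in enumerate(seq):
            prob *= marginals[pos][tok]
        joint[seq] = prob
    return joint


def total_variation(p, q):
    """Total variation distance between two distributions stored as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


# The maximum-likelihood factorized fit is the product of the target marginals.
one_step = factorized_joint(token_marginals(target))
tv_gap = total_variation(target, one_step)
print(one_step)
print(tv_gap)
```

Under these assumptions the factorized fit spreads its mass uniformly over all four sequences, leaving a total variation gap of 0.5 from the target. Taking more steps, or changing the forward process so that the per-step target is itself closer to factorized, are the two ways to shrink such a gap.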

What carries the argument

A learnable non-Markovian forward process whose marginal and posterior distributions are optimized so the factorized reverse process matches the induced target.

If this is right

For any fixed number of sampling steps the model produces higher-quality samples than conventional discrete diffusion using the same reverse parameterization.
The generative process remains factorized while matching the target distribution induced by the learned noising.
All forward and reverse parameters are trained end-to-end under the standard variational objective.
The gap between the model distribution and the target is reduced, enabling few-step generation without altering the reverse form.
The improvement holds across multiple benchmarks when the reverse parameterization is held constant.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same learnable-forward idea could be tested in continuous diffusion to see whether it likewise shortens the required number of denoising steps.
Because the forward schedule is now data-driven, the method might automatically adapt diffusion length to dataset complexity without manual tuning.
Combining the approach with existing acceleration tricks such as step distillation could compound the reduction in sampling cost.
If the learned non-Markovian structure proves stable, it may extend naturally to other discrete generative settings such as molecular graphs or symbolic sequences.

Load-bearing premise

A non-Markovian formulation with learnable marginal and posterior distributions lets the factorized generative process match the target defined by the learned noising process.

What would settle it

On a standard benchmark, measure sample quality at a fixed small number of reverse steps; if the learned-forward model does not outperform the fixed-forward baseline under identical reverse parameterization, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.18204 by Grigory Bartosh, Javier Zazo, Sushrut Karmalkar, Teodora Pandeva.

**Figure 2.** Learned dynamics for Binarized MNIST dataset. Generative process starts from prior

Original abstract

Discrete diffusion models are a powerful class of generative models with strong performance across many domains. For efficiency, however, discrete diffusion typically parameterizes the generative (reverse) process with factorized distributions, which makes it difficult for the model to learn the target process in a small number of steps and necessitates a long, computationally expensive sampling procedure. To reduce the gap between the target and model distributions and enable few-step generation, we propose Forward-Learned Discrete Diffusion (FLDD), which introduces discrete diffusion with a learnable forward (noising) process. Rather than fixing a Markovian forward chain, we adopt a non-Markovian formulation with learnable marginal and posterior distributions. This allows the generative process to remain factorized while matching the target defined by the noising process. We train all parameters end-to-end under the standard variational objective. Experiments on various benchmarks show that, for a given number of sampling steps, our approach produces higher-quality samples than conventional discrete diffusion models using the same reverse parameterization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's move to learn a non-Markovian forward process is a clean idea for tightening few-step discrete diffusion, but the consistency of the learned marginals and posteriors is the part that needs checking.

The main thing here is that they drop the usual fixed Markov forward chain and instead learn the marginal and posterior distributions directly in a non-Markovian setup. This keeps the reverse process factorized while letting the target be defined by the learned noising process, and they optimize everything jointly under the standard variational objective. That is the actual novelty relative to the fixed-forward literature they cite. The experiments claim better sample quality at low step counts on standard benchmarks using the same reverse parameterization, which would be practically useful if the numbers hold.

Referee Report

1 major / 1 minor

Summary. The paper proposes Forward-Learned Discrete Diffusion (FLDD), which replaces the fixed Markovian forward process in discrete diffusion models with a non-Markovian formulation whose marginal q(x_t | x_0) and posterior q(x_{t-1} | x_t, x_0) are learned jointly with the reverse process. The claim is that this allows the unchanged factorized reverse parameterization to match a better-defined target distribution under the standard variational objective, yielding higher-quality samples for any fixed number of sampling steps. Experiments on various benchmarks are reported to support the improvement over conventional discrete diffusion.

Significance. If the learned forward distributions remain internally consistent and the reported gains prove robust across datasets and metrics, the method could improve the practical efficiency of discrete diffusion by reducing the number of reverse steps required without altering the reverse network architecture. The end-to-end training under a standard variational bound and the direct comparison to fixed-forward baselines constitute concrete, falsifiable contributions.

major comments (1)

[Method (non-Markovian forward process and training objective)] The non-Markovian formulation defines learnable marginal and posterior distributions whose consistency is required for the target distribution to be well-defined. The manuscript does not describe an explicit consistency constraint, reparameterization, or regularization term that enforces q(x_t | x_0) = ∫ q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0) dx_{t-1} during joint optimization. Without such a mechanism, the variational objective may optimize an ill-posed target, so that any observed improvement cannot be attributed to better target matching by the factorized reverse process. This issue is load-bearing for the central claim.

minor comments (1)

[Abstract] The abstract refers to 'various benchmarks' without naming the datasets or tasks; adding this information would allow readers to assess the breadth of the empirical support.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying a key aspect of our non-Markovian formulation that requires clarification. We address the major comment below.

Referee: The non-Markovian formulation defines learnable marginal and posterior distributions whose consistency is required for the target distribution to be well-defined. The manuscript does not describe an explicit consistency constraint, reparameterization, or regularization term that enforces q(x_t | x_0) = ∫ q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0) dx_{t-1} during joint optimization. Without such a mechanism, the variational objective may optimize an ill-posed target, so that any observed improvement cannot be attributed to better target matching by the factorized reverse process. This issue is load-bearing for the central claim.

Authors: We agree that consistency between the learned marginal q(x_t | x_0) and posterior q(x_{t-1} | x_t, x_0) is essential for the target distribution to be well-defined under the non-Markovian forward process. The current manuscript does not explicitly describe a consistency constraint, reparameterization, or regularization term enforcing the marginalization condition. This omission leaves open the possibility that the variational objective optimizes an ill-posed target, which would weaken attribution of gains to improved target matching by the factorized reverse process. To resolve this, we will revise the Methods section to add a dedicated paragraph and accompanying equations that specify how consistency is maintained. In the revised version we will introduce a lightweight consistency regularization term (estimated via Monte Carlo sampling over the learned posterior) that is added to the standard variational objective with a small fixed coefficient; we will also describe the parameterization chosen for the marginal and posterior so that the integral relation holds by construction wherever possible. We will report an ablation confirming that removing this term degrades performance, thereby supporting that the observed improvements stem from a well-defined target.

revision: yes
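
The marginalization condition debated in this exchange can be checked numerically on small categorical distributions. The sketch below is an illustration under stated assumptions, not the paper's mechanism or the regularizer promised in the rebuttal. For discrete states the integral becomes a sum, and the check is written in the equivalent posterior form, sum over x_t of q(x_{t-1} | x_t, x_0) q(x_t | x_0) = q(x_{t-1} | x_0). The helper names and the toy numbers are hypothetical.

```python
def pushforward(marg_t, post):
    """Return sum_j q(x_t=j | x_0) * q(x_{t-1}=k | x_t=j, x_0) for each k."""
    num_states = len(post[0])
    return [
        sum(marg_t[j] * post[j][k] for j in range(len(marg_t)))
        for k in range(num_states)
    ]


def consistency_gap(marg_t, marg_prev, post):
    """Largest absolute violation of the marginalization condition."""
    implied = pushforward(marg_t, post)
    return max(abs(a - b) for a, b in zip(implied, marg_prev))


# Hypothetical marginals for one data point x_0 over a 3-state vocabulary.
marg_t = [0.2, 0.3, 0.5]     # q(x_t | x_0), the noisier step
marg_prev = [0.6, 0.3, 0.1]  # q(x_{t-1} | x_0), the cleaner step

# Consistent by construction: the posterior ignores x_t (independent coupling).
post_independent = [list(marg_prev) for _ in marg_t]

# Inconsistent: the posterior copies x_t, so it implies marg_t, not marg_prev.
post_copy = [[1.0 if j == k else 0.0 for k in range(3)] for j in range(3)]

gap_ok = consistency_gap(marg_t, marg_prev, post_independent)
gap_bad = consistency_gap(marg_t, marg_prev, post_copy)
print(gap_ok, gap_bad)
```

The independent coupling satisfies the condition up to floating-point error, while the copying posterior violates it. A quantity of this kind could serve as a diagnostic, or as a penalty, when marginals and posteriors are parameterized separately.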

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent learnable components

The paper introduces newly learnable non-Markovian marginal and posterior distributions for the forward process, optimized jointly with the reverse model under the standard variational objective. This formulation does not reduce by construction to any pre-fitted quantity or prior result, as the learnable forward objects are defined and trained from scratch rather than derived from existing parameters. Experimental comparisons use the same reverse parameterization against conventional discrete diffusion baselines, providing external validation. No load-bearing self-citations, uniqueness theorems from the authors, or ansatzes smuggled via prior work are required for the central claim, making the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the standard variational lower bound for diffusion models plus the new assumption that jointly optimizing forward and reverse parameters will produce a useful non-Markovian schedule.

free parameters (1)

parameters of learnable marginal and posterior distributions
These are optimized end-to-end from data rather than preset.

axioms (1)

domain assumption: The variational objective remains valid when both forward and reverse processes are parameterized and learned jointly.
Standard in diffusion literature but extended to a learnable forward process.
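
A short derivation may help show why this assumption is plausible. The bound below follows from Jensen's inequality for any valid forward distribution, fixed or learned. It is a sketch in the x_t notation of the referee summary, assuming the forward joint factorizes into a terminal marginal and a chain of posteriors and that the reverse process is a Markov chain. The symbols p(x_T), p(x_{t-1} | x_t), and p(x_0 | x_1) are assumed notation, not taken from the paper.

```latex
% Assumed notation (not from the source): p(x_T) is the prior, p(x_{t-1} | x_t)
% the factorized reverse step, and the forward joint is
%   q(x_{1:T} | x_0) = q(x_T | x_0) \prod_{t=2}^{T} q(x_{t-1} | x_t, x_0).
\begin{align}
-\log p(x_0)
  &= -\log \sum_{x_{1:T}} q(x_{1:T} \mid x_0)\,
     \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)} \\
  &\le \mathbb{E}_{q(x_{1:T} \mid x_0)}
     \left[ -\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]
     \quad \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{q(x_1 \mid x_0)}\left[ -\log p(x_0 \mid x_1) \right]
   + \mathrm{KL}\!\left( q(x_T \mid x_0) \,\|\, p(x_T) \right) \\
  &\quad + \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}
     \left[ \mathrm{KL}\!\left( q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t) \right) \right].
\end{align}
```

Here q(x_t | x_0) denotes the marginal of the forward joint. If that marginal is parameterized separately from the posteriors, the last line is valid only when the two agree, which is the consistency question raised in the referee report.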

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: non-Markovian formulation with learnable marginal and posterior distributions... q(z_{0:T} | x) = q(z_T | x) ∏ q(z_s | z_t, x)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Maximum Coupling... u^j_{s|t} = min(u^k_s,u^k_t)/u^k_t ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

