Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs

Hongliang He; Lun Du; Shuai Zhang; Zhenzhong Lan; Zijie Chen

arxiv: 2607.00588 · v1 · pith:4T2YOVWQnew · submitted 2026-07-01 · 💻 cs.CL

Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs

Shuai Zhang , Zijie Chen , Hongliang He , Lun Du , Zhenzhong Lan This is my paper

Pith reviewed 2026-07-02 13:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords diffusion language modelsrepetitionself-conditioningattractorgenerative perplexityACEcontinuous diffusion

0 comments

The pith

Continuous diffusion language models repeat because a single direction in the self-conditioning feedback loop forms a contractive attractor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continuous diffusion language models report record-low generative perplexity yet produce far more repetition than human text. The paper traces this to a one-dimensional contractive attractor inside the self-conditioning loop that feeds each clean estimate forward to the next step. Because the defect is confined to one direction, subtracting that single label-free direction at every step restores near-human repetition rates. The correction, estimated once on the 105M model, transfers unchanged to 342M and 652M models and across samplers while preserving competitive quality. Since generative perplexity itself rewards repetition, the authors instead track the compute required to reach human-clean output, where the fix proves 1.5 to 5 times cheaper.

Core claim

The repetition arises from a contractive attractor along a single direction in the self-conditioning feedback loop. ACE subtracts that single, label-free direction from the feedback at each step. Estimated once on the 105M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the 342M and 652M models and across samplers; the same recipe recovers useful directions on other architectures.

What carries the argument

The contractive attractor along a single direction in the self-conditioning feedback loop, countered by subtracting that direction at each denoising step.

If this is right

Subtracting the identified direction reduces repetition to near human levels across tested scales.
The same direction works without retraining when models grow from 105M to 652M parameters.
The correction remains effective when switching between different sampling procedures.
The method recovers analogous directions on other model architectures.
Compute cost to reach human-clean text drops by a factor of 1.5 to 5 compared with baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standard generative perplexity may systematically favor models that collapse to repetitive modes in diffusion settings.
One-dimensional attractors could appear in the conditioning loops of other non-autoregressive generators.
Estimating and removing such directions once on small models offers a cheap way to improve larger ones.
The stability across scales hints that the attractor arises from a scale-invariant property of the training objective.

Load-bearing premise

The direction estimated once on the 105M model stays stable enough to transfer to larger models and different samplers without degrading other aspects of generation quality.

What would settle it

Measuring repetition rates on the 342M or 652M model after subtracting the 105M-derived direction and finding no reduction or a drop in quality metrics would falsify the transfer result.

Figures

Figures reproduced from arXiv: 2607.00588 by Hongliang He, Lun Du, Shuai Zhang, Zhenzhong Lan, Zijie Chen.

**Figure 1.** Figure 1: Repetition is a basin; ACE escapes it. Even as ELF denoises toward a clean sample, self-conditioning drags its representation u along one direction d (the high- minus low-repetition gap) into a repetition state u ⋆ ; the baseline slides into this basin (red), while ACE subtracts d to hold u in the human-clean zone (blue). Background: measured repetition rate. that sharper sampling only deepens. This is spe… view at source ↗

**Figure 2.** Figure 2: Repetition concentrates on one dominant mode v1, with which the cheap d partially aligns. (a) mean feedback projection onto the repetition axis v1 (the leading Jacobian eigenvector, read once the basin has formed, at trajectory fraction ∼0.85) vs final repetition. (b) as the basin forms a dominant mode emerges (µ1/µ2); the cheap d partially aligns with this local Jacobian mode while integrating basin-entry… view at source ↗

**Figure 3.** Figure 3: ACE cancels the repetition drift. Self-conditioning feedback drifts along a single direction d into a repetition basin (a); ACE subtracts d and cancels that drift, holding repetition down (b). Baseline (red) vs steered (blue), ELF-B [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Baseline repetition grows steeply with steps and guidance; steering suppresses it throughout. Baseline (red) vs single d (λ=2, blue) over (a) denoising steps and (b) guidance scale. before the next step, and a fixed λ≈2 that pre-compensates the amplification does better (Tab. 9). The dose matches the loop’s Jacobian gain to an order of magnitude; the full sweep and usable window are in App. C.2. 6 THE DEFE… view at source ↗

**Figure 5.** Figure 5: Steering has a finite usable window, as the theory predicts. λ-sweep on ELF-B: repetition (blue), non-words (red, marking the latent leaving the token-embedding manifold); shaded: repetition below the human bar, non-words at baseline [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Steering makes human-clean text 1.5–5× cheaper. Generations to 1000 samples under the human bar, baseline vs steered, by model size (log scale); the per-method head-to-head at one size is Tab. 4. and resets their feedback to 0 for the remaining steps. Full-SC (post-hoc reject) runs the unmodified sampler followed by the same reject-to-1000 loop. ACE instead subtracts one direction, xˆ←xˆ −λd (§5), leaving … view at source ↗

read the original abstract

Continuous diffusion language models such as ELF report record-low generative perplexity (Gen-PPL). We find a catch: these models repeat far more than human text, and Gen-PPL rewards rather than penalizes that repetition, so its low scores overstate quality. Strip the repetition and ELF-B's Gen-PPL rises from $19.5$ to $27.7$; the smallest model even posts the best Gen-PPL because it repeats most. We trace the repetition to its source: a contractive attractor along a \emph{single direction} in the self-conditioning feedback loop, the loop that feeds each step's clean estimate into the next. Because the failure is one-dimensional, a one-dimensional fix suffices, and we propose one. \textbf{ACE} (Attractor-Contrast-Escape) subtracts that single, label-free direction from the feedback at each step. Estimated once on the $105$M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the $342$M and $652$M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, we instead measure the compute each fix needs to produce human-clean text, where ACE is $1.5$--$5\times$ cheaper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows Gen-PPL in continuous diffusion LMs rewards repetition due to a 1D attractor in self-conditioning, and ACE subtraction fixes it with transfer across scales.

read the letter

The main point is that record-low Gen-PPL in models like ELF comes from heavy repetition, and the metric actually favors that repetition rather than penalizing it. Removing the repetition raises Gen-PPL from 19.5 to 27.7 on ELF-B, and the smallest model wins on the metric because it repeats the most. They locate the source in a contractive attractor along one direction in the self-conditioning loop and show that subtracting a single estimated vector (ACE) brings repetition rates down to human levels while keeping other quality measures competitive.

What the paper does well is the direct empirical link between Gen-PPL and repetition, plus the practical demonstration that a one-dimensional correction estimated on the 105M model transfers to the 342M and 652M models and across samplers. The claim that the same recipe recovers useful directions on other architectures is also useful to see. The compute savings for producing human-clean text (1.5–5× cheaper) is a concrete advantage over alternatives.

The soft spot is the mechanistic part. The transfer results are presented as evidence that the direction is stable, but the paper reports only the downstream repetition and quality numbers after applying the 105M-derived vector. It does not show cosine similarity or other direct checks that the direction itself remains the same at larger scales. If the direction rotates or changes with model size, the one-dimensional fix would need per-model re-estimation and the generality claim would weaken. The estimation procedure itself is described as label-free, but more detail on how the direction is isolated and any sensitivity checks would help.

This is for researchers working on diffusion LMs or self-conditioning methods who care about generation reliability and evaluation. A reader focused on non-autoregressive language models would find the repetition diagnosis and the cheap mitigation worth testing. The work shows clear engagement with the mechanism and supplies testable claims, so it deserves a serious referee even with the open question on direction invariance.

Referee Report

2 major / 2 minor

Summary. The paper claims that record-low generative perplexity (Gen-PPL) in continuous diffusion LMs such as ELF stems from excessive repetition, which Gen-PPL rewards rather than penalizes. It traces this to a contractive attractor along a single direction in the self-conditioning feedback loop and proposes ACE, a one-dimensional correction that subtracts a label-free direction estimated once on the 105M model. This reduces repetition to near-human levels, preserves competitive quality, transfers to 342M/652M models and other samplers, and yields 1.5–5× efficiency gains when measured by compute to reach human-clean text (with Gen-PPL rising from 19.5 to 27.7 after repetition removal).

Significance. If the 1D-attractor claim and transfer results hold, the work supplies a mechanistic account of a concrete failure mode together with a simple, falsifiable fix whose effectiveness is quantified by before/after Gen-PPL numbers and efficiency metrics. The label-free estimation and cross-scale transfer constitute reproducible, parameter-light contributions that could be directly tested on other diffusion LMs.

major comments (2)

[Transfer results section] Transfer results section: the claim that the direction 'transfers near-unchanged' rests solely on downstream repetition and quality metrics after application to the 342M and 652M models. No direct verification—such as cosine similarity or Euclidean distance between the direction vector estimated on the 105M model and independently estimated vectors on the larger models—is reported. This measurement is load-bearing for the assertion that a single, scale-invariant direction suffices.
[Gen-PPL analysis] Gen-PPL analysis (abstract and § on evaluation): the statement that 'stripping the repetition' raises Gen-PPL from 19.5 to 27.7 lacks an explicit description of the repetition-identification procedure, the data-exclusion rules, and any statistical significance tests. Without these details the quantitative support for the claim that Gen-PPL rewards repetition remains incomplete.

minor comments (2)

[Methods] The exact optimization or estimation procedure used to obtain the single attractor direction vector should be stated with pseudocode or an equation in the methods section.
[Abstract] Notation for model sizes (105M, 342M, 652M) and sampler names should be introduced once and used consistently; the abstract uses 'ELF-B' without prior definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address each major point below and agree that the requested additions will strengthen the paper.

read point-by-point responses

Referee: [Transfer results section] Transfer results section: the claim that the direction 'transfers near-unchanged' rests solely on downstream repetition and quality metrics after application to the 342M and 652M models. No direct verification—such as cosine similarity or Euclidean distance between the direction vector estimated on the 105M model and independently estimated vectors on the larger models—is reported. This measurement is load-bearing for the assertion that a single, scale-invariant direction suffices.

Authors: We agree that direct vector-level verification would strengthen the transfer claim. In the revision we will independently estimate the direction on the 342M and 652M models, compute cosine similarities and Euclidean distances to the 105M direction, and report these quantities in the Transfer results section. revision: yes
Referee: [Gen-PPL analysis] Gen-PPL analysis (abstract and § on evaluation): the statement that 'stripping the repetition' raises Gen-PPL from 19.5 to 27.7 lacks an explicit description of the repetition-identification procedure, the data-exclusion rules, and any statistical significance tests. Without these details the quantitative support for the claim that Gen-PPL rewards repetition remains incomplete.

Authors: We agree that the repetition-identification procedure requires fuller documentation. The revised manuscript will add an explicit description of the repetition detection and removal method, the data-exclusion rules, and any statistical significance tests associated with the reported Gen-PPL increase from 19.5 to 27.7. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper empirically identifies a contractive direction via data-driven estimation on the 105M model and validates its transfer through direct application and measurement of repetition/quality metrics on 342M/652M models and across samplers. This constitutes an external, falsifiable test against human text baselines rather than any self-definitional loop, fitted input renamed as prediction, or load-bearing self-citation. No equations or claims in the provided text reduce the central result to its own inputs by construction; the transfer experiments supply independent grounding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The ledger contains one fitted direction (estimated on 105M model), one domain assumption that the attractor is one-dimensional and contractive, and one invented entity (the attractor itself) with no independent evidence outside the observed repetition behavior.

free parameters (1)

attractor direction vector
Estimated once from the 105M model to be subtracted at each step; the central fix depends on this fitted quantity.

axioms (1)

domain assumption The observed repetition originates from a contractive attractor along exactly one direction in the self-conditioning feedback loop
This premise is invoked to justify both the diagnosis and the sufficiency of the 1D subtraction fix.

invented entities (1)

one-dimensional self-conditioning attractor no independent evidence
purpose: To explain the source of excessive repetition in continuous diffusion LMs
Postulated on the basis of the repetition behavior; no independent falsifiable prediction or external evidence is supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1520 out tokens · 31306 ms · 2026-07-02T13:15:00.659641+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 12 canonical work pages · 6 internal anchors

[1]

Simple Self-Conditioning Adaptation for Masked Diffusion Models

Michael Cardei, Huu Binh Ta, and Ferdinando Fioretto. Simple self-conditioning adaptation for masked diffusion models.arXiv preprint arXiv:2604.26985,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202, 2022

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202,

work page arXiv
[3]

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Continuous diffusion for categorical data

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffu- sion for categorical data.arXiv preprint arXiv:2211.15089,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca and Alexander Tong. Hacking generative perplexity: Why unconditional text eval- uation needs distributional metrics.arXiv preprint arXiv:2606.08417,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He. Elf: Embedded language flows.arXiv preprint arXiv:2605.10938,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Breaking the Factorization Barrier in Diffusion Language Models

Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, and Anji Liu. Breaking the factor- ization barrier in diffusion language models.arXiv preprint arXiv:2603.00045,

work page arXiv
[8]

A diversity-promoting objective function for neural conversation models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. InProceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 110–119,

2016
[9]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 1797–1807,

2018
[10]

Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models.arXiv preprint arXiv:2604.02718,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547,

Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin. Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547,

work page arXiv
[12]

Activation steering for masked diffusion language models.arXiv preprint arXiv:2512.24143,

Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, and Raz Lapid. Activation steering for masked diffusion language models.arXiv preprint arXiv:2512.24143,

work page arXiv
[13]

Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236,

Robin Strudel, Corentin Tallec, Florent Altch ´e, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236,

work page arXiv
[14]

Perplexity from plm is unreliable for evaluating text quality.arXiv preprint arXiv:2210.05892,

Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. Perplexity from plm is unreliable for evaluating text quality.arXiv preprint arXiv:2210.05892,

work page arXiv
[15]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. InInternational Conference on Learning Representations, volume 2025, pp. 63186– 63227,

2025
[16]

Property ofdat the operating point Value Aligned with the feedback’s leading PC,|cos(d,PC 1)|0.73 Energy ofdin PC 1 / top-5PCs0.53/0.82 Estimate stable from few samples,cos(d n,d full)atn=50/200/800 0.69/0.94/1.00 Table 13:Non-words are a second, decode-axis defect, invisible to Gen-PPL.Out-of-dictionary tokens; top block: main-text generations; bottom: c...

1992
[17]

23 Preprint Table 17:The defect and the fix in one sample, along the trajectory.Same seed as Figure 3, baseline vs steered (λ=2); highlighted=the cell’s dominant looped4-gram; [

Table 17 shows one draw under baseline and steering at three trajectory stages: the baseline stays locked in the basin at 38–45%repeated4-grams, while the steered run stays near the human bar (1.0–3.1%). 23 Preprint Table 17:The defect and the fix in one sample, along the trajectory.Same seed as Figure 3, baseline vs steered (λ=2); highlighted=the cell’s ...

2025
[18]

and gameable end-to-end by naive samplers (Franca & Tong, 2026); this line is mostly on discrete or masked models. The decoderline targets per-position rounding and independence with stronger decoders (Li et al., 2022; Dieleman et al., 2022; Li et al., 2026; Shen et al., 2026): it addresses the discretization bottleneck, not the self-conditioning feedback...

2026
[19]

are poorly placed to reach it, whereas the orthogonal non-word defect is amenable to AR-style contextual decoding (App. C). Relation to masked diffusion.Two concurrent threads on masked diffusion bound our claims. Cardei et al. (2026) add self-conditioning to masked diffusion and report a large drop in generative perplexity without an explicit repetition ...

2026

[1] [1]

Simple Self-Conditioning Adaptation for Masked Diffusion Models

Michael Cardei, Huu Binh Ta, and Ferdinando Fioretto. Simple self-conditioning adaptation for masked diffusion models.arXiv preprint arXiv:2604.26985,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202, 2022

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202,

work page arXiv

[3] [3]

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Continuous diffusion for categorical data

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffu- sion for categorical data.arXiv preprint arXiv:2211.15089,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca and Alexander Tong. Hacking generative perplexity: Why unconditional text eval- uation needs distributional metrics.arXiv preprint arXiv:2606.08417,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, and Kaiming He. Elf: Embedded language flows.arXiv preprint arXiv:2605.10938,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Breaking the Factorization Barrier in Diffusion Language Models

Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, and Anji Liu. Breaking the factor- ization barrier in diffusion language models.arXiv preprint arXiv:2603.00045,

work page arXiv

[8] [8]

A diversity-promoting objective function for neural conversation models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. InProceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 110–119,

2016

[9] [9]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 1797–1807,

2018

[10] [10]

Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models.arXiv preprint arXiv:2604.02718,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547,

Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin. Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547,

work page arXiv

[12] [12]

Activation steering for masked diffusion language models.arXiv preprint arXiv:2512.24143,

Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, and Raz Lapid. Activation steering for masked diffusion language models.arXiv preprint arXiv:2512.24143,

work page arXiv

[13] [13]

Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236,

Robin Strudel, Corentin Tallec, Florent Altch ´e, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236,

work page arXiv

[14] [14]

Perplexity from plm is unreliable for evaluating text quality.arXiv preprint arXiv:2210.05892,

Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. Perplexity from plm is unreliable for evaluating text quality.arXiv preprint arXiv:2210.05892,

work page arXiv

[15] [15]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. InInternational Conference on Learning Representations, volume 2025, pp. 63186– 63227,

2025

[16] [16]

Property ofdat the operating point Value Aligned with the feedback’s leading PC,|cos(d,PC 1)|0.73 Energy ofdin PC 1 / top-5PCs0.53/0.82 Estimate stable from few samples,cos(d n,d full)atn=50/200/800 0.69/0.94/1.00 Table 13:Non-words are a second, decode-axis defect, invisible to Gen-PPL.Out-of-dictionary tokens; top block: main-text generations; bottom: c...

1992

[17] [17]

23 Preprint Table 17:The defect and the fix in one sample, along the trajectory.Same seed as Figure 3, baseline vs steered (λ=2); highlighted=the cell’s dominant looped4-gram; [

Table 17 shows one draw under baseline and steering at three trajectory stages: the baseline stays locked in the basin at 38–45%repeated4-grams, while the steered run stays near the human bar (1.0–3.1%). 23 Preprint Table 17:The defect and the fix in one sample, along the trajectory.Same seed as Figure 3, baseline vs steered (λ=2); highlighted=the cell’s ...

2025

[18] [18]

and gameable end-to-end by naive samplers (Franca & Tong, 2026); this line is mostly on discrete or masked models. The decoderline targets per-position rounding and independence with stronger decoders (Li et al., 2022; Dieleman et al., 2022; Li et al., 2026; Shen et al., 2026): it addresses the discretization bottleneck, not the self-conditioning feedback...

2026

[19] [19]

are poorly placed to reach it, whereas the orthogonal non-word defect is amenable to AR-style contextual decoding (App. C). Relation to masked diffusion.Two concurrent threads on masked diffusion bound our claims. Cardei et al. (2026) add self-conditioning to masked diffusion and report a large drop in generative perplexity without an explicit repetition ...

2026