Subliminal Clocks: Latent Time Modelling in Diffusion Language Models

Maximo Rulli (1), Thomas Fontanari (1), Simone Petruzzi (1), Federico Alvetreti (1), Giorgio Strano (1), Donato Crisostomi (1), Giorgos Nikolaou (2), Tommaso Mencattini (2), Andrea Santilli (3), Emanuele Rodolà (1), Simone Scardapane (1), Alessio Devoto (3)

(1) Sapienza University of Rome; (2) EPFL; (3) Independent Researcher

arxiv: 2607.01774 · v1 · pith:OIYAZUX4new · submitted 2026-07-02 · 💻 cs.AI · cs.CL

Pith reviewed 2026-07-03 13:45 UTC · model grok-4.3

keywords: diffusion language models, latent timestep, residual streams, probing, steering, denoising progress, activation geometry

The pith

Diffusion language models encode a latent representation of denoising progress inside their residual streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether diffusion language models, which generate text without explicit timestep inputs, still track how far along the denoising process they are. It shows that a signal tied to the diffusion timestep lives in the model's residual streams and can be read out by probes at multiple layers. Steering activations along the low-dimensional direction tied to this signal changes the model's output confidence and entropy in consistent ways. The work maps the geometry of the representation to reveal how the signal is organized and used.

Core claim

DLMs encode a latent representation related to the diffusion timestep within their residual streams. This signal can be reliably extracted using probes across layers, indicating that denoising progress is decodable from internal activations. Steering the model along a low-dimensional subspace associated with the inferred timestep systematically modulates its notion of denoising progress, leading to predictable changes in model confidence and entropy. The representation exhibits structured and interpretable properties in activation space.

What carries the argument

Linear probes that extract a timestep-related direction from residual-stream activations, together with the low-dimensional subspace used to steer that direction.
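
As a rough illustration of what a linear probe for denoising progress might look like, the sketch below fits a closed-form ridge regression from activations to τ and scores it with R² = 1 − MSE / Var(τ). Under that standard definition, an MSE equal to half the variance of τ corresponds to R² = 0.5. Everything here is an assumption for illustration: the activations are synthetic, and `fit_ridge_probe`, `r2_score`, the regularisation strength and the dimensions are hypothetical choices, not the paper's probe or data.

```python
import numpy as np

def fit_ridge_probe(H, tau, lam=1e-2):
    """Closed-form ridge regression from activations H (n, d) to tau (n,)."""
    mu = H.mean(axis=0)
    Hc = H - mu                                   # centre the features
    d = H.shape[1]
    w = np.linalg.solve(Hc.T @ Hc + lam * np.eye(d), Hc.T @ (tau - tau.mean()))
    b = tau.mean() - mu @ w                       # intercept for uncentred inputs
    return w, b

def r2_score(tau_true, tau_pred):
    """Coefficient of determination: 1 - MSE / Var(tau)."""
    mse = np.mean((tau_true - tau_pred) ** 2)
    return 1.0 - mse / np.var(tau_true)

# Synthetic stand-in for residual-stream activations (NOT real model data):
# tau is written along one hidden direction and buried in isotropic noise.
rng = np.random.default_rng(0)
n, d = 2000, 64
tau = rng.uniform(0.0, 1.0, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
H = 3.0 * np.outer(tau, direction) + rng.normal(scale=0.5, size=(n, d))

train, test = slice(0, 1500), slice(1500, None)
w, b = fit_ridge_probe(H[train], tau[train])
r2 = r2_score(tau[test], H[test] @ w + b)
cosine = (w @ direction) / np.linalg.norm(w)      # alignment with the planted direction
print(f"held-out R2 = {r2:.3f}, cosine with planted direction = {cosine:.3f}")
```

The held-out split matters: a probe scored on its own training data can look better than it is, which is why the settling test in this review asks for new samples.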

If this is right

Denoising progress is decodable from internal activations across layers.
Steering along the timestep subspace produces systematic, predictable shifts in model confidence and entropy.
The representation shows structured geometry in activation space that can be interpreted.
The model maintains an implicit notion of time that influences its generation behavior.
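
The downstream quantities named above can be computed from output logits in a few lines. This is a minimal sketch under stated assumptions: "confidence" is taken to be the maximum softmax probability, the KL divergence is taken from the clean to the steered distribution, and the logits are made-up numbers rather than model outputs. The paper may define these quantities differently.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilise before exponentiating
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

def confidence(logits):
    """Assumed definition: probability of the most likely token."""
    return np.exp(log_softmax(logits)).max(axis=-1)

def kl_divergence(logits_p, logits_q):
    """KL(p || q) between the two softmax distributions."""
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

# Illustrative logits only: the "steered" row is a sharpened copy of the clean one.
clean = np.array([[2.0, 0.5, 0.1, -1.0]])
steered = 2.0 * clean

delta_entropy = entropy(steered) - entropy(clean)
delta_confidence = confidence(steered) - confidence(clean)
kl = kl_divergence(clean, steered)
print(delta_entropy, delta_confidence, kl)
```

Reporting the differences against an unsteered run, rather than the raw values, is what lets a shift be attributed to the intervention.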

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The latent signal could be used to control generation length or quality without changing the training procedure.
Similar internal clocks may appear in other diffusion or iterative generative models.
Interventions on this subspace might offer a new way to debug or regularize sampling trajectories.

Load-bearing premise

The direction found by the probes corresponds to a causal encoding of denoising progress rather than a byproduct of other generation statistics.

What would settle it

Steering along the identified subspace produces no consistent shift in the entropy or confidence of the model's token predictions, or the probe fails to predict the true timestep on new samples from the same model.
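
A test of this kind needs a baseline perturbation to compare against. The sketch below builds a random control vector with the same L2 norm as the steering vector, so that any difference in downstream effect cannot be put down to perturbation size alone. The Figure 5 text mentions random perturbations with matched norm; the orthogonalisation step, the helper name `matched_norm_random` and the random stand-in vectors are assumptions added here for illustration.

```python
import numpy as np

def matched_norm_random(v, rng, orthogonal=True):
    """Random vector with the same L2 norm as v, optionally orthogonal to v."""
    r = rng.normal(size=v.shape)
    if orthogonal:
        r -= (r @ v) / (v @ v) * v          # remove the component along v
    return r * (np.linalg.norm(v) / np.linalg.norm(r))

rng = np.random.default_rng(0)
steer = rng.normal(size=128)     # stand-in for a timestep steering vector
h = rng.normal(size=128)         # stand-in for one residual-stream activation

control = matched_norm_random(steer, rng)
h_steered = h + steer            # intervention under test
h_control = h + control          # size-matched baseline intervention
print(np.linalg.norm(steer), np.linalg.norm(control), control @ steer)
```

Both perturbed activations would then be pushed through the rest of the model and compared on the same downstream metrics.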

Figures

Figures reproduced from arXiv: 2607.01774 by Maximo Rulli, Thomas Fontanari, Simone Petruzzi, Federico Alvetreti, Giorgio Strano, Donato Crisostomi, Giorgos Nikolaou, Tommaso Mencattini, Andrea Santilli, Emanuele Rodolà, Simone Scardapane, Alessio Devoto.

**Figure 1.** 3D projection of latent denoising step modelling for LLaDA. LLaDA represents its τ subspace as a low-dimensional manifold-like curve, progressively modelling the denoising progress from all [MASK] (red) to no remaining [MASK] (purple). … scale Masked Diffusion Language Models (Nie et al., 2025; Zhu et al., 2026; Ye et al., 2025) have emerged as an alternative paradigm for text generation. Rather than gener…

**Figure 2.** MLP probes recover the τ signal. The R² coefficient (higher is better) degrades as we probe deeper into the model, to the point where the MSE almost reduces to half the variance of τ. Moreover, both [MASK] and non-[MASK] tokens seem to carry this information, attaining similarly high coefficients. We therefore interpret τ_t as an empirical measure of denoising progress and use it as a proxy for the notion …

**Figure 3.** ϕ_l(µ_{t,l}) and µ_{t,l} are highly correlated. For each of the found µ_{t,l} we compute ϕ_l(µ_{t,l}) and plot it against the corresponding τ = t/100. We observe high correlations for both LLaDA (left) and Dream (right). … the part of the representation that is independent of denoising progress and intrinsic to the hidden states of layer l. Based on the results shown in …

**Figure 5.** Average probe drift across layers and steps after steering in LLaDA. The model progressively compensates for the perturbation, reducing the discrepancy between clean and steered representations of τ. … them against Equation (5) with matched norm. Unlike the τ-based perturbations, random perturbations do not induce coherent trends in entropy or confidence across denoising steps. Moreover, their downstrea…

**Figure 6.** PCA distribution and sampled projection on the top-2 principal components of the mean vectors. Left: PCA distribution of the mean vectors for LLaDA; notably, across layers, most of the variance of the identified set of mean vectors can be explained by fewer than three dimensions. Right: sampled projections for Dream. We use the mean vectors obtained from layer 25 of Dream. We observe a parabola-like geomet…

**Figure 7.** Low-dimensional subspace steering in LLaDA. Using Equation (7), we steer the model within the two-dimensional subspace (k = 2) spanned by the top principal components of the layer-29 mean vectors. Steering within the subspace closely resembles the unrestricted one, while the orthogonal perturbation produces incoherent effects. … spanned by the first k principal components and the part orthogonal to it, resc…

**Figure 9.** Average cosine similarity of same-indexed vectors across layers in LLaDA. We observe that most layers maintain a highly correlated representation of the same index t, and that this relationship degrades with distance. Layer 32 instead maintains a representation that is largely independent of the other layers. … studying the cross-layer alignment of the discovered mean vectors. 5.3 Cross-layer τ representati…

**Figure 11.** Average cosine similarity of self-attention and post-W_out vectors. Layer-29 vectors from LLaDA reveal a clear geometric organization over t: representations for distant indices become nearly anti-aligned, and a transition region emerges around t = 50, where vectors are approximately perpendicular. … model correct the signal as shown in Section 4.3. ✓ Takeaway for RQ3: The models represent the mean vectors s…

**Figure 12.** MLP probe architecture. Each block is a Linear→LayerNorm→GELU unit; the probe stacks five such blocks around a single residual connection, and a sigmoid head bounds the output to (0, 1) to match the range of τ. The hidden width w is capped at 1024. … AdamW (Loshchilov and Hutter, 2017) weight optimiser, setting the learning rate to α = 10⁻³ and the weight decay coefficient to 6 × 10⁻⁶. All experiments int…

**Figure 13.** R² coefficients of MLPs in Dream. The R² coefficient slightly degrades as we probe deeper into the model, which remains consistent with the observations of …

**Figure 14.** Mean-steering downstream effects on layer 25 of Dream. We steered the activations in layer 25 using the mean-ratio vectors (blue) targeting different τ values, and measured the variation in entropy, confidence and the KL divergence. We compared it against random perturbations (red). Steering with the mean vectors has an effect that is consistent with what would be expected. Steering across intervention la…

**Figure 15.** Linear probe performance. The linear probes degrade markedly as we probe deeper into the model, more so than the MLP probes shown in …

**Figure 16.** Figure 16 shows the per-step unrolled version of …

**Figure 17.** Low-dimensional geometry of τ's mean vectors in Dream. Left: shared 2D mean-vector trajectory across layers. Right: 3D PCA projection for layer 25. … like trajectory, supporting the idea of a shared geometry of τ across the whole model. Nonetheless, differently from LLaDA, Dream's parabola has a bigger spread at the two endpoints t = 1 and t = 100, implying that, at the boundaries, the geometrical organisa…

**Figure 18.** Cosine-similarity structure of mean vectors in Dream. Left: average similarity across layers. Right: average similarity across intermediate layer components.

**Figure 19.** Mean-steering downstream effects across generation lengths for LLaDA. Ablation over generation length L ∈ {64, 128}, with the number of denoising steps matched to L. Each row shows, left to right, ΔS̄_t, Δc̄_t and KL_t versus the denoising step t, for five steering targets t̂. Vertical axes use a symmetric-log scale to resolve the convergence near 0.

**Figure 20.** Layer-wise mean-steering effects in LLaDA. We apply mean vector steering at different intervention layers and target denoising-progress bins t̂ ∈ {1, 25, 50, 75, 100}. Each panel reports the downstream effect on entropy drift, confidence drift and KL divergence across denoising steps.

**Figure 21.** Layer-wise mean-steering effects in Dream. We apply mean vector steering at different intervention layers and target denoising-progress bins t̂ ∈ {1, 25, 50, 75, 100}. Each panel reports the downstream effect on entropy drift, confidence drift and KL divergence across denoising steps.

**Figure 22.** Low-dimensional steering on LLaDA. LLaDA's mean vectors concentrate around a low-dimensional subspace. Steering across the top-1, top-2, and top-10 principal components yields results similar to using the unprojected mean vector.

**Figure 23.** Low-dimensional steering on Dream. Dream's mean vectors concentrate around a low-dimensional subspace as in LLaDA. Similar to …
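
The probe described in the Figure 12 caption can be sketched as a plain forward pass. This is one possible reading, not the authors' implementation. It assumes the first block projects the activation to the hidden width, the remaining four blocks are wrapped by the single residual connection, LayerNorm has no learnable parameters, and the weights are random and untrained. The names `init_probe` and `probe_forward` and the input size are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without learnable scale/shift (a simplification for this sketch).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def block(x, W, b):
    # One Linear -> LayerNorm -> GELU unit.
    return gelu(layer_norm(x @ W + b))

def init_probe(d_model, width, rng, n_blocks=5):
    width = min(width, 1024)                      # hidden width capped at 1024
    dims = [d_model] + [width] * n_blocks
    blocks = [(rng.normal(scale=1.0 / np.sqrt(i), size=(i, o)), np.zeros(o))
              for i, o in zip(dims[:-1], dims[1:])]
    head = (rng.normal(scale=1.0 / np.sqrt(width), size=(width, 1)), np.zeros(1))
    return blocks, head

def probe_forward(x, params):
    blocks, (W_head, b_head) = params
    h = block(x, *blocks[0])                      # project to the hidden width
    skip = h
    for W, b in blocks[1:]:
        h = block(h, W, b)
    h = h + skip                                  # assumed placement of the residual
    z = (h @ W_head + b_head)[..., 0]
    return 1.0 / (1.0 + np.exp(-z))               # sigmoid head keeps output in (0, 1)

rng = np.random.default_rng(0)
params = init_probe(d_model=256, width=4096, rng=rng)
acts = rng.normal(size=(8, 256))                  # stand-in activations, not model data
tau_hat = probe_forward(acts, params)
print(tau_hat)
```

The sigmoid head is the part the caption is explicit about: it bounds predictions to the same range as τ, so the regression target never has to be clipped.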

Original abstract

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive models. Unlike standard diffusion-based approaches, DLMs are not explicitly conditioned on a timestep, raising a natural question: do these models internally represent denoising progress, and how is such information used downstream? In this work, we show that DLMs do in fact encode a latent representation related to the diffusion timestep within their residual streams. We find that this signal can be reliably extracted using probes across layers, indicating that denoising progress is decodable from internal activations. We further demonstrate that steering the model along a low-dimensional subspace associated with the inferred timestep allows us to systematically modulate its notion of denoising progress, leading to predictable changes in model confidence and entropy. Finally, we analyse the geometry of the identified representation, showing that it exhibits structured and interpretable properties in activation space, and shedding light on how such a signal is processed by these models.
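
To make "steering along a low-dimensional subspace" concrete, the sketch below runs PCA over a set of per-timestep mean vectors, then splits a steering vector into its part inside the top-k principal subspace and the orthogonal remainder, each rescaled to the original norm. The mean vectors are a synthetic parabola-like curve, not model activations, and the rescaling, the helper names `pca_basis` and `split_steering`, and the choice k = 2 are assumptions for illustration. The paper's own equations are not reproduced here.

```python
import numpy as np

def pca_basis(M, k):
    """Top-k principal directions (as rows) of the rows of M, plus variance ratios."""
    Mc = M - M.mean(axis=0)
    _, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    return Vt[:k], ratio

def split_steering(v, basis):
    """Split v into in-subspace and orthogonal parts, each rescaled to the norm of v."""
    inside = basis.T @ (basis @ v)
    outside = v - inside
    norm = np.linalg.norm(v)
    return (inside * norm / np.linalg.norm(inside),
            outside * norm / np.linalg.norm(outside))

# Synthetic mean vectors tracing a noisy parabola in a 64-dimensional space.
rng = np.random.default_rng(0)
T, d = 100, 64
t = np.linspace(-1.0, 1.0, T)
e1, e2 = np.linalg.qr(rng.normal(size=(d, 2)))[0].T   # two orthonormal directions
means = (4.0 * np.outer(t, e1) + 2.0 * np.outer(t ** 2, e2)
         + 0.05 * rng.normal(size=(T, d)))

basis, ratio = pca_basis(means, k=2)
v = means[90] - means[10]                 # toy steering vector between two bins
v_in, v_out = split_steering(v, basis)
print(f"variance explained by top-2 PCs: {ratio[:2].sum():.3f}")
```

Adding `v_in` and `v_out` to an activation in separate runs gives two perturbations of equal size, so any difference in their downstream effect reflects direction rather than magnitude.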

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DLMs encode a decodable latent timestep in activations, but the steering results rest on an untested causal assumption.

The paper shows that diffusion language models carry a signal about denoising progress inside their residual streams even without explicit timestep conditioning. Probes recover this signal across layers, and steering along the extracted direction shifts model confidence and entropy in the direction expected for changed progress. The authors also describe the geometry of the representation as structured.

The new piece is the direct application of probing and activation steering to the latent timestep in DLMs. Most earlier work on diffusion interpretability stays with images, so this is a reasonable move to language models and gives a concrete internal handle on generation state.

The probing result looks like the stronger part. The steering experiments are weaker because the reported metrics can move for many reasons. Changes in confidence and entropy do not by themselves show that the model is using the probed direction to track actual denoising progress rather than some correlated property like token statistics or output length. The abstract supplies no controls that isolate the timestep effect or check whether the effective noise schedule actually changed, so the causal claim stays open.

The geometry analysis is mentioned but not quantified enough to evaluate here.

This is for people working on interpretability or control of non-autoregressive language models. The probing observation is likely to be useful to that group. The steering angle is worth following up but needs tighter evidence before it can be taken as functional control.

It deserves peer review because the core decodability finding is timely and the basic setup is clear enough for referees to give targeted feedback.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion language models (DLMs) internally encode a latent representation of the diffusion timestep within their residual streams despite lacking explicit timestep conditioning. This signal is reliably extractable via probes across layers. Steering along the associated low-dimensional subspace modulates the model's notion of denoising progress, producing predictable shifts in confidence and entropy. The work also examines the geometry of the representation, reporting structured and interpretable properties in activation space.

Significance. If the central empirical claims hold after addressing causality concerns, the results would advance interpretability research on diffusion language models by showing they develop implicit time representations. This could distinguish DLMs from autoregressive models and suggest new mechanisms for controlling generation dynamics. The combination of probing, steering, and geometric analysis follows established interpretability methods but applies them to a timely model class.

major comments (2)

[steering experiments] The steering experiments (described in the abstract) report changes in confidence and entropy after intervening along the probed direction, but these downstream metrics can be altered by many other activation-space directions (e.g., those correlated with token entropy, position, or output length). The manuscript provides no evidence of an intervention that isolates the inferred timestep while holding other generation properties fixed, nor a direct measurement that the effective noise schedule changes. This assumption is load-bearing for the claim that the subspace is a causal encoding of denoising progress that the model functionally uses.
[methods and experiments] No details are supplied on probe training procedures, steering implementation, datasets, statistical controls, or quantitative results (e.g., probe accuracies, steering magnitudes, or baseline comparisons). Without these, the central empirical claims that the signal is 'reliably extracted' and produces 'predictable changes' cannot be assessed for robustness or reproducibility.

minor comments (2)

The abstract summarizes geometric findings only at a high level ('structured and interpretable properties'); a brief quantitative description or reference to a specific figure would improve clarity.
Notation for the 'low-dimensional subspace' and 'inferred timestep' should be defined more precisely when first introduced to avoid ambiguity with related concepts such as noise level or generation step.
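
A quantitative description of the kind the first minor comment asks for could start from the pairwise cosine similarity between the per-timestep vectors. The sketch below is a toy under stated assumptions: the vectors sweep a half circle in a 2D plane of a 32-dimensional space, chosen only so that the matrix shows distant indices anti-aligned and mid-range indices roughly perpendicular. It does not use the paper's vectors, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

# Toy vectors indexed by t = 1..100 that rotate by 180 degrees in total.
T, d = 100, 32
theta = np.linspace(0.0, np.pi, T)
M = np.zeros((T, d))
M[:, 0] = np.cos(theta)
M[:, 1] = np.sin(theta)

S = cosine_similarity_matrix(M)
print(f"first vs last index:   {S[0, -1]:+.3f}")
print(f"first vs middle index: {S[0, 49]:+.3f}")
```

Reading off a few entries of `S`, or plotting it as a heatmap, turns a phrase like "structured geometry" into numbers a referee can check.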

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the manuscript requires additional methodological details and stronger controls in the steering experiments to support the causal claims. We outline our responses below and will incorporate the necessary revisions.

Referee: [steering experiments] The steering experiments (described in the abstract) report changes in confidence and entropy after intervening along the probed direction, but these downstream metrics can be altered by many other activation-space directions (e.g., those correlated with token entropy, position, or output length). The manuscript provides no evidence of an intervention that isolates the inferred timestep while holding other generation properties fixed, nor a direct measurement that the effective noise schedule changes. This assumption is load-bearing for the claim that the subspace is a causal encoding of denoising progress that the model functionally uses.

Authors: We acknowledge that the existing steering results do not include explicit controls to isolate the timestep direction from correlated factors such as token entropy or position, nor do they provide a direct measurement of changes to the effective noise schedule. The direction used for steering was obtained from probes trained specifically to recover the timestep, and the resulting shifts in confidence and entropy are consistent with altered denoising progress. In the revision we will add controlled experiments that hold other generation properties fixed and report any feasible measurements of the noise schedule to strengthen the causal interpretation. revision: yes
Referee: [methods and experiments] No details are supplied on probe training procedures, steering implementation, datasets, statistical controls, or quantitative results (e.g., probe accuracies, steering magnitudes, or baseline comparisons). Without these, the central empirical claims that the signal is 'reliably extracted' and produces 'predictable changes' cannot be assessed for robustness or reproducibility.

Authors: We agree that the current manuscript lacks the necessary methodological details. The revised version will include complete descriptions of probe training procedures, steering implementation, datasets, statistical controls, probe accuracies, steering magnitudes, and baseline comparisons to enable reproducibility and evaluation of the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing and steering results are self-contained.

The paper's claims rest on experimental extraction of signals from model activations via probes and subsequent steering interventions. No equations, fitted parameters, or self-citations are presented that reduce any reported 'prediction' or 'result' to the inputs by construction. The derivation chain consists of observable measurements (decodability across layers, changes in confidence/entropy under steering) that do not loop back to definitions or prior author work as load-bearing premises. This is the standard non-circular outcome for an empirical interpretability study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete equations, training details, or modeling choices, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

62 extracted references · 24 canonical work pages · 13 internal anchors

[1] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data. arXiv preprint arXiv:2406.03736.
[2] Sparse Autoencoders Find Highly Interpretable Features in Language Models. 2023.
[3] Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson. Transformer Circuits Thread.
[4] DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders. 2026.
[5] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs. 2026.
[6] Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models. 2025.
[7] Geometry of Decision Making in Language Models. 2025.
[8] Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling. 2026.
[9] Transformers represent belief state geometry in their residual stream. 2025.
[10] The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235.
[11] Symmetry in language statistics shapes the geometry of model representations. arXiv preprint arXiv:2602.15029.
[12] Not all language model features are one-dimensionally linear. International Conference on Learning Representations.
[13] Large Language Models Encode Semantics and Alignment in Linearly Separable Representations. 2026.
[14] Do Sparse Autoencoders Capture Concept Manifolds? 2026.
[15] How Do LLMs Use Their Depth? arXiv preprint arXiv:2510.18871.
[16] Transformer feed-forward layers are key-value memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[17] What does BERT learn about the structure of language? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[18] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models. arXiv preprint arXiv:2508.09138.
[19] Attention Sinks in Diffusion Language Models. arXiv preprint arXiv:2510.15731.
[20] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models. arXiv preprint arXiv:2601.14758.
[21] The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv preprint arXiv:2310.06824.
[22] Discovering Latent Knowledge in Language Models Without Supervision. arXiv preprint arXiv:2212.03827.
[23] Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037.
[24] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Julian Hubinger, Alexander Turner. Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681.
[25] Weixuan Wang, Jingyuan Yang, Wei Peng. arXiv preprint arXiv:2410.12299.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Kaiser, … Attention is all you need.
[27] The … 2024.
[28] Introducing … 2025.
[29] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and others.
[30] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, … Qwen3 Technical Report.
[31] System Card: … 2025.
[32] Learning to Reason with Large Language Models. 2024.
[33] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.
[34] Structured Denoising Diffusion Models in Discrete State-Spaces. 2023.
[35] Large Language Diffusion Models. 2025.
[36] Dream 7B: Diffusion Large Language Models. 2025.
[37] LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[38] Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin.
[39] Generalized Discrete Diffusion from Snapshots. arXiv preprint arXiv:2603.21342.
[40] Simple and Effective Masked Diffusion Language Models. 2024.
[41] Simplified and Generalized Masked Diffusion for Discrete Data. 2025.
[42] Generalized Interpolating Discrete Diffusion. 2025.
[43] The Diffusion Duality. 2025.
[44] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling. 2025.
[45] SDAR: A synergistic diffusion-autoregression paradigm for scalable sequence generation. arXiv preprint arXiv:2510.06303.
[46] Block diffusion: Interpolating between autoregressive and diffusion language models. International Conference on Learning Representations.
[47] Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models. arXiv preprint arXiv:2605.10971.
[48] Diffusion Language Models Are Natively Length-Aware. arXiv preprint arXiv:2603.06123.
[49] Introspective Diffusion Language Models. arXiv preprint arXiv:2604.11035.
[50] Understanding and Accelerating the Training of Masked Diffusion Language Models. arXiv preprint arXiv:2605.13026.
[51]

2025 , eprint=

Fast-dLLM v2: Efficient Block-Diffusion LLM , author=. 2025 , eprint=

2025
[52]

arXiv preprint arXiv:2508.13021 , year=

Pc-sampler: Position-aware calibration of decoding bias in masked diffusion models , author=. arXiv preprint arXiv:2508.13021 , year=

work page arXiv
[53]

2024 , eprint=

Better & Faster Large Language Models via Multi-token Prediction , author=. 2024 , eprint=

2024
[54]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021
[56]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

2021
[57]

2021 , eprint=

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies , author=. 2021 , eprint=

2021
[58]

Humanity's Last Exam

Humanity's last exam , author=. arXiv preprint arXiv:2501.14249 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Advances in neural information processing systems , volume=

Diffusion-lm improves controllable text generation , author=. Advances in neural information processing systems , volume=
[60] Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236.
[61] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. The Eleventh International Conference on Learning Representations.
[62] Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089.