
arxiv: 2605.16147 · v1 · pith:674KEV4Rnew · submitted 2026-05-15 · 💻 cs.CV

Registers Matter for Pixel-Space Diffusion Transformers

Pith reviewed 2026-05-20 19:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords register tokens · diffusion transformers · pixel-space generation · feature maps · high noise levels · dual-stream architecture · vision transformers

The pith

Register tokens improve convergence and generation quality of pixel-space DiTs by producing cleaner feature maps at high noise levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether register tokens, which mitigate high-norm outlier artifacts in vision transformers, also help diffusion transformers trained directly in pixel space. Although pixel-space DiTs lack the high-norm patch-token outliers common in ViTs, adding register tokens still leads to faster training convergence and better final image quality. Analysis of intermediate layers shows that register tokens keep feature maps cleaner, especially at the high noise levels of the diffusion process. The authors suggest this may also partially account for the strong results of recent pixel-space DiT designs, which already embed register-like mechanisms implicitly. The work investigates a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves quality with negligible runtime overhead.

Core claim

DiTs operating in pixel space do not display the patch-token outliers that plague ViTs, yet register tokens still deliver clear gains in convergence speed and generation quality. Representation analysis links these gains to cleaner feature maps at high noise levels. Recent strong pixel-space DiT models already contain implicit register-like behaviors. A dual-stream architecture that dedicates separate processing streams to register tokens yields additional quality improvements while adding negligible overhead.
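The basic register-token mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes registers are extra tokens appended to the patch sequence, mixed with the patches by attention, and discarded at the output. The helper names (`self_attention`, `forward_with_registers`), the identity Q/K/V projections, and the random register values are all hypothetical; in a real model the registers would be learned parameters.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections (an assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def forward_with_registers(patch_tokens, register_tokens):
    """Append registers, let all tokens attend to each other, then drop the registers."""
    mixed = self_attention(patch_tokens + register_tokens)
    return mixed[: len(patch_tokens)]


rng = random.Random(0)
patches = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
# Hypothetical stand-in: real registers are learned, not sampled.
registers = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

out = forward_with_registers(patches, registers)
print(len(out), len(out[0]))  # 16 8
```

The output keeps the patch-token shape, so in this sketch registers add sequence length inside the block but leave the downstream interface unchanged.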

What carries the argument

register tokens, which produce cleaner feature maps at high noise levels without participating in the main patch-token processing

If this is right

  • Faster convergence during training of pixel-space DiTs
  • Higher quality in final generated images
  • Cleaner intermediate representations during high-noise diffusion steps
  • Implicit register-like mechanisms may partly explain the success of recent pixel-space DiT architectures
  • A dual-stream design achieves further gains with almost no extra compute

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same register mechanism may help other transformer-based generative models beyond diffusion
  • Specialized token streams could become a standard design choice in future diffusion transformers
  • The benefit might be tested at different noise schedules or resolutions to isolate the effect
  • Register tokens could reduce the need for heavy regularization techniques in pixel-space training

Load-bearing premise

The performance gains come specifically from cleaner feature maps at high noise levels rather than from changes in optimization dynamics or other unmeasured factors.

What would settle it

An experiment that measures feature-map cleanliness at high noise levels and finds no correlation with the observed quality gains from register tokens would disprove the proposed mechanism.
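Feature-map cleanliness is quantified in the paper's figures with a Total Variation (TV) ratio (with registers / without registers). The sketch below shows one way such a ratio could be computed. It assumes an anisotropic TV over a single-channel 2D map; the paper's exact definition is not given on this page, and the function names and toy inputs are hypothetical.

```python
def total_variation(feature_map):
    """Anisotropic TV: sum of absolute differences between adjacent cells."""
    rows, cols = len(feature_map), len(feature_map[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                tv += abs(feature_map[i + 1][j] - feature_map[i][j])
            if j + 1 < cols:
                tv += abs(feature_map[i][j + 1] - feature_map[i][j])
    return tv


def tv_ratio(with_registers, without_registers):
    """Values below 1 mean the map from the model with registers is smoother."""
    return total_variation(with_registers) / total_variation(without_registers)


# Toy maps for illustration only: a smooth gradient and a checkerboard.
smooth = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
noisy = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

print(round(tv_ratio(smooth, noisy), 3))  # 0.1
```

To run the test described above, one would compute this ratio per block and timestep and check whether it tracks the quality gain from registers.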

Figures

Figures reproduced from arXiv: 2605.16147 by Artem Babenko, Dmitry Baranchuk, Ilia Sudakov, Ilya Drobyshevskiy, Nikita Starodubcev.

Figure 1. Diffusion transformers do not exhibit attention-map outliers. Unlike ViTs, where attention-map anomalies typically appear in low-information regions (e.g., background), DiT attention remains focused on the main objects. Contributions. We find that, unlike ViTs, diffusion transformers in both latent and pixel spaces do not exhibit noticeable high-norm outliers among patch tokens. Instead, patch-token norms …
Figure 2. Without Registers. (a) In DINOv2, anomalies are localized to few image tokens, which exhibit significantly higher norms than others. (b) In contrast, no outliers are observed for pDiTs, suggesting that registers may be unnecessary in this case.
[Plot: token norm vs. token index; panel "pDiT-B/16, With Registers", blocks 5 and 10 …]
Figure 3. With Registers. (a) As expected, introducing register tokens in DINOv2 shifts high-norm outliers into these tokens. (b) Interestingly, pDiTs also exhibit high-norm tokens in the added registers, even though such outliers are absent without registers.
[Adjacent table, truncated: columns "w/o reg.", "w/ reg.", "w/ in-context" for pDiT-B/16 (131M), pDiT-L/16 (459M), and pDiT-H/16 (953M); the epoch 200 row begins 7.39, 5.…]
Figure 4. Register tokens consistently reduce feature norms across patch tokens. We measure feature norms for image tokens only (excluding registers) at three diffusion timesteps and observe a consistent reduction across all tokens when registers are used.
[Panels: original image; pDiT-B/16 with registers; pDiT-B/16 without registers. Heatmap axes: block, timestep …]
Figure 5. Register tokens improve intermediate representations. (a) We compute the Total Variation (TV) of intermediate features for models with and without register tokens. We report the ratio (with / without registers), where lower values indicate that the model with registers produces smoother features. We find that registers improve feature smoothness at high noise levels (t ∈ [0, 0.2]). (b) We visualize feature…
Figure 6. Register tokens act as both global information carriers and norm sinks. Linear probing reveals that low-norm register tokens encode meaningful global semantics and achieve strong classification accuracy, whereas low-accuracy registers exhibit extremely large norms, suggesting that they primarily function as norm sinks that absorb magnitude from patch tokens.
[Panels: input image; attention maps for register tokens (…]
Figure 7. (a) Registers with high probing accuracy encode diverse semantic information about an image, whereas (b) low-accuracy norm sinks do not. We visualize attention maps for register tokens and observe that some attend to distinct semantic regions, such as foreground objects and background areas. In contrast, norm sinks with low probing accuracy do not exhibit meaningful semantic structure.
Figure 8. In-context class tokens act as registers. (a) Certain tokens acquire disproportionately high feature norms, functioning as norm sinks. (b) Some tokens encode broad global information, rather than purely class-specific features as originally intended. 2.5 Registers Are Effective in Deeper Layers. Next, we ablate both the number of register tokens and the transformer blocks in which they are introduced. We co…
Figure 9. High-norm outliers consistently emerge within register tokens across timesteps. We visualize token-wise feature norms of pDiT-B/16 with registers for t = 0.0, 0.3, and 0.7, and observe the same behavior in all cases.
[Plot: token norm vs. token index; panels "DT-B/16, Without Registers" (blocks 2, 5, 8, 10) and "DT-B/16,…]
Figure 10. Token-wise feature norms for pDiTs of varying scales on ImageNet 256 × 256, with and without registers. Without registers, patch-token norms remain uniform across scales. Introducing registers leads to the emergence of high-norm outliers within the register tokens.
Figure 11. Token-wise feature norms for VAE-space SiTs of varying scales on ImageNet 256 × 256, with and without registers. SiTs without registers exhibit uniform patch-token norms across scales, while adding registers produces high-norm register tokens.
[Plot: token norm vs. token index; panel "RAE-S, With Registers", blocks 0, 2, 5, 8, 10 …]
Figure 12. Token-wise feature norms for DINOv2-space RAEs of varying scales on ImageNet 256 × 256, with and without registers. RAEs without registers exhibit uniform patch-token norms across scales, while adding registers produces high-norm register tokens.
Figure 13. Register tokens consistently reduce feature norms across patch tokens. We measure feature norms for image tokens only (excluding register tokens) at three diffusion timesteps for pDiT models of different scales, and observe a consistent reduction in feature norms across nearly all tokens when register tokens are used.
Figure 14. Register tokens make intermediate representations cleaner by reducing noise. We compute the Total Variation of intermediate features for models with and without register tokens. We report the ratio (with registers / without registers), where lower values indicate that models with registers produce smoother feature representations. We find that register tokens improve feature smoothness at high noise level…
Figure 15. Registers improve spatial organization at high noise levels. In addition to the TV ratio (left), we also analyze the correlation decay slope [45] (right), where lower values indicate stronger spatial organization. Both metrics show the same trend: register tokens improve internal representations at high noise levels, starting from block 4, where the registers are introduced.
Figure 16. Linear probing of register tokens under different configurations. (Left) Standard register tokens introduced from the 4th layer; (Middle) Register tokens used as in-context class embeddings introduced from the 4th layer; (Right) Standard register tokens introduced from the 0th layer. Across different timesteps, we find that introducing registers from the earliest layers produces substantially less informa…
Figure 17. Pixel-space pDiTs have the highest feature norms across all tokens for different timesteps compared to latent-space counterparts. We compare token-wise feature-map norms for pDiT, SiT, and RAE models, all without register tokens.
[Heatmap: block vs. timestep …]
Figure 18. Pixel-space pDiTs exhibit substantially higher Total Variation (TV) values than latent-space counterparts. We compare the TV ratio of intermediate feature maps across timesteps and transformer blocks for pixel-space pDiTs (pDiT-H) and latent-space models (SiT-XL and RAE-XL) without registers. Pixel-space pDiTs consistently produce noisier intermediate representations.
Figure 19. For SSL ViTs such as DINOv2, register tokens do not reduce patch-token feature norms, unlike in DiTs. We measure feature norms across all tokens for different blocks and model sizes of DINOv2, and observe that register tokens do not consistently reduce feature norms of patch tokens.
[Plot: token norm vs. token index; SD3.5, norms of text (left) and image tokens (right) b…]
Figure 20. Text sequences in text-to-image diffusion models exhibit behavior similar to register tokens in ImageNet-based DiTs: some tokens become high-norm outliers and potentially act as registers. We measure token-wise feature norms in SD3.5 (left) and FLUX (right) for both text and image tokens. We observe that the outliers primarily emerge within the text sequence.
Original abstract

Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the role of register tokens in pixel-space Diffusion Transformers (DiTs). Unlike Vision Transformers, the authors find that DiTs lack high-norm patch-token outliers. Nevertheless, they report that adding register tokens improves convergence speed and generation quality. Intermediate representation analysis indicates that register tokens yield cleaner feature maps at high noise levels, which the authors suggest may explain the gains. They further note that recent pixel-space DiT designs appear to incorporate implicit register-like mechanisms and introduce a parameter-efficient dual-stream architecture that dedicates separate processing streams to register tokens, achieving quality improvements with negligible runtime cost.

Significance. If the empirical gains and mechanistic interpretation hold under rigorous controls, the work would offer actionable guidance for designing pixel-space diffusion transformers and clarify why register tokens remain useful even without the outlier problem that motivated them in ViTs. The dual-stream proposal is practically attractive because of its low overhead. The observation that state-of-the-art pixel-space models already embed register-like behavior is a useful retrospective insight. These contributions could influence architectural choices in future diffusion models, provided the causal attribution is strengthened.

major comments (2)
  1. [Representation analysis section] The central attribution in the representation analysis—that cleaner feature maps at high noise levels are responsible for the observed convergence and quality gains—is purely observational. No intervention, ablation, or controlled experiment isolates this mechanism from alternative explanations such as changes in optimization dynamics, gradient flow, or implicit regularization induced by the extra tokens. Because the motivating ViT outlier-suppression benefit is explicitly absent, this untested causal premise is load-bearing for the headline claim.
  2. [Experimental results] The experimental results reporting improved convergence and generation quality do not include statistical significance tests across multiple random seeds, detailed baseline comparisons that hold all other hyperparameters fixed, or ablations on register-token count and placement. Without these controls it is difficult to quantify the reliability and magnitude of the claimed benefits.
minor comments (2)
  1. The abstract and introduction would benefit from explicit statements of the datasets, metrics (e.g., FID, precision/recall), and training budgets used to measure generation quality.
  2. [Dual-stream architecture description] Notation for the dual-stream architecture (e.g., how the two streams interact at each layer) should be defined more formally, perhaps with a diagram or pseudocode, to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the causal interpretation and experimental rigor.

  1. Referee: [Representation analysis section] The central attribution in the representation analysis—that cleaner feature maps at high noise levels are responsible for the observed convergence and quality gains—is purely observational. No intervention, ablation, or controlled experiment isolates this mechanism from alternative explanations such as changes in optimization dynamics, gradient flow, or implicit regularization induced by the extra tokens. Because the motivating ViT outlier-suppression benefit is explicitly absent, this untested causal premise is load-bearing for the headline claim.

    Authors: We agree that the current analysis is observational and does not include interventions that would isolate the proposed mechanism from alternatives such as optimization dynamics or regularization effects. In the revision we will add controlled ablations that compare feature-map statistics and performance when register tokens are present versus absent while monitoring gradient norms and loss landscapes. We will also revise the language in the representation section to more clearly frame the cleaner feature maps as a consistent correlate rather than a proven causal driver, while retaining the empirical observation that register tokens improve results even in the absence of patch-token outliers. revision: yes

  2. Referee: [Experimental results] The experimental results reporting improved convergence and generation quality do not include statistical significance tests across multiple random seeds, detailed baseline comparisons that hold all other hyperparameters fixed, or ablations on register-token count and placement. Without these controls it is difficult to quantify the reliability and magnitude of the claimed benefits.

    Authors: We acknowledge these gaps in statistical rigor and controls. In the revised manuscript we will rerun the main experiments with at least three random seeds, reporting means and standard deviations for convergence curves and FID scores. We will also present baseline comparisons in which all other hyperparameters remain fixed and add ablations that vary both the number of register tokens and their placement within the transformer blocks. revision: yes
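The multi-seed reporting promised in the second response amounts to a per-configuration mean and sample standard deviation. A minimal sketch using Python's standard library is shown below. The `summarize` helper is hypothetical, and the scores are placeholders chosen for illustration only; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


# Placeholder per-seed FID values for illustration only; NOT results from the paper.
placeholder_fid = {
    "without registers": [10.0, 10.5, 9.5],
    "with registers": [8.0, 8.5, 7.5],
}

for name, scores in placeholder_fid.items():
    m, s = summarize(scores)
    print(f"{name}: {m:.2f} +/- {s:.2f} (n={len(scores)})")
```

With only three seeds the standard deviation is itself a noisy estimate, so reporting the seed count alongside it helps readers judge the comparison.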

Circularity Check

0 steps flagged

No significant circularity; empirical observations remain independent of fitted inputs


The paper is an empirical study demonstrating that register tokens improve convergence and generation quality in pixel-space DiTs despite the absence of ViT-style patch-token outliers. It supports this via direct experiments and observational representation analysis showing cleaner feature maps at high noise levels. No equations, derivations, or first-principles results are presented that reduce the reported performance gains to quantities defined, fitted, or predicted from within the same experiment. Claims rest on external benchmarks and measurements rather than self-referential definitions or self-citation chains that would force the outcome by construction. Any citations to prior register-token work are standard background and not load-bearing for the DiT-specific findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard deep-learning assumptions about the validity of FID or similar metrics, the representativeness of the training distribution, and that feature-map statistics at selected noise levels are causally linked to final generation quality; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: standard assumptions about optimization dynamics and evaluation metrics in diffusion model training hold for the tested architectures.
    Invoked implicitly when attributing performance gains to register tokens and cleaner feature maps.



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 16 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  2. [2]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  3. [3]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  4. [4]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  6. [6]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  7. [7]

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  8. [8]

    Localizing objects with self-supervised transformers and no labels

    Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021

  9. [9]

    Unsupervised semantic segmentation by distilling feature correspondences

    Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T Freeman. Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414, 2022

  10. [10]

    Deep vit features as dense visual descriptors

    Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2(3):4, 2021

  11. [11]

    Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE transactions on pattern analysis and machine intelligence, 45(12):15790–15801, 2023

  12. [12]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  13. [13]

    Vision transformers don’t need trained registers

    Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. Vision transformers don’t need trained registers. arXiv preprint arXiv:2506.08010, 2025

  14. [14]

    Register and [cls] tokens induce a decoupling of local and global features in large vits

    Alexander Lappe and Martin A Giese. Register and [cls] tokens induce a decoupling of local and global features in large vits. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [15]

    Vision Transformers Need More Than Registers

    Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. arXiv preprint arXiv:2602.22394, 2026

  16. [16]

    Vision transformers with self-distilled registers

    Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, and Andrew F Luo. Vision transformers with self-distilled registers. arXiv preprint arXiv:2505.21501, 2025

  17. [17]

    Sinder: Repairing the singular defects of dinov2

    Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Sinder: Repairing the singular defects of dinov2. In European Conference on Computer Vision, pages 20–35. Springer, 2024

  18. [18]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  19. [19]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

  20. [20]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  21. [21]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

  22. [22]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  23. [23]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  24. [24]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

  25. [25]

    PixelDiT: Pixel Diffusion Transformers for Image Generation

    Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. arXiv preprint arXiv:2511.20645, 2025

  26. [26]

    One-step Latent-free Image Generation with Pixel Mean Flows

    Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158, 2026

  27. [27]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  28. [28]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  29. [29]

    Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, et al. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026

  30. [30]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  31. [31]

    When Attention Sink Emerges in Language Models: An Empirical View

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024

  32. [32]

    Attention sinks in diffusion language models

    Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025

  33. [33]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  34. [34]

    Diffusion transformers use sink registers

    Amna Jamal, Mika Tan, Clarissa Aurelia Nahid Saputra, Quan Huynh, Kevin Zhu, and Antonio Mari. Diffusion transformers use sink registers. In Second Workshop on XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge, 2026

  35. [35]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

  36. [36]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

  37. [37]

    Stochastic interpolants: A unifying framework for flows and diffusions

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025

  38. [38]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  39. [39]

    Diffusion transformers with representation autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In The Fourteenth International Conference on Learning Representations, 2026

  40. [40]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

  41. [41]

    Nonlinear total variation based noise removal algorithms

    Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992

  42. [42]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  43. [43]

    FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison

    Black Forest Labs. FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison, 2025

  44. [44]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

  45. [45]

    What matters for representation alignment: Global information or spatial structure?

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

  46. [46]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  47. [47]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  48. [48]

    Revisiting [cls] and patch token interaction in vision transformers

    Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V Vo. Revisiting [cls] and patch token interaction in vision transformers. arXiv preprint arXiv:2602.08626, 2026

  49. [49]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  50. [50]

    Interpreting the repeated token phenomenon in large language models

    Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908, 2025

  51. [51]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025

  52. [52]

    One token is enough: Improving diffusion language models with a sink token

    Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu, and Shaosheng Cao. One token is enough: Improving diffusion language models with a sink token. arXiv preprint arXiv:2601.19657, 2026

  53. [53]

    Analysis of attention in video diffusion transformers

    Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, and Ashwinee Panda. Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317, 2025

  54. [54]

    Motionstream: Real-time video generation with interactive motion controls

    Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025

  55. [55]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

  56. [56]

    Towards understanding the working mechanism of text-to-image diffusion model

    Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image diffusion model. Advances in Neural Information Processing Systems, 37:55342–55369, 2024

  57. [57]

    Systematic outliers in large language models

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models. arXiv preprint arXiv:2502.06415, 2025

  58. [58]

    Unleashing diffusion transformers for visual correspondence by modulating massive activations

    Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing diffusion transformers for visual correspondence by modulating massive activations. Advances in Neural Information Processing Systems, 38:114432–114462, 2026

  59. [59]

    Massive activations are the key to local detail synthesis in diffusion transformers

    Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, and Weiyao Lin. Massive activations are the key to local detail synthesis in diffusion transformers. arXiv preprint arXiv:2510.11538, 2025

  60. [60]

    Representation alignment for just image transformers is not easier than you think

    Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think. arXiv preprint arXiv:2603.14366, 2026

    A Related Work

    Attention Sinks in Large Language Models. In autoregressive LLMs, attention sinks are a well-explored area [30, 49, 31, 50, 51]. [30] first analyzes anomalies in attention an...

    The proposed architecture introduces an additional parameter overhead of approximately 14%. We train the models using the same training and inference configuration as in JiT [24]. For both configurations, we use LoRA with rank 128 in AdaLN.

    C Additional Analysis Results

    C.1 Analysis on ImageNet

    Outliers in DiTs. In the main text, we show that DiTs are free f...