Registers Matter for Pixel-Space Diffusion Transformers
Pith reviewed 2026-05-20 19:03 UTC · model grok-4.3
The pith
Register tokens improve convergence and generation quality of pixel-space DiTs by producing cleaner feature maps at high noise levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiTs operating in pixel space do not display the patch-token outliers that plague ViTs, yet register tokens still deliver clear gains in convergence speed and generation quality. Representation analysis links these gains to cleaner feature maps at high noise levels. Recent strong pixel-space DiT models already contain implicit register-like behaviors. A dual-stream architecture that dedicates separate processing streams to register tokens yields additional quality improvements while adding negligible overhead.
What carries the argument
register tokens, which produce cleaner feature maps at high noise levels without participating in the main patch-token processing
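For readers unfamiliar with the mechanism, the following is a minimal NumPy sketch of the standard register-token recipe from the ViT register paper [12]: extra learnable tokens are appended to the patch sequence, take part in attention, and are dropped at the output. How the reviewed paper wires registers into its DiT is not specified here, so the single-head attention layer, the shapes, and the function names are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over every token, patches and registers alike."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def forward_with_registers(patches, registers, w_q, w_k, w_v):
    """Append registers, apply one residual attention layer, then drop the registers."""
    n_patches = patches.shape[0]
    tokens = np.concatenate([patches, registers], axis=0)
    tokens = tokens + self_attention(tokens, w_q, w_k, w_v)
    return tokens[:n_patches]  # only patch tokens are passed on to the output head

# Toy shapes (assumptions): 64 patch tokens, 4 registers, width 16.
rng = np.random.default_rng(0)
dim, n_patches, n_registers = 16, 64, 4
patches = rng.normal(size=(n_patches, dim))
registers = rng.normal(size=(n_registers, dim))  # would be learned parameters in a real model
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
out = forward_with_registers(patches, registers, w_q, w_k, w_v)
```

The output keeps the patch-token shape, so the registers act purely as scratch space that the patch tokens can read from and write to through attention.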
If this is right
- Faster convergence during training of pixel-space DiTs
- Higher quality in final generated images
- Cleaner intermediate representations during high-noise diffusion steps
- Implicit register-like mechanisms may partly explain the strong performance of recent pixel-space DiT architectures
- A dual-stream design achieves further gains with almost no extra compute
Where Pith is reading between the lines
- The same register mechanism may help other transformer-based generative models beyond diffusion
- Specialized token streams could become a standard design choice in future diffusion transformers
- The benefit might be tested at different noise schedules or resolutions to isolate the effect
- Register tokens could reduce the need for heavy regularization techniques in pixel-space training
Load-bearing premise
The performance gains come specifically from cleaner feature maps at high noise levels rather than from changes in optimization dynamics or other unmeasured factors.
What would settle it
An experiment that measures feature-map cleanliness at high noise levels and finds no correlation with the observed quality gains from register tokens would undercut the proposed mechanism.
Original abstract
Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by \textit{register tokens}. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
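The outlier claim in the abstract can be checked with a simple norm test. The sketch below flags tokens whose L2 norm is far above the median norm; the ratio threshold of 5 and the function name are assumptions chosen for illustration, not the criterion used in the paper.

```python
import numpy as np

def find_norm_outliers(tokens, ratio=5.0):
    """Return indices of tokens whose L2 norm exceeds `ratio` times the median norm."""
    norms = np.linalg.norm(tokens, axis=-1)
    threshold = ratio * np.median(norms)
    return np.flatnonzero(norms > threshold), norms

# Synthetic example: 196 patch tokens of width 32, two of them inflated by hand.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))
tokens[[3, 77]] *= 20.0  # artificial high-norm outliers
outliers, norms = find_norm_outliers(tokens)
```

Applied to intermediate DiT activations, an empty result across layers and noise levels would be consistent with the reported absence of patch-token outliers.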
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the role of register tokens in pixel-space Diffusion Transformers (DiTs). Unlike Vision Transformers, the authors find that DiTs lack high-norm patch-token outliers. Nevertheless, they report that adding register tokens improves convergence speed and generation quality. Intermediate representation analysis indicates that register tokens yield cleaner feature maps at high noise levels, which the authors suggest may explain the gains. They further note that recent pixel-space DiT designs appear to incorporate implicit register-like mechanisms and introduce a parameter-efficient dual-stream architecture that dedicates separate processing streams to register tokens, achieving quality improvements with negligible runtime cost.
Significance. If the empirical gains and mechanistic interpretation hold under rigorous controls, the work would offer actionable guidance for designing pixel-space diffusion transformers and clarify why register tokens remain useful even without the outlier problem that motivated them in ViTs. The dual-stream proposal is practically attractive because of its low overhead. The observation that state-of-the-art pixel-space models already embed register-like behavior is a useful retrospective insight. These contributions could influence architectural choices in future diffusion models, provided the causal attribution is strengthened.
major comments (2)
- [Representation analysis section] The central attribution in the representation analysis—that cleaner feature maps at high noise levels are responsible for the observed convergence and quality gains—is purely observational. No intervention, ablation, or controlled experiment isolates this mechanism from alternative explanations such as changes in optimization dynamics, gradient flow, or implicit regularization induced by the extra tokens. Because the motivating ViT outlier-suppression benefit is explicitly absent, this untested causal premise is load-bearing for the headline claim.
- [Experimental results] The experimental results reporting improved convergence and generation quality do not include statistical significance tests across multiple random seeds, detailed baseline comparisons that hold all other hyperparameters fixed, or ablations on register-token count and placement. Without these controls it is difficult to quantify the reliability and magnitude of the claimed benefits.
minor comments (2)
- The abstract and introduction would benefit from explicit statements of the datasets, metrics (e.g., FID, precision/recall), and training budgets used to measure generation quality.
- [Dual-stream architecture description] Notation for the dual-stream architecture (e.g., how the two streams interact at each layer) should be defined more formally, perhaps with a diagram or pseudocode, to aid reproducibility.
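As an illustration of the kind of pseudocode the last comment asks for, here is one hypothetical wiring of a dual-stream block. It is an assumption, not the authors' definition: both streams share a joint attention step, then patch tokens and register tokens pass through separate MLP weights. Which sublayers are actually shared or specialized in the paper is not stated in this review.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # two-layer ReLU MLP

def dual_stream_block(patches, registers, attn_w, patch_mlp, reg_mlp):
    """Joint attention over both streams, followed by stream-specific MLPs."""
    n = patches.shape[0]
    tokens = np.concatenate([patches, registers], axis=0)
    w_q, w_k, w_v = attn_w
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    tokens = tokens + softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v  # streams interact here
    patches, registers = tokens[:n], tokens[n:]
    patches = patches + mlp(patches, *patch_mlp)    # patch-stream parameters
    registers = registers + mlp(registers, *reg_mlp)  # register-stream parameters
    return patches, registers

rng = np.random.default_rng(0)
dim, hidden = 16, 32

def init(*shape):
    return rng.normal(size=shape) / np.sqrt(shape[0])

attn_w = (init(dim, dim), init(dim, dim), init(dim, dim))
patch_mlp = (init(dim, hidden), init(hidden, dim))
reg_mlp = (init(dim, hidden), init(hidden, dim))
new_patches, new_registers = dual_stream_block(
    init(64, dim), init(4, dim), attn_w, patch_mlp, reg_mlp
)
```

Because the register stream contains only a handful of tokens, its extra MLP adds parameters but little runtime in this sketch, which is at least compatible with the negligible-overhead claim.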
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the causal interpretation and experimental rigor.
-
Referee: [Representation analysis section] The central attribution in the representation analysis—that cleaner feature maps at high noise levels are responsible for the observed convergence and quality gains—is purely observational. No intervention, ablation, or controlled experiment isolates this mechanism from alternative explanations such as changes in optimization dynamics, gradient flow, or implicit regularization induced by the extra tokens. Because the motivating ViT outlier-suppression benefit is explicitly absent, this untested causal premise is load-bearing for the headline claim.
Authors: We agree that the current analysis is observational and does not include interventions that would isolate the proposed mechanism from alternatives such as optimization dynamics or regularization effects. In the revision we will add controlled ablations that compare feature-map statistics and performance when register tokens are present versus absent while monitoring gradient norms and loss landscapes. We will also revise the language in the representation section to more clearly frame the cleaner feature maps as a consistent correlate rather than a proven causal driver, while retaining the empirical observation that register tokens improve results even in the absence of patch-token outliers.
Revision: yes
-
Referee: [Experimental results] The experimental results reporting improved convergence and generation quality do not include statistical significance tests across multiple random seeds, detailed baseline comparisons that hold all other hyperparameters fixed, or ablations on register-token count and placement. Without these controls it is difficult to quantify the reliability and magnitude of the claimed benefits.
Authors: We acknowledge these gaps in statistical rigor and controls. In the revised manuscript we will rerun the main experiments with at least three random seeds, reporting means and standard deviations for convergence curves and FID scores. We will also present baseline comparisons in which all other hyperparameters remain fixed and add ablations that vary both the number of register tokens and their placement within the transformer blocks.
Revision: yes
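The multi-seed reporting described in this response can be sketched with the Python standard library. The mean and sample standard deviation follow the response directly; Welch's t statistic is added here as one possible significance measure and is an assumption, since neither the referee nor the authors name a specific test. No scores from the paper are used.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Mean and sample standard deviation of per-seed scores (for example, FID)."""
    return mean(scores), stdev(scores)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))
```

With only three seeds per configuration the statistic has few degrees of freedom, so it is best read alongside the raw per-seed values rather than as a verdict on its own.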
Circularity Check
No significant circularity; empirical observations remain independent of fitted inputs
The paper is an empirical study demonstrating that register tokens improve convergence and generation quality in pixel-space DiTs despite the absence of ViT-style patch-token outliers. It supports this via direct experiments and observational representation analysis showing cleaner feature maps at high noise levels. No equations, derivations, or first-principles results are presented that reduce the reported performance gains to quantities defined, fitted, or predicted from within the same experiment. Claims rest on external benchmarks and measurements rather than self-referential definitions or self-citation chains that would force the outcome by construction. Any citations to prior register-token work are standard background and not load-bearing for the DiT-specific findings.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard assumptions about optimization dynamics and evaluation metrics in diffusion model training hold for the tested architectures.
Reference graph
Works this paper leans on
- [1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [2] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- [3] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
- [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- [6] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [7] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
- [8] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021.
- [9] Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T Freeman. Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414, 2022.
- [10] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2(3):4, 2021.
- [11] Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. TokenCut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE transactions on pattern analysis and machine intelligence, 45(12):15790–15801, 2023.
- [12] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
- [13] Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. Vision transformers don’t need trained registers. arXiv preprint arXiv:2506.08010, 2025.
- [14] Alexander Lappe and Martin A Giese. Register and [cls] tokens induce a decoupling of local and global features in large ViTs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [15] Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. arXiv preprint arXiv:2602.22394, 2026.
- [16] Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, and Andrew F Luo. Vision transformers with self-distilled registers. arXiv preprint arXiv:2505.21501, 2025.
- [17] Haoqi Wang, Tong Zhang, and Mathieu Salzmann. SINDER: Repairing the singular defects of DINOv2. In European Conference on Computer Vision, pages 20–35. Springer, 2024.
- [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
- [20] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
- [21] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
- [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [23] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- [24] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.
- [25] Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. PixelDiT: Pixel diffusion transformers for image generation. arXiv preprint arXiv:2511.20645, 2025.
- [26] Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158, 2026.
- [27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [28] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [29] Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, et al. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026.
- [30] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
- [31] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024.
- [32] Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, and Alessio Devoto. Attention sinks in diffusion language models. arXiv preprint arXiv:2510.15731, 2025.
- [33] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.
- [34] Amna Jamal, Mika Tan, Clarissa Aurelia Nahid Saputra, Quan Huynh, Kevin Zhu, and Antonio Mari. Diffusion transformers use sink registers. In Second Workshop on XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge, 2026.
- [35] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.
- [36] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
- [37] Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025.
- [38] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- [39] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In The Fourteenth International Conference on Learning Representations, 2026.
- [40] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
- [41] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- [42] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.
- [43] Black Forest Labs. FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison, 2025.
- [44] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.
- [45] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.
- [46] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
- [47] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [48] Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V Vo. Revisiting [cls] and patch token interaction in vision transformers. arXiv preprint arXiv:2602.08626, 2026.
- [49] Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
- [50] Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908, 2025.
- [51] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025.
- [52] Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu, and Shaosheng Cao. One token is enough: Improving diffusion language models with a sink token. arXiv preprint arXiv:2601.19657, 2026.
- [53] Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, and Ashwinee Panda. Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317, 2025.
- [54] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. MotionStream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025.
- [55] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025.
- [56] Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image diffusion model. Advances in Neural Information Processing Systems, 37:55342–55369, 2024.
- [57] Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models. arXiv preprint arXiv:2502.06415, 2025.
- [58] Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing diffusion transformers for visual correspondence by modulating massive activations. Advances in Neural Information Processing Systems, 38:114432–114462, 2026.
- [59] Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, and Weiyao Lin. Massive activations are the key to local detail synthesis in diffusion transformers. arXiv preprint arXiv:2510.11538, 2025.
- [60] Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think. arXiv preprint arXiv:2603.14366, 2026.
A Related Work
Attention Sinks in Large Language Models. In autoregressive LLMs, attention sinks are a well-explored area [30, 49, 31, 50, 51]. [30] first analyzes anomalies in attention an...
The proposed architecture introduces an additional parameter overhead of approximately 14%. We train the models using the same training and inference configuration as in JiT [24]. For both configurations, we use LoRA with rank 128 in AdaLN.
C Additional Analysis Results
C.1 Analysis on ImageNet
Outliers in DiTs. In the main text, we show that DiTs are free f...