Recognition: no theorem link
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3
The pith
Scheduling images from simple to complex lets diffusion models reach baseline quality tens of thousands of steps earlier.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Data Warmup schedules training images from simple to complex: each image is scored offline with a semantic complexity metric combining foreground dominance and foreground typicality, and a temperature-controlled scheduler samples low-scoring images first while annealing toward uniform sampling. On ImageNet 256x256 with SiT backbones this improves IS by up to 6.11 and FID by up to 3.41, while matching baseline quality tens of thousands of iterations sooner than uniform sampling.
What carries the argument
Offline semantic complexity metric (foreground dominance plus foreground typicality) that drives a temperature-controlled sampler to order images from low to high complexity.
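The scheduler is described here only in words; as a rough sketch, a temperature-controlled sampler over precomputed complexity scores could look like the following. The softmax form, the `t_start`/`t_end` values, and the geometric annealing schedule are illustrative assumptions, not the paper's reported parameters.

```python
import numpy as np

def sampling_weights(scores, temperature):
    """Softmax over negative complexity scores: at low temperature the
    distribution concentrates on the simplest images; at high
    temperature it flattens toward uniform."""
    logits = -np.asarray(scores, dtype=np.float64) / temperature
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def anneal_temperature(step, total_steps, t_start=0.05, t_end=50.0):
    """Geometric interpolation from a sharp (low) temperature to a
    flat (high) one; by total_steps the weights are near-uniform."""
    frac = min(step / total_steps, 1.0)
    return t_start * (t_end / t_start) ** frac

scores = np.array([0.1, 0.4, 0.9])  # toy per-image complexity scores
early = sampling_weights(scores, anneal_temperature(0, 100_000))
late = sampling_weights(scores, anneal_temperature(100_000, 100_000))
# early concentrates on the lowest-score image; late is close to uniform
```

The weights would be passed to a weighted data sampler; since scores are fixed offline, only the temperature changes per iteration, so the per-step overhead is negligible.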
If this is right
- Equivalent IS and FID scores are reached tens of thousands of iterations earlier than with uniform sampling.
- Gains appear consistently across SiT backbones from S/2 through XL/2 on ImageNet 256x256.
- The curriculum combines directly with other accelerators such as REPA.
- Reversing the order to hard-first degrades performance below the uniform baseline.
- Only a one-time offline preprocessing pass of roughly ten minutes is required.
Where Pith is reading between the lines
- The same offline scoring idea could be tested on video or 3D data where early exposure to simple examples might likewise reduce wasted gradient steps.
- If the complexity metric correlates with per-image loss curves in the first few thousand iterations, it could serve as a cheap proxy for online difficulty estimation.
- Scaling the curriculum to much larger datasets would require checking whether the foreground-based scores remain predictive when class diversity and object scale vary more widely.
Load-bearing premise
The foreground-dominance-plus-typicality score correctly identifies which images a randomly initialized network can usefully learn from in the earliest training stages.
What would settle it
Applying the reversed hard-first curriculum on the same ImageNet and SiT setups and finding that it matches or exceeds the uniform baseline would show that the simple-to-complex order is not what drives the reported gains.
Original abstract
A key inefficiency in diffusion training occurs when a randomly initialized network, lacking visual priors, encounters gradients from the full complexity spectrum--most of which it lacks the capacity to resolve. We propose Data Warmup, a curriculum strategy that schedules training images from simple to complex without modifying the model or loss. Each image is scored offline by a semantic-aware complexity metric combining foreground dominance (how much of the image salient objects occupy) and foreground typicality (how closely the salient content matches learned visual prototypes). A temperature-controlled sampler then prioritizes low-complexity images early and anneals toward uniform sampling. On ImageNet 256x256 with SiT backbones (S/2 to XL/2), Data Warmup improves IS by up to 6.11 and FID by up to 3.41, reaching baseline quality tens of thousands of iterations earlier. Reversing the curriculum (exposing hard images first) degrades performance below the uniform baseline, confirming that the simple-to-complex ordering itself drives the gains. The method combines with orthogonal accelerators such as REPA and requires only ~10 minutes of one-time preprocessing with zero per-iteration overhead.
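As a hedged illustration of how the two metric components might be computed: the paper's exact saliency model, prototype source, and combination rule are not given here, so the function names and the averaging of the two deficits below are one plausible form, not the authors' implementation.

```python
import numpy as np

def foreground_dominance(saliency_mask):
    """Fraction of pixels covered by the salient foreground (binary mask)."""
    return float(saliency_mask.mean())

def foreground_typicality(foreground_embedding, prototypes):
    """Max cosine similarity between the foreground feature vector and a
    bank of visual prototypes (e.g. class-mean embeddings from a
    pretrained encoder such as DINOv2)."""
    e = foreground_embedding / np.linalg.norm(foreground_embedding)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return float((p @ e).max())

def complexity_score(saliency_mask, foreground_embedding, prototypes):
    """High dominance and high typicality -> low complexity. Averaging
    the two deficits is an assumed combination rule for illustration."""
    d = foreground_dominance(saliency_mask)
    t = foreground_typicality(foreground_embedding, prototypes)
    return 1.0 - 0.5 * (d + t)
```

Because both components come from a fixed pretrained encoder and saliency model, the whole dataset can be scored in a single offline pass, which is what makes the ~10-minute preprocessing claim plausible.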
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Data Warmup, a curriculum strategy for diffusion model training that orders images from low to high semantic complexity using an offline metric combining foreground dominance and foreground typicality (computed against pretrained prototypes). A temperature-controlled sampler prioritizes low-complexity images early and anneals to uniform sampling. On ImageNet 256x256 with SiT backbones (S/2 to XL/2), the method reports IS improvements of up to 6.11 and FID improvements of up to 3.41, faster convergence to baseline quality, a reversed curriculum that underperforms the uniform baseline, and compatibility with REPA, all with ~10 minutes of one-time preprocessing and no per-iteration cost.
Significance. If the central claim holds, the work offers a lightweight, model- and loss-agnostic approach to improving training efficiency for diffusion models by aligning data difficulty with early-stage model capacity. The scale-consistent empirical gains and the reversal control are positive features that support the value of ordered curricula. The approach could reduce wall-clock time for large-scale generative training without architectural changes.
Major comments (2)
- [§3.2] §3.2 (Complexity Metric): Foreground typicality is defined using prototypes from a pretrained model. This introduces external semantic knowledge unavailable to a randomly initialized SiT at step 0. The manuscript should show that the selected images yield higher gradient signal-to-noise or lower initial loss for the target architecture (e.g., via direct comparison of per-image loss or gradient statistics at initialization) rather than relying solely on downstream IS/FID gains.
- [§4.2] §4.2 (Ablations and Controls): The reversal experiment demonstrates that ordering direction matters, but does not test whether the specific semantic metric outperforms simpler non-semantic heuristics (e.g., image variance, edge density, or low-frequency content) that might produce similar non-uniform sampling. Without such a control, the reported speedups could be reproduced by any sampler favoring the same low-variance statistics, weakening the claim that semantic complexity is the operative factor.
Minor comments (2)
- [Table 1] Table 1: Report the exact temperature schedule parameters (initial T, decay rate) used for each SiT scale to ensure full reproducibility.
- [§5] §5 (Related Work): Expand discussion of prior curriculum and data-ordering methods in diffusion and generative modeling to better situate the contribution relative to existing empirical schedules.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, providing our response and indicating planned revisions to strengthen the manuscript.
Point-by-point responses
Referee: [§3.2] §3.2 (Complexity Metric): Foreground typicality is defined using prototypes from a pretrained model. This introduces external semantic knowledge unavailable to a randomly initialized SiT at step 0. The manuscript should show that the selected images yield higher gradient signal-to-noise or lower initial loss for the target architecture (e.g., via direct comparison of per-image loss or gradient statistics at initialization) rather than relying solely on downstream IS/FID gains.
Authors: We acknowledge that the foreground typicality component relies on prototypes from a pretrained model, thereby incorporating semantic knowledge not available to a randomly initialized SiT. This is a deliberate design choice to create an offline, one-time complexity score (~10 minutes preprocessing) that remains model- and loss-agnostic during training. To directly address the request for evidence at initialization, we will add to the revised manuscript an analysis of per-image losses and gradient norms (signal-to-noise) computed on the randomly initialized SiT for images ranked low versus high by our metric. This will provide initial-training-signal evidence in addition to the reported IS/FID curves. Revision planned: yes.
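Such an initialization-time check could be sketched as follows. The linear interpolant and epsilon-prediction loss below are a simplified stand-in for the actual SiT interpolant objective, and `model` is any callable, not the authors' code.

```python
import numpy as np

def per_image_init_loss(model, images, n_noise_draws=8, seed=0):
    """Mean denoising loss per image under a simple linear interpolant,
    averaged over several noise/timestep draws to reduce variance.
    `model` maps a noisy batch to a noise prediction; in practice it
    would be the randomly initialized network under study."""
    rng = np.random.default_rng(seed)
    n = images.shape[0]
    losses = np.zeros(n)
    for _ in range(n_noise_draws):
        t = rng.random((n, 1, 1, 1))          # uniform timesteps in [0, 1)
        noise = rng.standard_normal(images.shape)
        noisy = (1 - t) * images + t * noise  # interpolate data and noise
        pred = model(noisy)
        losses += ((pred - noise) ** 2).reshape(n, -1).mean(axis=1)
    return losses / n_noise_draws
```

Comparing these per-image losses (or the corresponding gradient norms) between low-ranked and high-ranked images at step 0 would test whether the metric actually predicts which images a fresh network can resolve.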
Referee: [§4.2] §4.2 (Ablations and Controls): The reversal experiment demonstrates that ordering direction matters, but does not test whether the specific semantic metric outperforms simpler non-semantic heuristics (e.g., image variance, edge density, or low-frequency content) that might produce similar non-uniform sampling. Without such a control, the reported speedups could be reproduced by any sampler favoring the same low-variance statistics, weakening the claim that semantic complexity is the operative factor.
Authors: We agree that the reversal control, while demonstrating the importance of ordering direction, does not fully isolate whether semantic aspects of the metric are necessary versus simpler statistical heuristics. In the revised manuscript we will add ablations that replace our complexity metric with non-semantic alternatives (image variance and edge density) to construct curricula, then compare their IS/FID trajectories and convergence speed against both Data Warmup and the uniform baseline on the same SiT setups. This will clarify the contribution of the semantic components. Revision planned: yes.
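The suggested non-semantic baselines are cheap to compute. A minimal sketch, with simple finite differences standing in for a full Sobel/Canny edge detector and an assumed threshold value:

```python
import numpy as np

def pixel_variance(image):
    """Per-image pixel variance over all channels: a purely statistical
    complexity proxy that ignores semantics entirely."""
    return float(image.var())

def edge_density(gray, threshold=0.1):
    """Fraction of pixels whose gradient magnitude exceeds a threshold.
    The threshold is an assumed value, not one from the paper."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    return float((mag > threshold).mean())
```

Either score would be dropped into the same temperature-controlled sampler in place of the semantic metric; if the resulting curricula match Data Warmup's IS/FID trajectories, the semantic components are not the operative factor.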
Circularity Check
No circularity: purely empirical offline curriculum with independent metric
Full rationale
The paper presents an empirical curriculum method that scores images offline using a fixed semantic complexity metric (foreground dominance plus typicality to prototypes) and applies a temperature-controlled sampler. There is no derivation chain in which a central quantity is defined in terms of a fitted parameter and later presented as a prediction, no load-bearing self-citation, and no uniqueness theorem or ansatz smuggled in via citation. The reversal experiment and reported IS/FID gains are external empirical tests, not reductions to the method's own inputs. The approach is evaluated against independent benchmarks and does not reduce any claimed result to its own construction.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Temperature schedule of the sampler (initial temperature and annealing rate).
Axioms (1)
- Domain assumption: foreground dominance and typicality together form a reliable proxy for the image complexity that a randomly initialized network can resolve early.
Invented entities (1)
- Semantic-aware complexity metric (no independent evidence).
Reference graph
Works this paper leans on
- [1] Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
- [2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
- [3] Hesen Chen, Junyan Wang, Zhiyu Tan, and Hao Li. SARA: Structural and adversarial representation alignment for training-efficient diffusion models. arXiv preprint arXiv:2503.08253, 2025.
- [4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- [5] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, 2017.
- [6] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
- [7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- [8] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [9] Hui Ji and Pengfei Zhou. Advancing PPG-based continuous blood pressure monitoring from a generative perspective. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, pages 661–674, 2024.
- [10] Hui Ji, Wei Gao, and Pengfei Zhou. Translation from wearable PPG to 12-lead ECG. arXiv preprint arXiv:2509.25480.
- [11] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
- [12] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning, pages 2525–.
- [13] Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-Match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pages 5464–5474. PMLR, 2021.
- [14] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010.
- [15] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.
- [16] Jinhong Lin, Cheng-En Wu, Huanran Li, Jifan Zhang, Yu Hen Hu, and Pedro Morgado. From prototypes to general distributions: An efficient curriculum for masked image modeling. arXiv preprint arXiv:2411.10685, 2024.
- [17] Deyuan Liu, Peng Sun, Xufeng Li, and Tao Lin. Efficient generative model training via embedded representation warmup. arXiv preprint arXiv:2504.10188, 2025.
- [18] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
- [19] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
- [20] Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630–15649. PMLR, 2022.
- [21] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–6960. PMLR, 2020.
- [22] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020.
- [23]
- [24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [25] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Póczos, and Tom M Mitchell. Competence-based curriculum learning for neural machine translation. In North American Chapter of the Association for Computational Linguistics, 2019.
- [26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
- [28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [29] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
- [30] Tom Schaul. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- [31] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [32] Pan Wang, Siwei Song, Hui Ji, Siqi Cao, Heng Yu, Zhijian Liu, Huanrui Yang, Yingyan (Celine) Lin, Beidi Chen, Mohit Bansal, Xiaoming Liu, Pengfei Zhou, Ming-Hsuan Yang, Tianlong Chen, and Jingtong Hu. From models to systems: A comprehensive survey of efficient multimodal learning. TechRxiv, 2026.
- [33] Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021.
- [34] ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, et al. FiTv2: Scalable and improved flexible vision transformer for diffusion model. arXiv preprint arXiv:2410.13925, 2024.
- [35] Takuto Yamamoto, Hirosato Akahoshi, and Shigeru Kitazawa. Emergence of human-like attention in self-supervised vision transformers: an eye-tracking study. arXiv preprint arXiv:2410.22768, 2024.
- [36] Jingfeng Yao, Cheng Wang, Wenyu Liu, and Xinggang Wang. FasterDiT: Towards faster diffusion transformers training without architecture modification. Advances in Neural Information Processing Systems, 37:56166–56189, 2024.
- [37] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.
- [38] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.