GRASP: Guided Residual Adapters with Sample-wise Partitioning

Bernhard Kainz; Felix Nützel; Mischa Dombrowski

arxiv: 2512.01675 · v2 · submitted 2025-12-01 · 💻 cs.CV

Pith reviewed 2026-05-17 02:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords: long-tail distribution, text-to-image generation, flow matching, residual adapters, synthetic data augmentation, medical image synthesis, class imbalance, gradient alignment

The pith

GRASP partitions conditioning space and adds group residual adapters to fix long-tail collapse in text-to-image flow matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-image flow matching transformers lose fidelity and diversity on rare classes because head and tail samples produce misaligned gradients during fine-tuning. GRASP counters this with a fixed partition of condition values that routes each sample to its own residual adapter module inside the transformer feedforward layers. The partition is deterministic, so tail samples are guaranteed to update dedicated parameters without altering the core flow-matching loss or the sampler. When the resulting synthetic images train a downstream DenseNet on NIH-CXR-LT, macro F1 matches real-data performance and nonzero scores appear on nine of thirteen classes instead of three. The same gains appear on ImageNet-LT, indicating the fix is not limited to medical data.
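A minimal sketch of what a fixed partition of condition values could look like, assuming the condition reduces to a single class label and that groups are formed by label frequency. The names `build_partition` and `route`, the frequency-rank rule, and the number of groups are illustrative assumptions; the summary above does not say how the paper constructs its partition.

```python
from collections import Counter


def build_partition(labels, num_groups=3):
    """Map each label to a group index by frequency rank (hypothetical rule).

    Group 0 holds the most frequent labels, the last group the rarest.
    The mapping is computed once from the training labels and then frozen.
    """
    counts = Counter(labels)
    # Most frequent first; ties are broken by label name so the result
    # is reproducible across runs.
    ranked = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: rank * num_groups // len(ranked) for rank, lab in enumerate(ranked)}


def route(label, partition):
    """Deterministic routing: a given label always maps to the same group."""
    return partition[label]


# Toy label counts, made up for illustration only.
train_labels = ["normal"] * 900 + ["effusion"] * 80 + ["hernia"] * 5
partition = build_partition(train_labels, num_groups=3)
print(partition)
```

Because the mapping is a plain lookup rather than a learned gate, a rare label cannot be routed away from its group during training, which is the guarantee the paragraph above describes.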

Core claim

In conditional flow matching each condition indexes its own family of probability paths, so a static partition along the conditioning variable supplies a structurally correct proxy for head-versus-tail gradient alignment. GRASP pairs this partition with group-specific residual adapters placed only in the feedforward layers; because assignment is deterministic, every tail sample trains its assigned expert. On MIMIC-CXR-LT this yields up to 80 percent lower FID and 44 percent higher tail-class coverage than full fine-tuning, learned-routing MoE, or minority guidance alone. GRASP synthetics used for classifier training on NIH-CXR-LT match the macro F1 obtained from real data and produce nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning.

What carries the argument

A deterministic partition of the conditioning space together with group-specific residual adapters inserted into the transformer feedforward layers.
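The mechanism can be sketched as a frozen feedforward block plus one residual adapter per group, with the group index supplied from outside by the fixed partition. This is an illustration under stated assumptions, not the authors' implementation: the low-rank down/up form, the zero initialisation, the layer sizes, and the class name `GroupResidualAdapters` are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class GroupResidualAdapters:
    """Frozen feedforward block with one low-rank residual adapter per group."""

    def __init__(self, dim, hidden, rank, num_groups, seed=0):
        rng = random.Random(seed)
        # Base feedforward weights, treated as frozen in this sketch.
        self.W1 = rand_matrix(rng, hidden, dim)
        self.W2 = rand_matrix(rng, dim, hidden)
        # Assumed adapter form: a down-projection followed by an up-projection.
        # The up-projection starts at zero, so every adapter is initially a no-op.
        self.down = [rand_matrix(rng, rank, dim) for _ in range(num_groups)]
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(num_groups)]

    def base_ffn(self, x):
        return matvec(self.W2, relu(matvec(self.W1, x)))

    def forward(self, x, group):
        # Only the adapter of the assigned group enters the computation, so
        # only that adapter would receive gradient from this sample.
        delta = matvec(self.up[group], matvec(self.down[group], x))
        return [b + d for b, d in zip(self.base_ffn(x), delta)]
```

In this sketch, changing the adapter of one group leaves the output for every other group untouched, which is the isolation property that deterministic routing is meant to provide.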

If this is right

GRASP synthetics train a DenseNet classifier on NIH-CXR-LT to the same macro F1 as real training data.
Nine of thirteen classes obtain nonzero F1 with GRASP synthetics versus only three with full fine-tuning.
Combining GRASP with self-guided minority sampling at inference produces the highest all-labels IRS observed on MIMIC-CXR-LT.
The same FID and coverage gains appear on ImageNet-LT, confirming the mechanism does not rely on medical-image structure.
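The two downstream quantities in the list above, macro F1 and the number of classes with nonzero F1, can be computed as in the following sketch. It assumes multi-label ground truth and predictions given as sets of class names per image; the function names and the toy data are illustrative and are not taken from the paper.

```python
def per_class_f1(y_true, y_pred, classes):
    """Per-class F1 from per-sample sets of true and predicted labels."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        denom = 2 * tp + fp + fn
        # A class absent from both truth and predictions scores 0 by convention.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean over classes: rare classes weigh as much as common ones."""
    return sum(scores.values()) / len(scores)


def nonzero_classes(scores):
    """Number of classes on which the classifier gets at least one true positive."""
    return sum(1 for s in scores.values() if s > 0)


# Toy example (made-up data): the classifier only ever predicts class "a".
classes = ["a", "b", "c"]
y_true = [{"a"}, {"a", "b"}, {"c"}, {"a"}]
y_pred = [{"a"}, {"a"}, {"a"}, {"a"}]
scores = per_class_f1(y_true, y_pred, classes)
print(macro_f1(scores), nonzero_classes(scores))
```

A classifier that ignores rare classes can still reach a respectable micro-averaged score, which is why the macro average and the nonzero-class count are the more telling numbers in a long-tail setting.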

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The static partition idea could be tested in diffusion models that also condition on class labels to see whether gradient alignment improves there as well.
If the partition proves too coarse for very fine-grained subclasses, a small number of learnable boundaries might be introduced while keeping the deterministic guarantee for the bulk of tail samples.
Medical imaging pipelines could reduce reliance on scarce real rare-disease scans by substituting GRASP synthetics for the tail portion of training sets.

Load-bearing premise

That a static deterministic partition of the conditioning space reliably proxies head-versus-tail gradient alignment and that the added adapters do not create new optimization problems for head classes.

What would settle it

If a DenseNet trained on GRASP-generated images fails to match real-data macro F1 on NIH-CXR-LT, or fails to obtain nonzero F1 on most of the thirteen classes, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2512.01675 by Bernhard Kainz, Felix Nützel, Mischa Dombrowski.

**Figure 2.** Overview of the GRASP architecture: (a) We want to minimize gradient conflicts during training by partitioning the samples into subsets with aligned gradient directions. (b) Based on this partitioning, we deterministically route samples to their designated expert, while keeping the base model frozen.

**Figure 3.** Composition of the partitioning based on labels (top)

**Figure 4.** Comparison of expert specialization/resampling impact.

Abstract

Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization suitable as gradient alignment proxy. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme longtail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRASP pairs a fixed conditioning-space partition with group-specific residual adapters to lift tail performance in flow-matching transformers, but the gradient-alignment explanation still lacks direct checks.

The central idea is a deterministic partition of the conditioning space plus group-specific residual adapters inserted into the transformer feedforward layers. This setup is meant to guarantee that tail samples update their own adapters while leaving the flow-matching objective and sampler unchanged. The authors argue that because condition values index distinct probability paths, the partition acts as a natural proxy for head-versus-tail gradient alignment.

They report up to 80% FID reduction and 44% better tail coverage on MIMIC-CXR-LT and NIH-CXR-LT relative to full fine-tuning, learned-routing MoE, and minority guidance. The downstream DenseNet experiment is the most concrete part: GRASP synthetics match real-data macro F1 on NIH-CXR-LT and produce nonzero F1 on nine of thirteen classes instead of three. Results on ImageNet-LT indicate the pattern is not limited to medical data. They also show the method composes with self-guided minority sampling at inference time for further gains. That combination and the downstream evaluation are the clearest practical contributions.

The main weakness is that the paper does not measure gradient alignment (cosine similarity or inner-product statistics) before and after the change, nor does it ablate the partition choice against simply adding adapter capacity. Without those checks it remains possible that the observed improvements come from extra parameters or the particular grouping rule rather than the claimed optimization fix. Error bars and statistical tests are also absent from the reported numbers.

The work is aimed at groups that generate synthetic medical images for rare conditions or that adapt flow-matching models to long-tail data. A reader already working on adapters or conditional generation would get usable implementation ideas and a clear baseline comparison. It is worth sending to peer review so referees can examine the full experimental details and ask for the missing gradient or ablation evidence.
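The missing gradient-alignment check mentioned in the letter could be as simple as the sketch below, which compares the mean gradient of head samples with the mean gradient of tail samples. It assumes per-sample gradients have already been flattened into plain lists of floats; how the paper would define or extract them is not stated, and the function names are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # undefined for a zero gradient; report 0 by convention
    return dot / (n1 * n2)


def mean_gradient(grads):
    """Average a list of per-sample gradient vectors coordinate-wise."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]


def head_tail_alignment(head_grads, tail_grads):
    """Cosine similarity between the mean head and mean tail gradients."""
    return cosine_similarity(mean_gradient(head_grads), mean_gradient(tail_grads))


# Toy vectors, made up for illustration: opposed gradients give -1.
print(head_tail_alignment([[1.0, 0.0], [1.0, 0.0]], [[-1.0, 0.0]]))
```

Values near 1 would indicate that head and tail updates point the same way, values near -1 that they conflict. Reporting this statistic on shared parameters before and after adding the adapters is one way to test the proxy assumption directly.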

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GRASP (Guided Residual Adapters with Sample-wise Partitioning) for text-to-image flow matching transformers in long-tail regimes. It attributes tail-class collapse to low head-versus-tail gradient alignment during fine-tuning and proposes a static deterministic partition of the conditioning space together with group-specific residual adapters inserted into the transformer feedforward layers. The method is presented as non-invasive, leaving the flow-matching objective and sampler unchanged. On MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT the authors report FID reductions of up to 80 % and tail-class coverage lifts of up to 44 % relative to full fine-tuning, learned-routing MoE and minority guidance; downstream DenseNet classification on NIH-CXR-LT synthetics is claimed to match real-data macro F1 and to yield nonzero F1 on 9 of 13 classes versus 3 from full fine-tuning.
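For reference, the Fréchet Inception Distance compares Gaussian fits to feature statistics of real images (mean $\mu_r$, covariance $\Sigma_r$) and generated images ($\mu_g$, $\Sigma_g$). The second expression is one plausible reading of "FID reduction of up to 80 %"; the summary does not state how the percentage is computed or which feature extractor is used, so treat both as assumptions.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right),
\qquad
\text{relative reduction} =
  \frac{\mathrm{FID}_{\text{baseline}} - \mathrm{FID}_{\text{GRASP}}}{\mathrm{FID}_{\text{baseline}}}
```

Under this reading, an 80 % reduction means $\mathrm{FID}_{\text{GRASP}} = 0.2\,\mathrm{FID}_{\text{baseline}}$ for the best-case configuration.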

Significance. If the reported gains are shown to be robust and the gradient-alignment mechanism is directly validated, the work would be a useful contribution to generative modeling under severe class imbalance, especially for medical imaging where rare conditions matter. The non-invasive, composable design and the extension beyond medical data are positive features. At present the absence of direct evidence for the central proxy assumption and the lack of statistical controls limit the strength of the conclusions.

major comments (2)

[Abstract / Method] Abstract and Method description: the claim that the deterministic conditioning-space partition serves as a structurally correct proxy for head-versus-tail gradient alignment is load-bearing for the motivation, yet no direct measurements (cosine similarity, inner-product statistics, or alignment curves before versus after GRASP) are reported. Without these the explanation that the partition 'guarantees tail updates' remains an untested assumption rather than an empirically supported mechanism.
[Experimental results] Experimental results on MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT: large quantitative gains are stated (FID reduction up to 80 %, coverage lift up to 44 %) but no error bars, number of runs, statistical tests, or ablations that isolate the partition choice from the added adapter capacity are provided. This directly affects confidence in whether the central improvements are robust or could be explained by extra parameters alone.

minor comments (1)

[Abstract] Abstract: the phrases 'up to 80 %' and 'up to 44 %' should be accompanied by the precise experimental configurations and baseline settings that achieve these maxima.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our design rationale while acknowledging areas where additional evidence and controls will strengthen the manuscript. We indicate planned revisions accordingly.

Referee: [Abstract / Method] Abstract and Method description: the claim that the deterministic conditioning-space partition serves as a structurally correct proxy for head-versus-tail gradient alignment is load-bearing for the motivation, yet no direct measurements (cosine similarity, inner-product statistics, or alignment curves before versus after GRASP) are reported. Without these the explanation that the partition 'guarantees tail updates' remains an untested assumption rather than an empirically supported mechanism.

Authors: We agree that direct measurements of gradient alignment would provide stronger empirical grounding for the central mechanism. The motivation rests on the structural property of conditional flow matching, in which condition values index distinct sets of probability paths; a static deterministic partition along the conditioning space is therefore the natural factorization that guarantees every tail sample updates its assigned adapter. While this argument is theoretical, we acknowledge the absence of explicit validation such as cosine similarities or alignment curves. In the revised manuscript we will add these measurements, reporting gradient statistics for head versus tail classes before and after GRASP to directly test the proxy assumption.
Revision: yes

Referee: [Experimental results] Experimental results on MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT: large quantitative gains are stated (FID reduction up to 80 %, coverage lift up to 44 %) but no error bars, number of runs, statistical tests, or ablations that isolate the partition choice from the added adapter capacity are provided. This directly affects confidence in whether the central improvements are robust or could be explained by extra parameters alone.

Authors: We recognize that the lack of error bars, multiple runs, statistical tests, and targeted ablations reduces confidence in the robustness of the reported gains. Although comparisons to learned-routing MoE already control for added capacity to some extent, we did not isolate the deterministic partition from the residual adapters themselves. In revision we will rerun key experiments with multiple seeds to report means and standard deviations, include appropriate statistical tests, and add an ablation that applies residual adapters without the sample-wise partitioning to quantify the contribution of the deterministic conditioning-space partition.
Revision: yes
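The multi-seed reporting promised in the rebuttal could follow the sketch below: a mean and sample standard deviation per method, plus a paired t statistic across seeds. It assumes run i of both methods shares seed i. The numbers are placeholders for illustration, not results from the paper, and the paired t statistic is one common choice rather than the test the authors commit to.

```python
import math
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def paired_t_statistic(method_scores, baseline_scores):
    """t statistic of per-seed differences (method minus baseline)."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences are constant; t statistic is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder macro F1 values for three seeds (not from the paper).
baseline = [0.30, 0.32, 0.34]
method = [0.32, 0.36, 0.37]
print(summarize_runs(method), paired_t_statistic(method, baseline))
```

The resulting statistic would be compared against a t distribution with n - 1 degrees of freedom. With only a handful of seeds the test has little power, so the per-seed differences are worth reporting alongside it.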

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against stated assumptions

The paper asserts that conditioning values index distinct probability paths in conditional flow matching and therefore a static partition serves as a structurally correct gradient-alignment proxy. This is an explicit modeling premise rather than a quantity fitted inside the experiment or a result that reduces to prior self-citations by construction. Reported gains in FID, tail coverage, and downstream macro F1 are measured against external baselines (full fine-tuning, learned-routing MoE, minority guidance) with no evidence that the performance numbers are forced by re-using the same fitted parameters or by a self-citation chain that itself lacks independent verification. The central claims therefore remain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that condition values index distinct probability paths in conditional flow matching and that a static partition therefore guarantees tail-sample updates. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: In conditional flow matching, condition values index distinct sets of probability paths, making partitioning along the conditioning the structurally correct factorization.
Directly stated in the abstract as justification for the sample-wise partition.
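The axiom can be written out in symbols. The block below is a sketch under assumptions: it uses the linear interpolation path, a common choice in flow matching, although the summary does not say which path the paper uses. The symbols $\pi$ (partition map), $G$ (number of groups), $\theta_0$ (frozen base weights), and $\phi_g$ (adapter parameters of group $g$) are introduced here for illustration.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\quad (x_1, c) \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\text{CFM}} =
  \mathbb{E}_{t,\,x_0,\,(x_1, c)}
  \left\lVert v_{\theta_0,\,\phi_{\pi(c)}}(x_t, t, c) - (x_1 - x_0) \right\rVert_2^2,
\qquad \pi : \mathcal{C} \to \{1, \dots, G\}

\nabla_{\phi_g}
  \left\lVert v_{\theta_0,\,\phi_{\pi(c)}}(x_t, t, c) - (x_1 - x_0) \right\rVert_2^2 = 0
\quad \text{for all } g \neq \pi(c)
```

The last line states the guarantee in gradient terms: a sample with condition $c$ contributes nothing to the adapters of other groups, so head samples cannot overwrite the parameters assigned to tail conditions.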

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Learnability Gap in Medical Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architec...


[1] [1]

Understanding hallucinations in diffusion models through mode interpolation

Sumukh K Aithal, Pratyush Maini, Zachary Chase Lipton, and J Zico Kolter. Understanding hallucinations in diffusion models through mode interpolation. InThe Thirty-eighth An- nual Conference on Neural Information Processing Systems,

work page

[2] [2]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. InAdvances in Neural Infor- mation Processing Systems, pages 8780–8794. Curran Asso- ciates, Inc., 2021. 2

work page 2021

[3] [3]

Image generation diversity issues and how to tame them

Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, and Bernhard Kainz. Image generation diversity issues and how to tame them. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3029–3039, 2025. 2, 5

work page 2025

[4] [4]

Once read is enough: Domain-specific pretraining-free language mod- els with cluster-guided sparse experts for long-tail domain knowledge

Fang Dong, Mengyi Chen, Jixian Zhou, Yubin Shi, Yix- uan Chen, Mingzhi Dong, Yujiang Wang, Dongsheng Li, Xiaochen Yang, Rui Zhu, Robert Dick, Qin Lv, Fan Yang, Tun Lu, Ning Gu, and Li Shang. Once read is enough: Domain-specific pretraining-free language mod- els with cluster-guided sparse experts for long-tail domain knowledge. InAdvances in Neural Inform...

work page 2024

[5] [5]

Long Tail Image Generation Through Feature Space Augmentation and Iterated Learning.arXiv preprint arXiv:2405.01705, 2024

Rafael Elberg, Denis Parra, and Mircea Petrache. Long Tail Image Generation Through Feature Space Augmentation and Iterated Learning.arXiv preprint arXiv:2405.01705, 2024. 2

work page arXiv 2024

[6] [6]

Routing matters in moe: Scaling diffusion transformers with explicit routing guidance.arXiv preprint arXiv:2510.24711, 2025

Yixuan Feng and and others. Routing Matters in MoE: Scal- ing Diffusion Transformers with Explicit Routing Guidance. arXiv preprint arXiv:2510.24711, 2025. 2

work page arXiv 2025

[7]

Emanuele Francazi, Marco Baity-Jesi, and Aurelien Lucchi. A theoretical analysis of the learning dynamics under class imbalance. In Proceedings of the 40th International Conference on Machine Learning, pages 10285–10322. PMLR,

[8]

Alireza Ganjdanesh, Yan Kang, Yuchen Liu, Richard Zhang, Zhe Lin, and Heng Huang. Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection. In European Conference on Computer Vision (ECCV), 2024. 2

[9]

Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023. 2, 7

[10]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Neural Information Processing Systems, 2017. 5

[11]

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 2

[12]

Gregory Holste, Song Wang, Ziyu Jiang, Thomas C. Shen, George Shih, Ronald M. Summers, Yifan Peng, and Zhangyang Wang. Long-tailed classification of thorax diseases on chest x-ray: A new benchmark study. In Data Augmentation, Labelling, and Imperfections, pages 22–32, Cham, 2022. Springer Nature Switzerland. 3, 4, 5

[13]

Yaxin Hou and Yuheng Jia. A square peg in a square hole: Meta-expert for long-tailed semi-supervised learning. In Forty-second International Conference on Machine Learning, 2025. 2, 3

[14]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 2, 4

[15]

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 5

[16]

Shaohan Huang and Furu Wei. Mixture of lora experts. In ICLR 2024, 2024. 2

[17]

Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathan R. Greenbaum, Matthew P. Lungren, Catherine Y. Deng, Roger G. Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019. 4

[18]

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991,

[19]

Yinghao Li, Vianne R. Gao, Chao Zhang, and MohamadAli Torkamani. Ensembles of low-rank expert adapters. In The Thirteenth International Conference on Learning Representations, 2025. 2

[20]

Yan Liang and others. Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model. arXiv preprint arXiv:2507.09052, 2025. 2

[21]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 4

[22]

Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5

[23]

Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10844–10853, 2024. 2

[24]

Ana Montenegro and others. Causal Disentanglement for Robust Long-tail Medical Image Generation. arXiv preprint arXiv:2504.14450, 2025. 2

[25]

Mashrur M. Morshed and Vishnu Boddeti. Diverseflow: Sample-efficient diverse mode coverage in flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23303–23312, 2025. 1, 2, 4

[26]

Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, pages 7176–7185. PMLR, 2020. 5

[27]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

[28]

Daehee Park and 5 other authors. Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

[29]

Sivaramakrishnan Rajaraman, Sha-E Yaacob, Subathra L. H, Sreejith S. G, and Sameer Antani. Addressing Class Imbalance with Latent Diffusion-based Data Augmentation for Improving Disease Classification in Pediatric Chest X-rays. bioMethods, 9(1), 2024. 2

[30]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 2

[31]

Vikash Sehwag, Caner Hazirbas, Albert Gordo, Firat Ozgenel, and Cristian Canton Ferrer. Generating High Fidelity Data from Low-density Regions using Diffusion Models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11482–11491, New Orleans, LA, USA, 2022. IEEE. 1, 2

[32]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. 2, 4

[33]

Soobin Um and Jong Chul Ye. Self-Guided Generation of Minority Samples Using Diffusion Models. In Computer Vision – ECCV 2024, pages 414–430, Cham, 2025. Springer Nature Switzerland. 1, 2

[34]

Soobin Um, Suhyeon Lee, and Jong Chul Ye. Don’t play favorites: Minority guidance for diffusion models. In The Twelfth International Conference on Learning Representations, 2024. 1, 2

[35]

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 5

[36]

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths. In Advances in Neural Information Processing Systems (NeurIPS), pages 52187–52207, 2023. 2

[37]

Hantao Zhang, Yuhe Liu, Jiancheng Yang, Shouhong Wan, Xinyuan Wang, Wei Peng, and Pascal Fua. LeFusion: Controllable Pathology Synthesis via Lesion-Focused Diffusion Models. In International Conference on Learning Representations (ICLR), 2025. 2

[38]

Tianjiao Zhang, Huangjie Zheng, Jiangchao Yao, Xiangfeng Wang, Mingyuan Zhou, Ya Zhang, and Yanfeng Wang. Long-tailed diffusion models with oriented calibration. In The Twelfth International Conference on Learning Representations, 2024. 2

[39]

Zhe Zhao, Haibin Wen, Zikang Wang, Pengkun Wang, Fanfu Wang, Song Lai, Qingfu Zhang, and Yang Wang. Breaking long-tailed learning bottlenecks: A controllable paradigm with hypernetwork-generated diverse experts. In Advances in Neural Information Processing Systems, pages 7493–[…]. Curran Associates, Inc., 2024. 2, 3

[40]

Youwei Zheng, Yuxi Ren, Xin Xia, Xuefeng Xiao, and Xiaohua Xie. Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 18661–18670, 2025. 2