GRASP: Guided Residual Adapters with Sample-wise Partitioning
Pith reviewed 2026-05-17 02:51 UTC · model grok-4.3
The pith
GRASP partitions conditioning space and adds group residual adapters to fix long-tail collapse in text-to-image flow matching.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In conditional flow matching each condition indexes its own family of probability paths, so a static partition along the conditioning variable supplies a structurally correct proxy for head-versus-tail gradient alignment. GRASP pairs this partition with group-specific residual adapters placed only in the feedforward layers; because assignment is deterministic, every tail sample trains its assigned expert. On MIMIC-CXR-LT this yields up to 80 percent lower FID and up to 44 percent higher tail-class coverage than full fine-tuning, learned-routing MoE, or minority guidance alone. GRASP synthetics used for classifier training on NIH-CXR-LT match the macro F1 obtained from real data and produce nonzero F1 on 9 of 13 classes, versus 3 of 13 from full fine-tuning.
What carries the argument
A deterministic partition of the conditioning space together with group-specific residual adapters inserted into the transformer feedforward layers.
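To make the mechanism concrete, here is a minimal pure-Python sketch of a deterministic label-to-group assignment and a feedforward block with one residual adapter per group. Everything in it is an assumption for illustration: the names `assign_group` and `GroupResidualFFN`, the two-group head/tail split, the low-rank adapter shape, and the zero-initialized up-projection (a LoRA-style choice) are not taken from the paper.

```python
import random

def assign_group(label, tail_labels):
    """Static partition (hypothetical rule): group 1 for tail labels, group 0 otherwise."""
    return 1 if label in tail_labels else 0

def matvec(W, x):
    """Multiply a row-major matrix W by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

class GroupResidualFFN:
    """Shared feedforward block plus one low-rank residual adapter per group."""

    def __init__(self, dim, hidden, rank, n_groups, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Shared FFN weights (would stay frozen or shared across groups).
        self.W1, self.W2 = rand(hidden, dim), rand(dim, hidden)
        # Per-group adapter: random down-projection, zero up-projection,
        # so every adapter starts as a no-op (assumed initialization).
        self.down = [rand(rank, dim) for _ in range(n_groups)]
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(n_groups)]

    def forward(self, x, group):
        base = matvec(self.W2, relu(matvec(self.W1, x)))
        # Only the adapter of the assigned group contributes to the output.
        delta = matvec(self.up[group], matvec(self.down[group], x))
        return [b + d for b, d in zip(base, delta)]
```

Because the group index is computed from the label rather than predicted by a router, a tail sample can never be sent to a head adapter, which is the property the review refers to as the deterministic guarantee.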
If this is right
- GRASP synthetics train a DenseNet classifier on NIH-CXR-LT to the same macro F1 as real training data.
- Nine of thirteen tail classes obtain nonzero F1 with GRASP synthetics versus only three with full fine-tuning.
- Combining GRASP with self-guided minority sampling at inference produces the highest all-labels IRS observed on MIMIC-CXR-LT.
- The same FID and coverage gains appear on ImageNet-LT, confirming the mechanism does not rely on medical-image structure.
Where Pith is reading between the lines
- The static partition idea could be tested in diffusion models that also condition on class labels to see whether gradient alignment improves there as well.
- If the partition proves too coarse for very fine-grained subclasses, a small number of learnable boundaries might be introduced while keeping the deterministic guarantee for the bulk of tail samples.
- Medical imaging pipelines could reduce reliance on scarce real rare-disease scans by substituting GRASP synthetics for the tail portion of training sets.
Load-bearing premise
That a static deterministic partition of the conditioning space reliably proxies head-versus-tail gradient alignment and that the added adapters do not create new optimization problems for head classes.
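The premise is stated in terms of gradient alignment, which is usually quantified as the cosine similarity between the average head-class gradient and the average tail-class gradient. The sketch below shows that computation on flattened gradient vectors. The function names are hypothetical, and how the paper itself would measure alignment is not specified in this summary.

```python
import math

def mean_gradient(per_sample_grads):
    """Average a list of equally sized, flattened per-sample gradient vectors."""
    n = len(per_sample_grads)
    return [sum(column) / n for column in zip(*per_sample_grads)]

def cosine_similarity(g_head, g_tail):
    """Cosine of the angle between two gradient vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(g_head, g_tail))
    norm_head = math.sqrt(sum(a * a for a in g_head))
    norm_tail = math.sqrt(sum(b * b for b in g_tail))
    if norm_head == 0.0 or norm_tail == 0.0:
        return 0.0
    return dot / (norm_head * norm_tail)
```

Values near 1 would indicate that head and tail updates point the same way, values near 0 that they are unrelated, and negative values that a step that helps head classes tends to hurt tail classes.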
What would settle it
If GRASP-generated images, when used to train a DenseNet on NIH-CXR-LT, fail to match real-data macro F1, or fail to produce nonzero F1 on most of the thirteen tail classes, the central claim would be falsified.
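Both quantities in this test are simple to compute from per-class confusion counts. The sketch below assumes each class is summarized as a `(tp, fp, fn)` tuple and treats a class with no predictions and no positives as F1 = 0, which is a convention chosen here rather than something stated in the paper. The function names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1_and_nonzero(counts):
    """Return (macro F1, number of classes with F1 > 0) for a list of (tp, fp, fn) tuples."""
    scores = [f1_from_counts(tp, fp, fn) for tp, fp, fn in counts]
    macro = sum(scores) / len(scores)
    nonzero = sum(1 for s in scores if s > 0.0)
    return macro, nonzero
```

Macro F1 weights every class equally, so a classifier that never predicts a tail class is penalized by a full zero for that class regardless of how rare the class is.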
Original abstract
Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization and is suitable as a gradient-alignment proxy. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme long-tail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GRASP (Guided Residual Adapters with Sample-wise Partitioning) for text-to-image flow matching transformers in long-tail regimes. It attributes tail-class collapse to low head-versus-tail gradient alignment during fine-tuning and proposes a static deterministic partition of the conditioning space together with group-specific residual adapters inserted into the transformer feedforward layers. The method is presented as non-invasive, leaving the flow-matching objective and sampler unchanged. On MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT the authors report FID reductions of up to 80 % and tail-class coverage lifts of up to 44 % relative to full fine-tuning, learned-routing MoE and minority guidance; downstream DenseNet classification on NIH-CXR-LT synthetics is claimed to match real-data macro F1 and to yield nonzero F1 on 9 of 13 classes versus 3 from full fine-tuning.
Significance. If the reported gains are shown to be robust and the gradient-alignment mechanism is directly validated, the work would be a useful contribution to generative modeling under severe class imbalance, especially for medical imaging where rare conditions matter. The non-invasive, composable design and the extension beyond medical data are positive features. At present the absence of direct evidence for the central proxy assumption and the lack of statistical controls limit the strength of the conclusions.
major comments (2)
- [Abstract / Method] Abstract and Method description: the claim that the deterministic conditioning-space partition serves as a structurally correct proxy for head-versus-tail gradient alignment is load-bearing for the motivation, yet no direct measurements (cosine similarity, inner-product statistics, or alignment curves before versus after GRASP) are reported. Without these the explanation that the partition 'guarantees tail updates' remains an untested assumption rather than an empirically supported mechanism.
- [Experimental results] Experimental results on MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT: large quantitative gains are stated (FID reduction up to 80 %, coverage lift up to 44 %) but no error bars, number of runs, statistical tests, or ablations that isolate the partition choice from the added adapter capacity are provided. This directly affects confidence in whether the central improvements are robust or could be explained by extra parameters alone.
minor comments (1)
- [Abstract] Abstract: the phrases 'up to 80 %' and 'up to 44 %' should be accompanied by the precise experimental configurations and baseline settings that achieve these maxima.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our design rationale while acknowledging areas where additional evidence and controls will strengthen the manuscript. We indicate planned revisions accordingly.
- Referee: [Abstract / Method] Abstract and Method description: the claim that the deterministic conditioning-space partition serves as a structurally correct proxy for head-versus-tail gradient alignment is load-bearing for the motivation, yet no direct measurements (cosine similarity, inner-product statistics, or alignment curves before versus after GRASP) are reported. Without these the explanation that the partition 'guarantees tail updates' remains an untested assumption rather than an empirically supported mechanism.
  Authors: We agree that direct measurements of gradient alignment would provide stronger empirical grounding for the central mechanism. The motivation rests on the structural property of conditional flow matching, in which condition values index distinct sets of probability paths; a static deterministic partition along the conditioning space is therefore the natural factorization that guarantees every tail sample updates its assigned adapter. While this argument is theoretical, we acknowledge the absence of explicit validation such as cosine similarities or alignment curves. In the revised manuscript we will add these measurements, reporting gradient statistics for head versus tail classes before and after GRASP to directly test the proxy assumption.
  Revision: yes
- Referee: [Experimental results] Experimental results on MIMIC-CXR-LT, NIH-CXR-LT and ImageNet-LT: large quantitative gains are stated (FID reduction up to 80 %, coverage lift up to 44 %) but no error bars, number of runs, statistical tests, or ablations that isolate the partition choice from the added adapter capacity are provided. This directly affects confidence in whether the central improvements are robust or could be explained by extra parameters alone.
  Authors: We recognize that the lack of error bars, multiple runs, statistical tests, and targeted ablations reduces confidence in the robustness of the reported gains. Although comparisons to learned-routing MoE already control for added capacity to some extent, we did not isolate the deterministic partition from the residual adapters themselves. In revision we will rerun key experiments with multiple seeds to report means and standard deviations, include appropriate statistical tests, and add an ablation that applies residual adapters without the sample-wise partitioning to quantify the contribution of the deterministic conditioning-space partition.
  Revision: yes
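The statistical controls requested in the second exchange amount to reporting a mean and standard deviation over seeds and comparing two methods with a test that tolerates unequal variances. The sketch below uses only the Python standard library and computes Welch's t statistic. The function names are hypothetical, and which test the authors would actually choose is not stated.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator) of per-seed metric values."""
    return statistics.mean(values), statistics.stdev(values)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error
```

Each input list would hold one metric value per seed (for example FID), with at least two seeds per method so that the sample variance is defined.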
Circularity Check
No significant circularity; derivation self-contained against stated assumptions
The paper asserts that conditioning values index distinct probability paths in conditional flow matching and therefore a static partition serves as a structurally correct gradient-alignment proxy. This is an explicit modeling premise rather than a quantity fitted inside the experiment or a result that reduces to prior self-citations by construction. Reported gains in FID, tail coverage, and downstream macro F1 are measured against external baselines (full fine-tuning, learned-routing MoE, minority guidance) with no evidence that the performance numbers are forced by re-using the same fitted parameters or by a self-citation chain that itself lacks independent verification. The central claims therefore remain non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: In conditional flow matching, condition values index distinct sets of probability paths, making partitioning along the conditioning the structurally correct factorization.
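The axiom can be written out in symbols. The block below is a hedged restatement that assumes the common linear-interpolation form of conditional flow matching; the paper's exact path and parametrization are not reproduced in this summary, and the symbols $g$, $K$, and $\phi_k$ (partition map, number of groups, adapter parameters of group $k$) are introduced here for illustration only.

```latex
% Linear interpolation path between noise x_0 and data x_1 (assumed form)
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% Conditional flow matching loss for a velocity field v_\theta conditioned on c
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,(x_1, c)} \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2

% With a static partition g(c) \in \{1, \dots, K\} and adapter parameters \phi_k,
% a sample with condition c is routed through \phi_{g(c)} only, so
\nabla_{\phi_k}\, \big\| v_{\theta, \phi_{g(c)}}(x_t, t, c) - (x_1 - x_0) \big\|^2 = 0 \quad \text{whenever } g(c) \neq k
```

Under these assumptions the last line is the structural guarantee in compact form: head-class samples contribute no gradient to a tail adapter, and every tail sample contributes gradient to its own adapter.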
Forward citations
Cited by 1 Pith paper
- The Learnability Gap in Medical Latent Diffusion
  Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architec...