The Learnability Gap in Medical Latent Diffusion

Bernhard Kainz; Felix Nützel; Mischa Dombrowski

arxiv: 2605.17087 · v1 · pith:CCELRLWTnew · submitted 2026-05-16 · 💻 cs.CV

Mischa Dombrowski, Felix Nützel, Bernhard Kainz

Pith reviewed 2026-05-20 15:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: learnability gap, latent diffusion, medical imaging, autoencoders, generative augmentation, chest radiography, computed tomography

The pith

Pretrained autoencoders encode medical classification features well in image space but structure their latent representations so classifiers struggle to learn from them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large-scale pretrained autoencoders for medical images achieve near-perfect reconstruction, preserving the visual details needed to tell classes apart such as normal versus abnormal chest X-rays. Yet when the same information is packed into the lower-dimensional latent codes that diffusion models use, standard classifiers achieve much lower accuracy than they do on the original images or their reconstructions. This learnability gap holds steady across five different autoencoder families, multiple medical datasets including CT and dermatoscopy, and even after fine-tuning the autoencoder on medical data. The authors introduce noise-conditioned latent classifiers using FiLM layers plus image-space distillation to both measure the gap and deliver fast, memory-efficient alternatives to full image models.

Core claim

Large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. This gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and medical-domain fine-tuning of the autoencoder does not close it.

What carries the argument

The learnability gap: the observed difference between high classifier accuracy on reconstructed images and low accuracy on the corresponding latent codes, despite faithful reconstruction.
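As a minimal sketch of how such a gap could be quantified, assume it is taken as the difference between a classifier's score on reconstructions and its score on the matching latents. The function names and the example numbers below are hypothetical and are not taken from the paper.

```python
def learnability_gap(recon_score: float, latent_score: float) -> float:
    """Difference between reconstruction-space and latent-space classifier scores.

    Both scores are assumed to use the same metric (e.g. accuracy or AUROC)
    on the same test split. A positive value means the latent classifier lags.
    """
    return recon_score - latent_score


def fidelity_loss(image_score: float, recon_score: float) -> float:
    """Score lost by passing images through the autoencoder at all."""
    return image_score - recon_score


# Hypothetical scores for illustration only; not results from the paper.
scores = {"image": 0.90, "reconstruction": 0.89, "latent": 0.75}

gap = learnability_gap(scores["reconstruction"], scores["latent"])
loss = fidelity_loss(scores["image"], scores["reconstruction"])
print(f"fidelity loss: {loss:.2f}, learnability gap: {gap:.2f}")
```

Reporting the two quantities separately keeps "information lost by the autoencoder" apart from "information present but hard to learn from", which is the distinction the claim rests on.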

If this is right

Generative augmentation with latent diffusion models will keep underperforming real data for class balancing until the latent structure itself is changed.
Autoencoder quality for medical use should be judged by how learnable the latents are, not only by reconstruction error or visual fidelity.
Noise-conditioned latent classifiers with FiLM layers provide both higher throughput diagnostics and a partial way to narrow the gap without full image-space computation.
Domain-specific fine-tuning alone cannot be relied on to make latent spaces suitable for downstream discriminative tasks.
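The FiLM conditioning mentioned in the list above can be sketched in plain Python. FiLM scales and shifts each feature channel with parameters predicted from a conditioning input; here that input is a noise level. The sinusoidal embedding, the layer sizes, and all function names are assumptions for illustration, not the authors' architecture.

```python
import math


def noise_embedding(sigma: float, dim: int = 4) -> list[float]:
    """Sinusoidal embedding of a scalar noise level (an assumed choice)."""
    half = dim // 2
    freqs = [2.0 ** k for k in range(half)]
    return [math.sin(f * sigma) for f in freqs] + [math.cos(f * sigma) for f in freqs]


def linear(x, weight, bias):
    """Matrix-vector product plus bias; weight is indexed [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: per-channel scale (gamma) and shift (beta) from the conditioning vector."""
    gamma = linear(emb, w_gamma, b_gamma)
    beta = linear(emb, w_beta, b_beta)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


channels, dim = 3, 4
zeros = [[0.0] * dim for _ in range(channels)]
features = [0.5, -1.0, 2.0]
emb = noise_embedding(sigma=0.3, dim=dim)

# With zero weights, gamma = 1 and beta = 0, so FiLM starts as the identity.
out = film(features, emb, zeros, [1.0] * channels, zeros, [0.0] * channels)
print(out)
```

Initialising the modulation as the identity is a common design choice: the classifier first behaves as if unconditioned and only learns noise-dependent behaviour where it helps.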

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural mismatch could limit latent diffusion in other data-scarce domains where class imbalance is common.
Training objectives that explicitly encourage discriminative structure inside the latent space might close the gap more effectively than fidelity-focused fine-tuning.
The reported throughput and memory gains suggest these latent classifiers could be practical for real-time medical image analysis pipelines once accuracy improves.

Load-bearing premise

That near-lossless reconstruction means all features needed for classification are present in the latent codes in a form that standard classifiers can readily access.

What would settle it

Demonstrating a classifier that reaches the same accuracy on latent codes as it does on the reconstructed images for any of the four medical classification benchmarks would falsify the claim of a persistent gap.

Figures

Figures reproduced from arXiv: 2605.17087 by Bernhard Kainz, Felix Nützel, Mischa Dombrowski.

**Figure 2.** PSNR vs. learnability gap. Each point is one AE

**Figure 3.** Reconstruction samples and pixel-wise absolute difference for long-tail
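For readers unfamiliar with the fidelity axis in Figure 2, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below computes it on toy data; the pixel values are hypothetical, and the paper's exact preprocessing and value range are not stated here.

```python
import math


def psnr(original, reconstruction, max_value=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" in [0, 1]; hypothetical values for illustration.
image = [0.0, 0.5, 0.5, 1.0]
recon = [0.1, 0.5, 0.5, 0.9]
print(f"{psnr(image, recon):.2f} dB")
```

A high PSNR only says that pixel values are close on average, which is why the page treats fidelity and learnability as separate questions.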

Original abstract

Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.
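The abstract does not spell out the image-space distillation objective. One common formulation, assumed here purely for illustration, is temperature-softened knowledge distillation, in which a latent-space student matches the class probabilities of an image-space teacher. All names and logits below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


# Hypothetical logits: the teacher sees the image, the student only the latent.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.2]
print(distillation_loss(teacher, student))
```

In practice this term is usually mixed with the ordinary cross-entropy on ground-truth labels; the mixing weight would be another assumption.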

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pretrained medical autoencoders reconstruct well but their latents resist easy classification even after fine-tuning.

The main thing to know is that this paper argues pretrained autoencoders in medical imaging do a good job reconstructing images but their latent spaces are structured in a way that makes classification difficult, and medical fine-tuning does not resolve it. They test this idea across five autoencoder families on four different medical datasets. The consistent performance drop when moving from image space to latent space supports their point. They also introduce noise-conditioned latent classifiers using FiLM layers and distillation, which run much faster and use less memory while acting as tools to check latent quality. This is useful because it moves the discussion beyond fidelity to how the latent representation itself affects downstream tasks like data augmentation for imbalanced classes. The breadth of experiments gives it some credibility. A potential issue is that the gap might come from the classifiers not being powerful enough rather than an inherent property of the latents. The paper shows results over various setups but does not compare against a more expressive latent model that could potentially recover the performance the decoder achieves. This leaves the exact nature of the gap open to some interpretation. The work targets researchers in medical generative modeling and those interested in latent space analysis for classification. It offers a framework for evaluating autoencoders that could be helpful in practice. I think it should go to peer review. The observation is practical and the experiments are reasonably broad, even if some controls could be tighter.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that large-scale pretrained autoencoders for medical imaging faithfully encode class-discriminative features (as shown by near-lossless reconstruction performance) yet structure their latent representations in ways that are intrinsically difficult for classifiers to learn from. This 'learnability gap' is reported to persist across five autoencoder families, four benchmarks (chest radiography, dermatoscopy, CT, echocardiography), initialization strategies, and hyperparameter choices, and is not closed by medical-domain fine-tuning of the autoencoder. The authors introduce noise-conditioned latent classifiers using FiLM layers and image-space distillation as diagnostic tools that partially narrow the gap while providing 64x throughput and 120x memory improvements.

Significance. If the gap is shown to be a property of latent geometry rather than decoder-dependent information recovery, the work would offer a useful evaluation framework for latent spaces in medical generative models and help explain performance shortfalls when using synthetic data for class-imbalanced training. The breadth of architectures and domains tested provides a solid empirical foundation for generality, and the efficiency-focused diagnostic classifiers are a practical contribution.

major comments (1)

[Abstract] The assertion that near-lossless reconstruction performance demonstrates faithful encoding of all discriminative features 'in a form that should be learnable' from the latent space is not yet adequately supported. Standard reconstruction losses can preserve global structure while attenuating or entangling low-amplitude task signals that the decoder later recovers nonlinearly; the reported latent-classifier drops could therefore reflect information loss or decoder dependence rather than an intrinsic learnability property of the latent geometry. The experiments across five families and hyperparameter sweeps do not include a control that trains an expressive latent-only model with access to the same information the decoder exploits.

minor comments (2)

[Abstract] The efficiency claims (64x throughput, 120x memory) should be accompanied by precise baseline definitions and measurement protocols in the main text or supplementary material to allow replication.
The newly introduced term 'learnability gap' would benefit from explicit positioning against related concepts in representation learning and latent-space analysis literature.
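The kind of measurement protocol the first minor comment asks for could look like the harness below: fixed warm-up runs, a fixed repeat count, and samples per second as the reported unit. The workload is a stand-in, and the function names and tensor sizes are hypothetical; nothing here reproduces the paper's 64x figure.

```python
import time


def measure_throughput(fn, batch, batch_size, warmup=3, repeats=10):
    """Return samples per second for calling fn(batch), after warm-up runs."""
    for _ in range(warmup):
        fn(batch)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(batch)
    elapsed = time.perf_counter() - start
    return repeats * batch_size / elapsed


# Stand-in workload (hypothetical): cost grows with the number of input values.
def fake_model(batch):
    return [sum(sample) for sample in batch]


batch_size = 8
image_batch = [[0.1] * (64 * 64) for _ in range(batch_size)]
latent_batch = [[0.1] * (8 * 8) for _ in range(batch_size)]

image_tp = measure_throughput(fake_model, image_batch, batch_size)
latent_tp = measure_throughput(fake_model, latent_batch, batch_size)
print(f"speed-up: {latent_tp / image_tp:.1f}x")
```

On a GPU the timer would also need device synchronisation before each clock read, and a memory comparison would need its own protocol (for example, peak allocation at a fixed batch size).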

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The major comment raises a valid point about the strength of evidence linking reconstruction fidelity to latent learnability, and we address it directly below while proposing targeted revisions.

Referee: [Abstract] The assertion that near-lossless reconstruction performance demonstrates faithful encoding of all discriminative features 'in a form that should be learnable' from the latent space is not yet adequately supported. Standard reconstruction losses can preserve global structure while attenuating or entangling low-amplitude task signals that the decoder later recovers nonlinearly; the reported latent-classifier drops could therefore reflect information loss or decoder dependence rather than an intrinsic learnability property of the latent geometry. The experiments across five families and hyperparameter sweeps do not include a control that trains an expressive latent-only model with access to the same information the decoder exploits.

Authors: We appreciate this observation and agree that reconstruction fidelity alone does not prove the latent space structures information in a form accessible to standard classifiers. Our multi-family experiments (VAE, VQ-VAE, KL-f8, etc.) and hyperparameter sweeps were intended to show the gap is not decoder-specific, but we acknowledge they fall short of the requested control. In revision we will add an experiment training a high-capacity latent-only model (a 6-layer transformer operating directly on latent codes) and compare its performance to both the original latent classifiers and the image-space baseline; preliminary runs indicate the gap remains. We will also revise the abstract to replace 'in a form that should be learnable' with 'yet remain difficult for standard classifiers to exploit' and add a limitations paragraph discussing possible decoder-dependent recovery. These changes constitute a partial revision: the core empirical findings and conclusions are unchanged, but the framing and supporting evidence are strengthened.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements of reconstruction vs. latent classification performance

The paper defines the 'learnability gap' directly from observed performance differences: near-lossless reconstruction on medical images versus lower accuracy when training classifiers on the corresponding latent codes. This identification rests on explicit experimental comparisons across five autoencoder families, four benchmarks, and multiple hyperparameter regimes rather than any closed mathematical derivation or fitted parameter that is then relabeled as a prediction. No equations reduce to prior outputs by construction, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The noise-conditioned latent classifiers and distillation methods are presented as diagnostic tools whose value is measured against the same empirical baselines, keeping the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that reconstruction fidelity implies encoded discriminative features, and introduces the learnability gap as a new conceptual entity without independent falsifiable predictions outside the reported experiments.

axioms (1)

Domain assumption: Near-lossless reconstruction indicates that discriminative features for classification are faithfully encoded in the latent space.
This premise is used to conclude that the difficulty arises from latent structure rather than missing information.

invented entities (1)

Learnability gap (no independent evidence)
purpose: Conceptual label for the discrepancy between reconstruction fidelity and downstream classifier performance in latent spaces.
Introduced to organize the empirical observations; no external validation or falsifiable prediction is provided beyond the paper's own benchmarks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features... yet their latent representations are structured in ways that are difficult for classifiers to learn from."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Reconstruction-space classifiers match image-space performance... latent-space classifiers suffer a substantial drop"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX (2025),https://bfl.ai/research/representation-comparison

work page 2025

[2] [2]

Advances in neural information process- ing systems32(2019)

Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information process- ing systems32(2019)

work page 2019

[3] [3]

In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018)

Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al.: Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomed- ical imaging (isbi), hosted by the international skin imaging collaboration (isic). In: 2018 IEEE 15th intern...

work page 2017

[4] [4]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

work page 2021

[5] [5]

arXiv preprint arXiv:2512.14421 (2025)

Dombrowski, M., Nützel, F., Kainz, B.: LCMem: A universal model for robust image memorization detection. arXiv preprint arXiv:2512.14421 (2025)

work page arXiv 2025

[6] [6]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Dombrowski, M., Zhang, W., Cechnicka, S., Reynaud, H., Kainz, B.: Image gen- eration diversity issues and how to tame them. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3029–3039 (2025)

work page 2025

[7] [7]

Falck, F., Pandeva, T., Zahirnia, K., Lawrence, R., Turner, R., Meeds, E., Zazo, J., Karmalkar, S.: A Fourier space perspective on diffusion models (2025)

work page 2025

[8] [8]

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Gui, M., Schusterbauer, J., Phan, T., Krause, F., Susskind, J., Bautista, M.A., Ommer, B.: Adapting self-supervised representations as a latent space for efficient generation. arXiv preprint arXiv:2510.14630 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

arXiv preprint arXiv:2403.17834 , year=

Hamamci, I.E., Er, S., Wang, C., Almas, F., Simsek, A.G., Esirgun, S.N., Do- gan, I., Durugol, O.F., Hou, B., Shit, S., et al.: Developing generalist foundation models from a multimodal dataset for 3d computed tomography. arXiv preprint arXiv:2403.17834 (2024)

work page arXiv 2024

[10] [10]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

work page 2016

[11] [11]

Advances in neural information processing systems30(2017) 10 M

Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017) 10 M. Dombrowski et al

work page 2017

[12] [12]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

work page 2020

[13] [13]

In: NeurIPS 2021 Work- shop on Deep Generative Models and Downstream Applications (2021),https: //openreview.net/forum?id=qw8AKxfYbI

Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021), https://openreview.net/forum?id=qw8AKxfYbI

[14]

Holste, G., Wang, S., Jiang, Z., Shen, T.C., Shih, G., Summers, R.M., Peng, Y., Wang, Z.: Long-tailed classification of thorax diseases on chest x-ray: A new benchmark study. In: MICCAI Workshop on Data Augmentation, Labelling, and Imperfections. pp. 22–32. Springer (2022)

[15]

Holste, G., Zhou, Y., Wang, S., Jaiswal, A., Lin, M., Zhuge, S., Yang, Y., Kim, D., Nguyen-Mau, T.H., Tran, M.T., et al.: Towards long-tailed, multi-label disease classification from chest x-ray: Overview of the cxr-lt challenge. Medical Image Analysis 97, 103224 (2024)

[16]

Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6(1), 317 (2019)

[17]

Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., Kalantidis, Y.: Decoupling representation and classifier for long-tailed recognition. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=r1gRTCVFvB

[18]

Karras, T., Aittala, M., Kynkäänniemi, T., Lehtinen, J., Aila, T., Laine, S.: Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems 37, 52996–53021 (2024)

[19]

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., Laine, S.: Analyzing and improving the training dynamics of diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24174–24184 (2024)

[20]

Kouzelis, T., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: EQ-VAE: Equivariance regularized latent space for improved generative image modeling. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=UWhW5YYLo6

[21]

Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18262–18272 (2025)

[22]

Lin, M., Holste, G., Wang, S., Zhou, Y., Wei, Y., Banerjee, I., Chen, P., Dai, T., Du, Y., Dvornek, N.C., et al.: Cxr-lt 2024: A miccai challenge on long-tailed, multi-label, and zero-shot disease classification from chest x-ray. Medical Image Analysis p. 103739 (2025)

[23]

Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)

[24]

Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)

[25]

McIntosh-Smith, S., Alam, S.R., Woods, C.: Isambard-ai: a leadership class supercomputer optimised specifically for artificial intelligence (2024), https://arxiv.org/abs/2410.11199

[26]

Moroianu, S.L., Bluethgen, C., Chambon, P., Cherti, M., Delbrouck, J.B., Paschali, M., Price, B., Gichoya, J., Jitsev, J., Langlotz, C.P., et al.: Improving performance, robustness, and fairness of radiographic ai models with finely-controllable synthetic data. arXiv preprint arXiv:2508.16783 (2025)

[27]

Nützel, F., Dombrowski, M., Kainz, B.: Grasp: Guided residual adapters with sample-wise partitioning. arXiv preprint arXiv:2512.01675 (2025)

[28]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

[29]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

[30]

Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

[31]

Qin, Y., Zheng, H., Yao, J., Zhou, M., Zhang, Y.: Class-balancing diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18434–18443 (2023)

[32]

Rieke, N., Hancox, J., Li, W., Milletari, F., Roth, H.R., Albarqouni, S., Bakas, S., Galtier, M.N., Landman, B.A., Maier-Hein, K., et al.: The future of digital health with federated learning. NPJ digital medicine 3(1), 119 (2020)

[33]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[34]

Skorokhodov, I., Girish, S., Hu, B., Menapace, W., Li, Y., Abdal, R., Tulyakov, S., Siarohin, A.: Improving the diffusability of autoencoders. arXiv preprint arXiv:2502.14831 (2025)

[35]

Tschandl, P., Rosendahl, C., Kittler, H.: The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5(1), 180161 (2018)

[36]

Varma, M., Kumar, A., Van der Sluijs, R., Ostmeier, S., Blankemeier, L., Chambon, P., Bluethgen, C., Prince, J., Langlotz, C., Chaudhari, A.: Medvae: Efficient automated interpretation of medical images with large-scale generalizable autoencoders. arXiv preprint arXiv:2502.14753 (2025)

[37]

Vega, D., Ceballos, H.V., Vera, J.S., Rodriguez, S., Perez, A., Castillo, A., Escobar, M., Londoño, D., Sarmiento, L.A., Castro, C.I., et al.: Cardium: Congenital anomaly recognition with diagnostic images and unified medical records. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1193–1202 (2025)

[38]

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025)

[39]

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think (2025)

[40]

Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders (2025), https://arxiv.org/abs/2510.11690

[41]

Zhou, Y., Xiao, Z., Yang, S., Pan, X.: Alias-free latent diffusion models: Improving fractional shift equivariance of diffusion latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 34–44 (June 2025)
