Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

· 2025 · cs.CV · arXiv 2510.14630

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

open full Pith review browse 4 citing papers arXiv PDF

abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

The Learnability Gap in Medical Latent Diffusion

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while scaling to text-to-image tasks.

citing papers explorer

Showing 4 of 4 citing papers.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning cs.LG · 2026-03-20 · unverdicted · none · ref 14 · internal anchor
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
The Learnability Gap in Medical Latent Diffusion cs.CV · 2026-05-16 · unverdicted · none · ref 8 · internal anchor
Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architectures and is not closed by domain fine-tuning.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation cs.CV · 2026-04-21 · unverdicted · none · ref 19 · internal anchor
Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while scaling to text-to-image tasks.

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer