Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Björn Ommer; Felix Krause; Johannes Schusterbauer; Josh Susskind; Miguel Angel Bautista; Ming Gui; Timy Phan

arxiv: 2510.14630 · v2 · submitted 2025-10-16 · 💻 cs.CV

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer

Pith reviewed 2026-05-18 06:09 UTC · model grok-4.3

keywords: RepTok · self-supervised representations · single token latent · flow matching · image generation · efficient generation · vision transformers · latent space adaptation

The pith

A single continuous token from fine-tuned self-supervised vision transformers acts as an efficient latent space for image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RepTok to represent each image as one continuous latent token drawn from a pre-trained self-supervised vision transformer. Only the semantic token embedding is fine-tuned while a cosine-similarity loss keeps the original smooth geometry intact; a separate decoder is then trained with flow matching to produce images from this token. This single-token choice removes the spatial redundancies of conventional two-dimensional latent grids and thereby lowers training compute. A sympathetic reader would care because the method shows that existing self-supervised encoders can be turned into compact, effective starting points for generative modeling with modest extra work.

Core claim

RepTok adapts a pre-trained self-supervised vision transformer by fine-tuning only its semantic token embedding and adding a cosine-similarity loss that preserves the favorable geometry of the original SSL space. The resulting single continuous token is paired with a generative decoder trained jointly under a standard flow matching objective. This formulation resolves spatial redundancies of 2D latent spaces, yields faithful image reconstruction, and supports competitive class-conditional generation on ImageNet as well as zero-shot text-to-image synthesis on MS-COCO under limited training budgets.
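One way to write down the objective described above is sketched here. The notation is an assumption made for illustration and is not taken from the paper: $x$ is an image, $\epsilon$ is Gaussian noise, $z$ is the adapted token, $z_{\mathrm{SSL}}$ is the token from the frozen pre-trained encoder, $v_\theta$ is the decoder's velocity prediction, and $\lambda$ is the regularizer weight. The linear interpolation path is one common flow matching parameterization; the source only says "standard flow matching objective", so the exact form used by the authors may differ.

```latex
\begin{aligned}
x_t &= (1-t)\,\epsilon + t\,x, \qquad t \sim \mathcal{U}[0,1],\quad \epsilon \sim \mathcal{N}(0, I),\\
\mathcal{L}_{\mathrm{FM}} &= \mathbb{E}_{x,\epsilon,t}\,\bigl\lVert v_\theta(x_t, t, z) - (x - \epsilon) \bigr\rVert_2^2,\\
\mathcal{L}_{\cos} &= 1 - \frac{\langle z, z_{\mathrm{SSL}}\rangle}{\lVert z\rVert_2\,\lVert z_{\mathrm{SSL}}\rVert_2},\\
\mathcal{L} &= \mathcal{L}_{\mathrm{FM}} + \lambda\,\mathcal{L}_{\cos}.
\end{aligned}
```

Under this reading, $\mathcal{L}_{\mathrm{FM}}$ pushes reconstruction-relevant detail into $z$, while $\mathcal{L}_{\cos}$ penalizes only the angle between $z$ and $z_{\mathrm{SSL}}$.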

What carries the argument

The Representation Tokenizer (RepTok), which adapts a single semantic token embedding from a self-supervised vision transformer into a compact latent representation for generation while regularizing it with cosine similarity to retain smooth geometry.

If this is right

Single-token latents eliminate the extra compute that comes from processing spatial grids in typical 2D representations.
Class-conditional ImageNet generation reaches competitive performance with markedly lower training budgets.
The same adapted token extends directly to text-to-image synthesis and achieves competitive zero-shot results on MS-COCO.
The preserved geometry of the SSL space supports stable training of the flow-matching decoder.
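As a back-of-the-envelope illustration of the first point in the list above, the sketch below compares sequence lengths for a 2D latent grid and a single token. The image size, downsampling factor, and patch size are hypothetical values chosen for the arithmetic; none of them come from the paper. The count covers only the sequence a latent-space generator would process, not the cost of the decoder that maps the token back to pixels.

```python
# Hypothetical sizes for illustration only; these are not the paper's settings.

def grid_tokens(image_size: int, downsample: int, patch: int = 1) -> int:
    """Number of latent tokens for a square 2D latent grid."""
    side = image_size // (downsample * patch)
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Token-pair interactions in one full self-attention layer (quadratic)."""
    return num_tokens * num_tokens


grid = grid_tokens(image_size=256, downsample=8, patch=2)  # 16 x 16 grid
single = 1  # one continuous token per image

print(grid, attention_pairs(grid))      # 256 65536
print(single, attention_pairs(single))  # 1 1
```

The quadratic term is what makes the gap large: under these assumed sizes the grid has 256 times more tokens but 65,536 times more attention pairs per layer.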

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The single-token design may extend naturally to video or 3D generation where grid-based latents become even more costly.
Other pre-trained SSL encoders beyond vision transformers could be adapted in the same minimal way for generation tasks.
The method suggests that the geometry learned by self-supervised models is already close to what generative models need, reducing the need for entirely new latent spaces.
Practitioners could test whether larger SSL backbones yield better generation quality at the same token budget.

Load-bearing premise

That fine-tuning only the semantic token embedding and adding a cosine-similarity loss is sufficient to supply low-level reconstruction details while keeping the original SSL geometry smooth enough for stable generation.

What would settle it

Train the decoder without the cosine-similarity loss during token adaptation and measure whether sample quality on ImageNet drops below the reported competitive FID or whether training becomes unstable.

Original abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RepTok adapts a single SSL token into a flow-matching latent with a cosine regularizer, but the geometry preservation may not be as robust as claimed for stable generation under tight budgets.

The punchline is that this paper takes a pre-trained self-supervised vision transformer token, adapts it with a simple fine-tune and cosine regularizer, and uses it as a single continuous latent for flow matching generation. It claims this cuts costs while staying competitive on ImageNet and extending to text-to-image.

What is new is the specific combination: only updating the semantic token embedding while keeping the rest of the SSL encoder fixed, then training the generative decoder jointly. This resolves the spatial redundancy in typical 2D latents and works under limited training budgets. The approach does well in showing a straightforward reuse of existing representations without inventing new architectures.

The soft spots center on the regularization. Cosine similarity keeps the direction aligned with the original SSL space but does not control magnitude or curvature, which could affect how well the flow model learns the path. The abstract highlights competitive results, yet without seeing the actual numbers, error bars, or full ablations, it is difficult to assess whether the gains are robust or depend on particular hyperparameter choices.

This work is for people in the efficient generative modeling community, especially those experimenting with pre-trained encoders to reduce compute. A reader interested in practical adaptations for flow or diffusion models would get value from the recipe. I recommend sending it for peer review. The core idea is solid enough to warrant referee feedback on the implementation and validation details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Representation Tokenizer (RepTok), a generative modeling framework that adapts a pre-trained self-supervised vision transformer into a single continuous latent token. Only the semantic token embedding is fine-tuned, regularized by a cosine-similarity loss to the original SSL representation, while a decoder is trained jointly under a standard flow-matching objective. The single-token 1D formulation is claimed to eliminate spatial redundancies of 2D latents, reduce training costs, and still deliver competitive class-conditional ImageNet generation as well as zero-shot text-to-image performance on MS-COCO under severely limited training budgets.

Significance. If the central efficiency and quality claims are substantiated with proper controls, the work would demonstrate that fine-tuned SSL representations can serve as compact, geometry-preserving latent spaces for flow-based generation. This could meaningfully lower the computational barrier for high-quality image synthesis and extend naturally to multimodal tasks, offering a lightweight alternative to conventional VAE or diffusion latent spaces.

major comments (2)

[Methods, adaptation procedure] The claim that fine-tuning only the semantic token embedding under a cosine-similarity regularizer simultaneously injects low-level reconstruction details and preserves a geometry suitable for stable flow matching is load-bearing for the efficiency and competitive-performance assertions. Cosine similarity constrains directional alignment but leaves magnitude and higher-order statistics unconstrained; the manuscript provides no ablation on the loss weight, no analysis of resulting latent curvature or Lipschitz constants, and no comparison of flow-matching training dynamics with versus without the regularizer. Without such evidence the reported FID and zero-shot results rest on an untested assumption rather than on demonstrated regularization properties.
[Results section, ImageNet and MS-COCO experiments] The abstract and summary assert competitive performance under extremely limited training budgets, yet the provided text contains no quantitative FID scores, error bars, dataset splits, or direct comparisons against baselines trained for the same number of iterations or FLOPs. These omissions prevent verification that the single-token formulation actually delivers the claimed efficiency-quality trade-off after standard controls.
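The first major comment rests on a basic property of cosine similarity: it is unchanged when either vector is rescaled by a positive constant. A minimal sketch of that property follows, using toy vectors whose values are hypothetical and stand in for an "original SSL token" and an "adapted token"; it says nothing about how token norms actually behave in the paper's training.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


# Toy values, chosen only for illustration.
z_ssl = [0.6, 0.8, 0.0]       # stand-in for the original SSL token
z_adapted = [0.5, 0.9, 0.1]   # stand-in for the adapted token
z_scaled = [10.0 * v for v in z_adapted]  # same direction, 10x the norm

cos_adapted = cosine_similarity(z_adapted, z_ssl)
cos_scaled = cosine_similarity(z_scaled, z_ssl)

# The regularizer sees the two tokens as equally well aligned ...
print(math.isclose(cos_adapted, cos_scaled))  # True
# ... even though their norms differ by a factor of about 10.
print(norm(z_scaled) / norm(z_adapted))
```

A loss of the form 1 - cos(z, z_ssl) therefore assigns the same penalty to both tokens, which is why the referee asks for separate evidence about magnitude and higher-order statistics.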

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one concrete quantitative result (e.g., FID or CLIP score) to support the repeated use of the word 'competitive'.
[Methods] Notation for the adapted token embedding and the cosine-similarity loss term should be introduced explicitly with an equation number in the methods section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and outlining the changes we will incorporate in the revised manuscript.

Referee: Methods, adaptation procedure: The claim that fine-tuning only the semantic token embedding under a cosine-similarity regularizer simultaneously injects low-level reconstruction details and preserves a geometry suitable for stable flow matching is load-bearing for the efficiency and competitive-performance assertions. Cosine similarity constrains directional alignment but leaves magnitude and higher-order statistics unconstrained; the manuscript provides no ablation on the loss weight, no analysis of resulting latent curvature or Lipschitz constants, and no comparison of flow-matching training dynamics with versus without the regularizer. Without such evidence the reported FID and zero-shot results rest on an untested assumption rather than on demonstrated regularization properties.

Authors: We appreciate the referee pointing out the need for stronger empirical support for the regularization's effects. Although the cosine similarity loss is designed to maintain the directional properties of the SSL latent space while allowing the embedding to incorporate reconstruction details, we acknowledge that ablations and analyses are missing. In the revision, we will add an ablation on the cosine similarity loss weight, report its effect on generation quality, and include comparisons of training dynamics (e.g., loss curves) with and without the regularizer. We will also attempt to analyze the latent space properties such as curvature where feasible.
Revision: yes
Referee: Results section (ImageNet and MS-COCO experiments): The abstract and summary assert competitive performance under extremely limited training budgets, yet the provided text contains no quantitative FID scores, error bars, dataset splits, or direct comparisons against baselines trained for the same number of iterations or FLOPs. These omissions prevent verification that the single-token formulation actually delivers the claimed efficiency-quality trade-off after standard controls.

Authors: We regret that the quantitative details were not sufficiently highlighted in the initial submission. The manuscript does contain FID scores and comparisons in the results section; however, to fully address this concern, we will revise the paper to prominently feature the quantitative FID scores with error bars from repeated experiments, explicitly state the dataset splits, and include additional experiments or tables showing direct comparisons to baselines under equivalent training iterations and FLOPs. This will substantiate the efficiency claims.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

The paper presents an empirical adaptation procedure: fine-tuning only the semantic token embedding of a pre-trained SSL encoder, adding a cosine-similarity regularizer to the original SSL token, and jointly training a flow-matching decoder. Performance claims rest on reported FID scores and zero-shot results under limited budgets, not on any equation that reduces the output to the input by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain. The cosine-similarity term is an explicit design choice motivated by geometry preservation rather than a hidden tautology. The method is therefore independent of its own fitted values and does not collapse to a renaming or self-referential fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete; the method implicitly relies on the pre-trained SSL encoder already containing useful semantic structure and on the cosine loss being sufficient to keep the adapted token inside a generation-friendly region.

free parameters (1)

cosine-similarity loss weight
The abstract states that a cosine-similarity loss is added to regularize the adapted token; its relative weight is a free parameter that must be chosen to balance reconstruction fidelity against geometry preservation.
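To make the role of this free parameter concrete, here is a minimal sketch of how a weight could enter a combined loss. Everything in it is an assumption for illustration: the function names, the toy vectors, the additive form `reconstruction + weight * regularizer`, and the candidate weights are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_regularizer(z, z_ssl):
    """1 - cos(z, z_ssl): 0 when directions match, 2 when they are opposite."""
    return 1.0 - cosine_similarity(z, z_ssl)


def mse(pred, target):
    """Mean squared error, standing in for the flow matching term."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def total_loss(pred, target, z, z_ssl, weight):
    """Assumed additive combination of the two terms."""
    return mse(pred, target) + weight * cosine_regularizer(z, z_ssl)


# Toy values, chosen only for illustration.
target_velocity = [0.5, -1.0, 0.25]
pred_velocity = [0.4, -0.8, 0.25]
z_ssl = [1.0, 0.0]  # stand-in for the frozen SSL token
z = [1.0, 1.0]      # stand-in for the adapted token (45 degrees away)

for weight in (0.0, 0.1, 1.0):  # hypothetical sweep values
    print(weight, total_loss(pred_velocity, target_velocity, z, z_ssl, weight))
```

With a weight of 0 the token is free to drift from the SSL direction; larger weights make angular drift dominate the loss. Where the useful range lies is exactly what a loss-weight ablation would have to establish.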

axioms (1)

domain assumption: The geometry of the original SSL latent space remains suitable for generation after limited fine-tuning of a single token embedding.
Invoked when the authors claim that the cosine regularizer ensures the latent space stays smooth and suitable for generation.

pith-pipeline@v0.9.0 · 5721 in / 1529 out tokens · 28802 ms · 2026-05-18T06:09:05.843700+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation.

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
cs.LG 2026-03 unverdicted novelty 7.0
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

The Learnability Gap in Medical Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0
Pretrained autoencoders in medical latent diffusion encode discriminative features well for reconstruction but structure their latent spaces in ways that hinder classifier learning, a gap that persists across architec...

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
cs.CV 2026-04 unverdicted novelty 6.0
Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while...