SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

Jianmin Bao; Longtao Jiang; Pengfei Wan; Xiaojun Chang; Xin Tao; Zhendong Wang; Zhihui Li

arxiv: 2605.18267 · v2 · pith:6F3VS7Q6new · submitted 2026-05-18 · 💻 cs.CV

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

Longtao Jiang , Jianmin Bao , Zhendong Wang , Xin Tao , Pengfei Wan , Zhihui Li , Xiaojun Chang This is my paper

Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords normalizing flowsimage generationsemantic representationexact likelihoodfeature compressionImageNet generation

0 comments

The pith

Compressing high-dimensional image features into a compact semantic space lets normalizing flows generate detailed images while retaining exact likelihood computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Normalizing flows have long offered exact likelihoods and invertible sampling yet lagged in large-scale image generation because they must learn one invertible map across the entire high-dimensional feature space. The paper demonstrates that a Semantic Representation Compressor can first reduce overcomplete visual features to a much smaller semantic space. Normalizing flows then operate only in that reduced space before a frozen decoder reconstructs the output image. This separation keeps the flow's theoretical advantages intact and produces competitive generation quality on ImageNet at 256 and 512 pixel resolutions.

Core claim

SRC-Flow inserts a Semantic Representation Compressor between a pre-trained representation encoder and the normalizing flow so that the flow learns its invertible transport only in the resulting low-dimensional semantic space; the original decoder then reconstructs high-fidelity images from flow-generated semantic codes.

What carries the argument

The Semantic Representation Compressor (SRC), which maps high-dimensional RAE features into a lower-dimensional semantic space while preserving reconstructibility through the frozen decoder.

If this is right

Exact likelihoods become available directly in the semantic space rather than in pixel space.
Sampling remains deterministic and invertible at the flow stage.
Generation quality among normalizing-flow methods improves on ImageNet 256 by 256 and 512 by 512 resolutions while classifier-free guidance remains usable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compression step could be tested with other invertible models that currently struggle with high-dimensional inputs.
Jointly training the compressor with the flow might further reduce the dimensionality needed for good performance.
Similar semantic bottlenecks could be inserted into other latent-variable generative models to ease their modeling load.

Load-bearing premise

High-dimensional visual features can be compressed into a low-dimensional semantic space without losing the information required for the frozen decoder to reconstruct high-fidelity images.

What would settle it

Generate images with the flow in the compressed space and measure whether reconstruction error or perceptual quality falls substantially below the reported levels when the same decoder is used on uncompressed features.

Figures

Figures reproduced from arXiv: 2605.18267 by Jianmin Bao, Longtao Jiang, Pengfei Wan, Xiaojun Chang, Xin Tao, Zhendong Wang, Zhihui Li.

**Figure 2.** Figure 2: Diffusion adapts through timestepdependent noise schedule shifts, while NFs learn a single fixed bijection over full representation space. Although the effective semantic information is compact, the ambient dimension Nn is large and overcomplete. For NFs, every modeled channel contributes to the likelihood objective and the logdeterminant, forcing the flow to learn an exact invertible transport over … view at source ↗

**Figure 3.** Figure 3: PCA of normalized RAE features. The first 32 [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Semantic Representation Compressor (SRC). The [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Overview of SRC-Flow. Stage 1 trains SRC with frozen RAE. Stage 2 trains a NF on compact semantic [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Class-conditional samples generated by SRC-Flow on ImageNet. The top row shows [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Reconstruction visualization across compact dimensions. The [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Effect of compact dimension d. Generation is best at d = 32, while reconstruction improves with larger d [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Noise regularization. σflow = 0.4 gives the best gFID, and the d = 32 SRC improves high-noise robustness [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

read the original abstract

Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compact high-dimensional RAE features into a low-dimensional semantic space before flow modeling and preserve reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet $256 \times 256$ and $512 \times 512$, SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07 under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Codes and models will be available at https://github.com/longtaojiang/SRC-Flow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to map high-dimensional RAE features into a compact low-dimensional semantic space for subsequent normalizing flow modeling. This is claimed to resolve the semantic-capacity mismatch that has limited NFs on large-scale images, enabling exact likelihood computation in the compressed space and deterministic invertible sampling. On ImageNet 256×256 and 512×512, the method reports state-of-the-art gFID scores of 1.65 and 2.07 among normalizing flow approaches under classifier-free guidance, while using constant noise regularization tailored to the fixed unconditional bijection.

Significance. If the empirical results and the lossless-compression assumption hold, the work would meaningfully advance normalizing flows toward competitiveness with diffusion models on high-resolution image generation. Retaining exact likelihoods and invertibility in a compact semantic space is a substantive technical contribution, and the constant-noise regularization represents a practical adaptation worth further exploration.

major comments (3)

[Abstract and §3.1] Abstract and §3.1: The central claim that flow samples decoded by the frozen RAE decoder achieve the reported gFID scores rests on the assumption that SRC compression incurs negligible information loss. No quantitative bound on reconstruction fidelity (e.g., PSNR or LPIPS between original RAE features and SRC-reconstructed features prior to flow modeling) is supplied, leaving open the possibility that discarded high-frequency details degrade final image quality.
[§5 Experiments] §5 Experiments: Strong gFID numbers are presented, yet the manuscript supplies no ablation studies isolating the SRC dimensionality, loss terms, or regularization strength, nor any statistical significance tests or multiple-run variance. Without these, it is impossible to attribute the gains specifically to the proposed compressor rather than unstated training choices or baseline differences.
[§4.2] §4.2: The constant noise regularization is introduced to accommodate the fixed unconditional bijection, but the text does not derive or verify that this modification preserves the exact likelihood property of the flow; a short proof or explicit likelihood expression under the regularized objective would strengthen the claim.

minor comments (2)

[Abstract] Abstract: The acronym 'gFID' is introduced without definition; clarify whether it denotes a guided variant of FID or another metric.
[Throughout] Throughout: Ensure first-use definitions for RAE, SRC, and NF; the current presentation assumes familiarity that may not hold for all readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to improve our manuscript. We address each major comment point by point below, and we will incorporate the suggested changes in the revised version.

read point-by-point responses

Referee: [Abstract and §3.1] Abstract and §3.1: The central claim that flow samples decoded by the frozen RAE decoder achieve the reported gFID scores rests on the assumption that SRC compression incurs negligible information loss. No quantitative bound on reconstruction fidelity (e.g., PSNR or LPIPS between original RAE features and SRC-reconstructed features prior to flow modeling) is supplied, leaving open the possibility that discarded high-frequency details degrade final image quality.

Authors: We appreciate the referee's point regarding the need for quantitative validation of the compression fidelity. Although the high gFID scores and visual quality of generated images suggest effective preservation of semantic information, we agree that explicit metrics would strengthen the claim. In the revised manuscript, we will report PSNR and LPIPS values between the original RAE features and the SRC-reconstructed features to provide a quantitative bound on any information loss. revision: yes
Referee: [§5 Experiments] §5 Experiments: Strong gFID numbers are presented, yet the manuscript supplies no ablation studies isolating the SRC dimensionality, loss terms, or regularization strength, nor any statistical significance tests or multiple-run variance. Without these, it is impossible to attribute the gains specifically to the proposed compressor rather than unstated training choices or baseline differences.

Authors: We thank the referee for this suggestion. To more rigorously demonstrate the contribution of the SRC, we will include additional ablation experiments in the revised version. These will vary the dimensionality of the semantic space, the weighting of loss terms, and the strength of the constant noise regularization. Furthermore, we will conduct multiple training runs with different random seeds and report mean gFID scores along with standard deviations to provide statistical context. revision: yes
Referee: [§4.2] §4.2: The constant noise regularization is introduced to accommodate the fixed unconditional bijection, but the text does not derive or verify that this modification preserves the exact likelihood property of the flow; a short proof or explicit likelihood expression under the regularized objective would strengthen the claim.

Authors: We agree that a formal justification is valuable. The constant noise regularization is designed such that it does not alter the bijective nature of the flow transformation. The likelihood computation remains exact via the change-of-variables formula, where the regularization affects the base distribution in a fixed manner. In the revision, we will add a brief derivation and the explicit expression for the log-likelihood under this regularized setup to confirm preservation of exact likelihoods. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on independent training and evaluation

full rationale

The paper's derivation introduces an SRC compressor trained to map RAE features to a lower-dimensional space, followed by standard normalizing-flow training in that space with a frozen decoder for reconstruction. Reported gFID scores on ImageNet are direct empirical measurements against external baselines, not quantities defined in terms of fitted parameters or prior self-citations within the same equations. No self-definitional loops, fitted-input predictions, or ansatz smuggling appear in the method description; the central claims remain falsifiable via the stated metrics and do not reduce to tautological redefinitions of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on the pre-existence of a high-quality RAE model whose decoder can be frozen without retraining and on the effectiveness of constant noise regularization matched to the unconditional flow bijection; neither is derived in the abstract.

axioms (1)

domain assumption RAE features contain semantic information that remains sufficient for high-quality reconstruction after compression to a low-dimensional space and subsequent flow modeling.
Invoked to justify moving the flow out of the ambient high-dimensional space.

invented entities (1)

Semantic Representation Compressor (SRC) no independent evidence
purpose: Compact high-dimensional RAE features into a low-dimensional semantic space suitable for normalizing-flow modeling.
New module introduced to resolve the semantic-capacity mismatch described in the abstract.

pith-pipeline@v0.9.0 · 5779 in / 1379 out tokens · 55610 ms · 2026-05-20T11:56:21.074770+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compact high-dimensional RAE features into a low-dimensional semantic space before flow modeling
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the first 32 principal components already explain 99.06% of the total variance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.