Asymmetric Flow Models

Gordon Wetzstein; Hansheng Chen; Jan Ackermann; Leonidas Guibas; Minseo Kim

arxiv: 2605.12964 · v2 · pith:5BJXHONLnew · submitted 2026-05-13 · 💻 cs.CV

Asymmetric Flow Models

Hansheng Chen , Jan Ackermann , Minseo Kim , Gordon Wetzstein , Leonidas Guibas This is my paper

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords asymmetric flow modelingflow-based generationlow-rank subspacevelocity parameterizationimage generationlatent to pixel finetuningImageNet FIDtext-to-image

0 comments

The pith

AsymFlow achieves 1.57 FID on ImageNet by predicting noise only in a low-rank subspace while recovering full-dimensional velocity analytically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-based generation in high dimensions requires predicting velocity from high-dimensional noise, even when the underlying data has strong low-rank structure. AsymFlow introduces a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace but keeps data prediction full-dimensional. From this split, the method analytically recovers the complete velocity without any changes to the network architecture, training, or sampling procedures. On ImageNet 256 by 256 the approach sets a new leading FID score and supplies the first practical path for finetuning pretrained latent flow models into full pixel-space generators. A reader would care because the technique turns an apparent structural property of natural images into measurable gains in quality and training efficiency.

Core claim

The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From the asymmetric prediction the full-dimensional velocity is recovered analytically. This yields a leading 1.57 FID on ImageNet 256 by 256, outperforming prior DiT- and JiT-style pixel diffusion models, and supplies the first route for seamless finetuning of latent flow models such as FLUX.2 klein 9B into pixel-space text-to-image models that surpass their latent bases on HPSv3, DPG-Bench, and GenEval.

What carries the argument

The rank-asymmetric velocity parameterization, which separates low-rank noise prediction from full-dimensional data prediction so that full velocity can be recovered analytically without architectural changes.

If this is right

On ImageNet 256 by 256, AsymFlow reaches 1.57 FID and outperforms prior pixel diffusion models by a large margin.
The method provides the first route for finetuning pretrained latent flow models into pixel-space generators by aligning the low-rank pixel subspace to the latent space.
The pixel AsymFlow model finetuned from FLUX.2 klein 9B sets a new state of the art for pixel-space text-to-image generation on HPSv3, DPG-Bench, and GenEval.
No modifications to network architecture, training schedule, or sampling procedure are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same low-rank asymmetry could be applied to video or 3D flow models where natural data also exhibits strong subspace structure.
Adaptive rank selection during training might further reduce compute while preserving the analytical recovery guarantee.
The approach implies that many existing latent models already encode useful low-rank pixel information that can be directly transferred rather than relearned.

Load-bearing premise

The data possesses strong low-rank structure that allows restricting noise prediction to a low-rank subspace without losing critical information needed for accurate full-dimensional velocity recovery.

What would settle it

Training an AsymFlow model on a dataset engineered to lack low-rank structure, such as independent Gaussian noise images, and observing that the recovered velocity produces no FID improvement or diverges from a symmetric baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.12964 by Gordon Wetzstein, Hansheng Chen, Jan Ackermann, Leonidas Guibas, Minseo Kim.

**Figure 1.** Figure 1: AsymFLUX.2 klein generations. AsymFlow finetunes FLUX.2 klein into a pixel-space flow model, producing highly realistic images with rich visual styles and fine detail. Abstract Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank… view at source ↗

**Figure 2.** Figure 2: AsymFlow parameterization and recovery. (a) AsymFlow changes the standard velocity target by keeping the data term full-dimensional while replacing the noise term with its low-rank projection P ϵ. (b) To recover the full-rank velocity, the low-rank component P uˆA is used directly, while the orthogonal component is converted using the x0-to-u relation in Eq. (1). 4.1 AsymFlow Parameterization Let A ∈ R D×r… view at source ↗

**Figure 3.** Figure 3: Orthogonal component view of AsymFlow. AsymFlow parameterization can be decomposed into a P u component in the low-rank subspace Im(P ) and an (I − P )x0 component in the orthogonal complement Im(I − P ). Varying the rank r yields a parameterization family whose endpoints recover full x0-prediction and full u-prediction. The decomposition reveals that AsymFlow behaves like u-prediction in the low-rank sub… view at source ↗

**Figure 4.** Figure 4: Latent-to-pixel initialization. The lifted low-rank pixel generation are semantically and structurally aligned with the decoded latent generation, leaving only a low-level gap to correct. Initialization property. The initialized lowrank pixel model predicts a target of the form P ϵ − x L 0 , so its gap to the AsymFlow target uA (Eq. (3)) is only the approximation gap x0 −x L 0 . Due to the trajectory cou… view at source ↗

**Figure 5.** Figure 5: Patch rank and PCA ablation. 160 epochs. 40 80 120 160 Epoch 10 20 30 40 50 60 FID AsymFlow (r=8) JiT (r=0) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison of T2I diffusion models. AsymFLUX.2 klein produces more realistic images with richer visual styles than prior models. More results are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation of AsymFLUX.2 klein finetuning. AsymFlow produces finer details than the DDT baseline. Variance reduction further improves details and texture but introduces excessive noise. The LPIPS perceptual correction suppresses this artifact while preserving the sharp appearance. on HPSv3, indicating a substantial improvement in human-aligned visual quality. Consequently, it outperforms the prior pixel mode… view at source ↗

**Figure 9.** Figure 9: Additional qualitative text-to-image comparisons (part A). [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Additional qualitative text-to-image comparisons (part B). [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

read the original abstract

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AsymFlow introduces a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while recovering full-dimensional velocity analytically, plus a practical finetuning path from latent to pixel flow models.

read the letter

The core idea is straightforward: predict data in full dimension but noise only in a low-rank subspace, then recover the complete velocity field without altering the network or the usual training and sampling loops. This is paired with a finetuning recipe that starts from a pretrained latent model like FLUX.2 klein and aligns the low-rank subspace to the latent space so that high-level semantics carry over and only low-level pixel details need adjustment. On ImageNet 256x256 the method reports 1.57 FID, which beats earlier DiT/JiT-style pixel models by a noticeable margin, and the finetuned pixel version improves over its latent base on HPSv3, DPG-Bench, and GenEval while looking more realistic in qualitative checks. Those two pieces—the asymmetric parameterization and the latent-to-pixel route—are the genuinely new elements relative to the cited flow and diffusion literature. The empirical numbers are the strongest part of what is shown. The main uncertainty is whether the analytical recovery is truly exact in practice. The recovery step is exact only if the velocity components that matter lie inside the chosen low-rank subspace; any orthogonal part gets lost or aliased, and natural image data on ImageNet is unlikely to be perfectly low-rank in that sense. The abstract gives no error bound or explicit subspace-selection rule, so it is hard to know how much bias is introduced or how sensitive results are to the subspace choice. If the full paper supplies a clean derivation plus ablations that test this, the concern shrinks; otherwise it remains the load-bearing assumption. This paper is aimed at people already working on scaling flow or diffusion models to pixel space and who have access to large latent checkpoints. A reader looking for concrete ways to reduce the cost of high-dimensional velocity prediction will find usable ideas here. The empirical results are strong enough that a serious editor should send it to peer review rather than desk-reject; the theoretical claim on exact recovery will need the most scrutiny during revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric parameterization for velocity fields in flow-based generative models. Noise prediction is restricted to a low-rank subspace while data prediction remains full-dimensional; an analytical step then recovers the full-dimensional velocity without altering network architecture, training, or sampling. On ImageNet 256×256 the method reports 1.57 FID, outperforming prior DiT/JiT-style pixel diffusion models, and demonstrates that finetuning a pretrained latent model (FLUX.2 klein 9B) into pixel space yields new state-of-the-art results on HPSv3, DPG-Bench, and GenEval.

Significance. If the analytical recovery step is exact and the low-rank subspace captures all velocity components needed for accurate generation, the approach would offer a practical route to efficient high-dimensional flow models and seamless latent-to-pixel transfer. The reported FID and benchmark gains would constitute a meaningful empirical advance for pixel-space text-to-image generation.

major comments (2)

[§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.
[§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.

minor comments (2)

[§3.1] Notation for the low-rank projection operator is introduced without an explicit definition or reference to its construction; a short appendix equation would improve reproducibility.
[Figure 3] Figure 3 (subspace visualization) lacks axis labels and a quantitative measure of captured variance; readers cannot assess how much of the velocity energy is retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional theoretical discussion and experimental controls.

read point-by-point responses

Referee: [§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.

Authors: We thank the referee for highlighting this important clarification. The derivation in §3.2 recovers the velocity exactly by solving the linear system that combines the full-dimensional data prediction with the low-rank noise prediction projected onto the chosen subspace; this step is algebraically exact under the asymmetric parameterization. We agree, however, that the manuscript would benefit from an explicit discussion of when the assumption holds for natural images. In the revised version we have added a paragraph in §3.2 that (i) describes the data-driven construction of the subspace via SVD on velocity fields estimated from a held-out ImageNet subset, (ii) reports that the average energy in the orthogonal complement is below 5 % for 256×256 images, and (iii) supplies a simple residual-norm bound on the reconstruction error. These additions make the completeness condition explicit without changing the method or results. revision: yes
Referee: [§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.

Authors: We agree that the current experimental section would be strengthened by explicit ablations. In the revised manuscript we have expanded §4.1 with three new analyses: (1) FID versus subspace rank (r = 32, 64, 128, 256, 512), showing that performance saturates at r = 128 and that the reported 1.57 FID is stable across nearby ranks; (2) a direct comparison of the data-driven SVD subspace against a random orthonormal basis of the same dimension, demonstrating a clear degradation (FID rises to 4.8) when the subspace is not aligned with the data; and (3) a control experiment that trains an otherwise identical full-rank model without the analytical recovery step, isolating the contribution of the asymmetric parameterization. These controls confirm that the gains are attributable to the method rather than post-hoc tuning of the subspace. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; analytical recovery follows directly from parameterization without reduction to inputs

full rationale

The paper defines an asymmetric parameterization (full-dimensional data prediction, low-rank noise prediction) and states that full-dimensional velocity is recovered analytically from these predictions via the underlying flow equations. This is an algebraic step presented as a direct consequence of the model definition rather than a fitted quantity or self-referential loop. No quoted equations reduce the recovered velocity to the low-rank subspace choice by construction, nor does any central claim rely on self-citation chains, uniqueness theorems imported from prior author work, or renaming of known results. The low-rank assumption is explicit but does not make the recovery tautological; the reported FID gains are empirical. This is the common case of a non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate specific free parameters or axioms; the low-rank subspace restriction appears to be the central modeling choice.

pith-pipeline@v0.9.0 · 5521 in / 1137 out tokens · 31370 ms · 2026-05-14T19:21:49.954041+00:00 · methodology

Asymmetric Flow Models

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)