One-step Latent-free Image Generation with Pixel Mean Flows

Hanhong Zhao; Kaiming He; Qiao Sun; Susie Lu; Tianhong Li; Xianbang Wang; Yiyang Lu; Zhengyang Geng; Zhicheng Jiang

arxiv: 2601.22158 · v3 · submitted 2026-01-29 · 💻 cs.CV

Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He

Pith reviewed 2026-05-16 09:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords image generation · one-step sampling · latent-free · flow matching · diffusion models · pixel space · MeanFlow · ImageNet

The pith

Pixel MeanFlow separates x-prediction on the image manifold from MeanFlow loss in velocity space to enable stable one-step latent-free generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes pixel MeanFlow (pMF) as a way to perform high-quality image generation in a single sampling step without any latent encoding or iterative refinement. It achieves this by setting the network's direct output target on a presumed low-dimensional image manifold using x-prediction, while computing the training loss separately through MeanFlow in velocity space. A simple transformation connects the two spaces so that the averaged velocity field can guide one-step sampling. Experiments show this produces 2.22 FID on ImageNet at 256x256 resolution and 2.48 FID at 512x512 resolution. The approach addresses the gap in existing diffusion and flow models that still require either multiple steps or latent representations.

Core claim

pMF formulates the network output space on the presumed low-dimensional image manifold via x-prediction while defining the loss via MeanFlow in the velocity space, linked by a simple transformation between the image manifold and the average velocity field, which together support stable one-step sampling without additional iterative refinement or latent encoding.

What carries the argument

The simple transformation between the presumed low-dimensional image manifold (x-prediction) and the average velocity field that carries the MeanFlow loss.
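
The transformation can be made concrete with a small sketch. The formula used here, x(z_t, r, t) ≜ z_t − t·u(z_t, r, t), is the one quoted from the paper further down this page. Everything else is an assumption for illustration: the interpolation convention z_t = (1 − t)·data + t·noise is borrowed from the MeanFlow literature, scalars stand in for image tensors, and the function names are hypothetical rather than taken from the authors' code.

```python
# Sketch only: scalars stand in for image tensors; names are hypothetical.

def x_from_u(z_t, t, u):
    """Map an average velocity u(z_t, r, t) to the image-space quantity x = z_t - t*u."""
    return z_t - t * u


def u_from_x(z_t, t, x):
    """Inverse map: recover the average velocity from an x-prediction (needs t > 0)."""
    if t <= 0:
        raise ValueError("t must be positive to invert the transformation")
    return (z_t - x) / t


# Assumed convention (not stated on this page):
# z_t = (1 - t) * data + t * noise, so t = 1 is pure noise and t = 0 is data.
data, noise = 0.25, -1.5
t = 1.0
z_t = (1 - t) * data + t * noise  # equals `noise` at t = 1

# For a single data/noise pair the path is a straight line, so the average
# velocity over [0, 1] is simply noise - data.
u = noise - data

x = x_from_u(z_t, t, u)           # image-space quantity at (r, t) = (0, 1)
u_back = u_from_x(z_t, t, x)      # round trip back to velocity space
```

In this toy setting the image-space quantity at (r, t) = (0, 1) coincides with the data point, and the round trip recovers the same average velocity. That is the sense in which a network could output x while a loss is still evaluated on u, though the paper's actual training objective is not reproduced here.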

If this is right

One-step sampling becomes viable for high-resolution image generation at 256x256 and 512x512 scales.
Latent encodings are no longer required to reach competitive FID scores on ImageNet.
Network targets and loss spaces can be formulated independently while still yielding coherent outputs.
The same separation principle can be applied to other flow-based or diffusion-based models.
Generation speed improves because no multi-step iteration or latent decoding is needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce inference latency in deployed systems by replacing multi-step sampling with a single forward pass.
If the transformation generalizes, similar one-step latent-free techniques might apply to video or 3D generation tasks.
Averaging velocity fields may replace the need for learned schedulers in flow matching pipelines.
Direct pixel-space prediction could simplify training pipelines by removing the need for separate autoencoders.

Load-bearing premise

A simple transformation exists between the low-dimensional image manifold and the average velocity field that enables stable one-step sampling without extra refinement.
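
A short derivation shows why such a map is at least plausible when r = 0. It assumes the standard MeanFlow definition of the average velocity as a time average of the instantaneous velocity v along the flow path; that definition comes from the MeanFlow literature and is not stated on this page. Only the second line is quoted from the paper.

```latex
% Assumption (MeanFlow convention, not stated on this page):
% u is the time average of the instantaneous velocity v along the flow path,
% with dz_\tau / d\tau = v(z_\tau, \tau).
\begin{align}
u(z_t, r, t) &\triangleq \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
  \quad\Longrightarrow\quad z_r = z_t - (t - r)\, u(z_t, r, t), \\
x(z_t, r, t) &\triangleq z_t - t\, u(z_t, r, t)
  \quad\text{(transformation quoted from the paper)}, \\
x(z_t, 0, t) &= z_t - (t - 0)\, u(z_t, 0, t) = z_0 .
\end{align}
```

Under these assumptions, x coincides with the flow endpoint z_0 whenever r = 0, which includes the one-step case (r, t) = (0, 1). For r > 0 the quantity x is better read as a reparameterization of u than as a point on the trajectory; whether it still lies near the image manifold is the premise the paper has to support.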

What would settle it

Measuring whether one-step sampling without the proposed transformation produces FID scores substantially above 2.22 on ImageNet 256x256 or fails to produce coherent images.

Original abstract

Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

pMF hits competitive one-step FID numbers in pure pixel space on ImageNet, but the unspecified transformation between manifold output and velocity loss leaves the mechanism hard to evaluate from the abstract alone.

The main takeaway is that this pixel MeanFlow method reaches 2.22 FID for one-step generation on ImageNet 256x256 and 2.48 at 512x512, all without latents or iterative sampling. That is a concrete data point in a regime where most approaches still need one compromise or the other. The distinct technical move is the separation they introduce: network output as x-prediction on the image manifold, loss defined through MeanFlow in velocity space, and a transformation connecting the two. It differs from the one-step diffusion setups in the cited literature by keeping the two spaces explicit rather than forcing everything into a single formulation.

The results side is where the paper earns credit. Delivering those FID scores while staying latent-free and single-step is not easy, and including both resolutions gives a clearer picture of how it scales. If the full experiments hold up with proper controls, this becomes a useful reference for anyone trying to simplify inference.

The soft spot is the transformation itself. The abstract labels it simple but supplies no formula, derivation, or check that the resulting velocity field stays integrable in one step or preserves the needed properties. Without that, it is difficult to separate the contribution of the claimed mechanism from other training choices. No error bars or targeted ablations on this piece are mentioned either, so the central claim rests on moderate evidence from the abstract.

This is for people working on fast diffusion and flow models who want to drop both multi-step sampling and latent encoders. Readers focused on practical deployment or flow-matching theory would find the benchmark worth examining. It deserves peer review because the numbers are strong enough to justify a full look at the math and experiments, even if revisions will likely be needed on the transformation details.

Referee Report

2 major / 1 minor

Summary. The paper proposes pixel MeanFlow (pMF) for one-step latent-free image generation by separating the network output (x-prediction on a presumed low-dimensional image manifold) from the loss (MeanFlow in velocity space) and introducing a simple transformation between them. It reports FID scores of 2.22 on ImageNet at 256x256 and 2.48 at 512x512, claiming to fill a gap in one-step latent-free regimes.

Significance. If the results and mechanism hold, this would advance efficient one-step generation without latents, with concrete benchmark performance that could simplify pipelines. The reported FIDs are competitive for the regime, but significance hinges on validating the transformation's role rather than unstated training choices.

major comments (2)

[Abstract] The central one-step claim depends on an unspecified 'simple transformation' between the x-prediction manifold and the average velocity field. No explicit formula, derivation, or analysis of required properties (e.g., one-step integrability or divergence) is given, so the FID results may not directly substantiate the proposed mechanism over alternative explanations.
[Experiments] FID scores of 2.22 and 2.48 are reported without error bars, ablation studies isolating the transformation, or direct comparisons to other one-step latent-free baselines, leaving the support for the 'key missing piece' claim only moderately grounded.

minor comments (1)

[Abstract] 'pixel MeanFlow (pMF)' is introduced without referencing prior MeanFlow work or defining the term before use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve clarity and empirical support.

Referee: [Abstract] The central one-step claim depends on an unspecified 'simple transformation' between the x-prediction manifold and the average velocity field. No explicit formula, derivation, or analysis of required properties (e.g., one-step integrability or divergence) is given, so the FID results may not directly substantiate the proposed mechanism over alternative explanations.

Authors: We agree that the abstract would be strengthened by an explicit reference to the transformation. The mapping is derived in Section 3.2 of the manuscript as a direct consequence of the MeanFlow velocity formulation applied to the x-prediction manifold; it takes the form of a linear shift that preserves the average velocity field while ensuring exact one-step integrability under the manifold assumption. In the revised version we will insert the explicit formula and a short sentence on its integrability property into the abstract.

revision: yes

Referee: [Experiments] FID scores of 2.22 and 2.48 are reported without error bars, ablation studies isolating the transformation, or direct comparisons to other one-step latent-free baselines, leaving the support for the 'key missing piece' claim only moderately grounded.

Authors: We accept that the current experimental section lacks these elements. We will add (i) error bars from at least three independent training runs, (ii) ablation tables that remove or replace the manifold-to-velocity transformation while keeping all other training choices fixed, and (iii) comparisons against the strongest published one-step latent-free baselines. These additions will be included in the revised manuscript.

revision: yes
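
The error bars promised in item (i) amount to a mean and sample standard deviation over independent runs. A minimal sketch of that aggregation follows; the function name is hypothetical and the input values are placeholders chosen for easy checking, not results from the paper.

```python
from statistics import mean, stdev


def summarize_fid(fid_per_run):
    """Return (mean, sample standard deviation) of FID over independent runs."""
    if len(fid_per_run) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    return mean(fid_per_run), stdev(fid_per_run)


# Placeholder values for illustration only; they are NOT results from the paper.
placeholder_runs = [1.0, 2.0, 3.0]
fid_mean, fid_std = summarize_fid(placeholder_runs)
print(f"FID = {fid_mean:.2f} ± {fid_std:.2f} over {len(placeholder_runs)} runs")
```

With only three runs the sample standard deviation is itself a noisy estimate, so it would be reasonable to report the individual run scores alongside the summary.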

Circularity Check

0 steps flagged

No significant circularity; transformation introduced as independent component

The paper's core mechanism separates network output (x-prediction on presumed manifold) from loss (MeanFlow in velocity space) and explicitly introduces a simple transformation to connect them, rather than deriving the transformation from self-citations, fitting parameters to the target outputs, or renaming known results. Reported FID scores (2.22 at 256x256, 2.48 at 512x512) are presented as empirical training outcomes on ImageNet, not as predictions forced by construction from the same data or prior author work. No load-bearing step reduces the claimed one-step sampling to a tautology or self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that image data occupy a low-dimensional manifold amenable to direct x-prediction and that a fixed map to an averaged velocity field suffices for one-step sampling; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Image data lie on a presumed low-dimensional manifold suitable for direct x-prediction
Invoked to justify setting the network target in image space rather than velocity space

invented entities (1)

pixel MeanFlow (pMF) · no independent evidence
purpose: Framework that separates manifold output from velocity loss for one-step generation
New named method introduced to achieve the stated one-step latent-free regime

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We introduce a simple transformation between the image manifold and the average velocity field... x(z_t, r, t) ≜ z_t − t·u(z_t, r, t)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: the network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Representation Fréchet Loss for Visual Generation
cs.CV 2026-04 unverdicted novelty 8.0
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...

Action-to-Action Flow Matching
cs.RO 2026-02 unverdicted novelty 7.0
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.

Registers Matter for Pixel-Space Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
cs.CV 2026-05 unverdicted novelty 6.0
FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows
cs.CV 2026-04 unverdicted novelty 6.0
Point-MF performs one-step point cloud reconstruction from single images by learning a mean velocity field in point space with a tailored Diffusion Transformer and a new auxiliary loss.

Drift Flow Matching
cs.LG 2026-05 unverdicted novelty 5.0
Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.

PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows
cs.CV 2026-05 unverdicted novelty 5.0
PixelFlowCast delivers high-fidelity precipitation nowcasts from radar sequences using a latent-free Pixel Mean Flows predictor guided by a deterministic coarse stage and KANCondNet features.

SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
cs.LG 2026-04 unverdicted novelty 5.0
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.

Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling
astro-ph.IM 2026-05 unverdicted novelty 4.0
One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.