FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
Pith reviewed 2026-05-08 13:19 UTC · model grok-4.3
The pith
FREPix improves pixel-space image generation by routing low- and high-frequency components along separate transport paths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at 256×256 and 2.38 FID at 512×512, with particularly strong behavior in the low-NFE regime.
What carries the argument
Frequency-heterogeneous flow matching that decomposes the image into low- and high-frequency components and assigns each its own transport path along with a factorized prediction network.
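The decomposition the claim rests on can be sketched concretely. Below is a minimal numpy sketch (the average-pool low-pass filter, the band names, and the shared straight-line interpolation are illustrative assumptions; the paper's actual filter and transport paths may differ):

```python
import numpy as np

def split_bands(x, k=4):
    # Low band: k x k block averages, upsampled back to full resolution;
    # high band: the residual. An illustrative stand-in for the paper's filter.
    h, w = x.shape
    low = x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    low = np.repeat(np.repeat(low, k, axis=0), k, axis=1)
    return low, x - low

def transport_paths(x, t, rng):
    # Give each band its own straight-line path from noise to data,
    # x_t = (1 - t) * eps + t * x_band, so the two bands can be
    # transported (and later weighted) independently.
    low, high = split_bands(x)
    eps_low = rng.standard_normal(x.shape)
    eps_high = rng.standard_normal(x.shape)
    return (1 - t) * eps_low + t * low, (1 - t) * eps_high + t * high

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
low, high = split_bands(img)
assert np.allclose(low + high, img)  # the split loses no information
```

At t = 1 each path lands exactly on its band, so summing the two endpoints reconstructs the image; this is the sense in which coarse-to-fine generation is built in by construction rather than left to training dynamics.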
If this is right
- Competitive FID scores are reached directly in pixel space at both 256 and 512 resolution on ImageNet.
- Results remain strong even when the number of function evaluations is kept small.
- Coarse-to-fine structure is enforced by design rather than emerging only from the training dynamics.
- The approach avoids the representation bottleneck that comes from using a variational autoencoder.
Where Pith is reading between the lines
- The same frequency split could be tested in other pixel-space generative methods such as standard diffusion to check for similar efficiency gains.
- Fixed low/high bands might be replaced by learned or adaptive frequency ranges in follow-up work.
- The factorized network structure might lend itself to separate control of coarse structure and fine detail during sampling.
Load-bearing premise
That explicitly separating low- and high-frequency components with dedicated transport paths and a factorized network produces the reported performance gains without hidden costs or implementation artifacts.
What would settle it
A standard flow-matching model without any frequency separation that reaches the same or better FID scores at low NFE on the identical ImageNet class-to-image task would falsify the benefit of the heterogeneous design.
Original abstract
Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at $256\times256$ and 2.38 FID at $512\times512$, with particularly strong behavior in the low-NFE regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FREPix, a frequency-heterogeneous flow matching framework for pixel-space image generation. It decomposes the generation process into separate low- and high-frequency components, assigns distinct transport paths to each, employs a factorized network for prediction, and uses a frequency-aware training objective. This makes coarse-to-fine generation an explicit design choice. On ImageNet class-conditional generation, it reports FID scores of 1.91 at 256×256 and 2.38 at 512×512, with particular strength in the low-NFE regime among pixel-space models.
Significance. If the reported FID numbers and low-NFE behavior are reproducible with proper ablations confirming the contribution of the frequency decomposition, this could meaningfully advance pixel-space generative modeling by avoiding VAE bottlenecks while explicitly leveraging frequency-specific dynamics. The emphasis on low-NFE efficiency has practical value for deployment.
Major comments (2)
- [§4.2, Table 2] §4.2 and Table 2: the claim of 'particularly strong behavior in the low-NFE regime' is supported only by aggregate FID curves; without per-frequency error breakdowns or ablation removing the separate transport paths, it is unclear whether the gains are due to the frequency-heterogeneous design or to other implementation choices such as the factorized network capacity.
- [§3.3, Eq. (8)] §3.3, Eq. (8): the frequency-aware objective is defined as a weighted sum of low- and high-frequency losses, but the weighting schedule and its interaction with the flow-matching velocity field are not derived from first principles; this leaves open whether the reported 1.91 FID is robust to alternative weightings or simply tuned for the ImageNet splits.
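The objective under discussion has the schematic form of a weighted sum of per-band flow-matching losses. A hedged sketch (the weight schedules below, a constant low-band weight and a linearly growing high-band weight, are invented placeholders, not Eq. (8)'s actual schedule):

```python
import numpy as np

def frequency_aware_loss(v_pred_low, v_pred_high, v_low, v_high, t,
                         w_low=lambda t: 1.0,
                         w_high=lambda t: 1.0 + t):
    # Weighted sum of per-band velocity-matching errors. The schedules
    # w_low and w_high are placeholder guesses standing in for Eq. (8).
    loss_low = np.mean((v_pred_low - v_low) ** 2)
    loss_high = np.mean((v_pred_high - v_high) ** 2)
    return w_low(t) * loss_low + w_high(t) * loss_high

v = np.zeros(16)
assert frequency_aware_loss(v, v, v, v, t=0.5) == 0.0  # perfect prediction
```

The referee's point is precisely that such schedules are free parameters: a sensitivity sweep over the high-band weight would show whether the reported FID depends on one tuned setting.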
Minor comments (2)
- [Figure 3, §4.1] Figure 3 caption and §4.1: the NFE axis labels and the exact definition of 'low-NFE' (e.g., <10 steps) should be stated explicitly to allow direct comparison with prior pixel-space flow-matching baselines.
- [§5] §5: the discussion of limitations mentions only computational cost but does not address potential artifacts from frequency decomposition at high resolutions (512×512), such as boundary effects between low- and high-frequency bands.
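On the minor comment about defining "low-NFE": NFE conventionally counts velocity-network evaluations, and a fixed-step Euler integrator uses exactly one evaluation per step. A minimal sketch (the step counts and the toy constant-velocity field are illustrative, not the paper's sampler):

```python
import numpy as np

def euler_sample(velocity_fn, x0, steps):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler.
    # One network call per step, so NFE == steps; "low-NFE" typically
    # means single-digit step counts.
    x, dt, nfe = x0, 1.0 / steps, 0
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
        nfe += 1
    return x, nfe

# With an exactly straight path the velocity is constant (target - noise),
# and even a single Euler step lands on the target; curvature in the path
# is what makes small step counts lossy.
target, noise = np.ones(4), np.zeros(4)
x1, nfe = euler_sample(lambda x, t: target - noise, noise, steps=1)
assert np.allclose(x1, target) and nfe == 1
```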
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [§4.2, Table 2] §4.2 and Table 2: the claim of 'particularly strong behavior in the low-NFE regime' is supported only by aggregate FID curves; without per-frequency error breakdowns or ablation removing the separate transport paths, it is unclear whether the gains are due to the frequency-heterogeneous design or to other implementation choices such as the factorized network capacity.
Authors: We agree that the current presentation relies on aggregate curves and that targeted ablations would provide stronger evidence. In the revised manuscript we will add per-frequency error breakdowns (separate low- and high-frequency reconstruction metrics) and an ablation that disables the separate transport paths while retaining the factorized network architecture. These additions will isolate the contribution of the frequency-heterogeneous design. We note that the factorized network is itself a direct consequence of the decomposition, so a fully orthogonal ablation is not feasible, but the requested experiments will clarify the source of the low-NFE gains. revision: yes
-
Referee: [§3.3, Eq. (8)] §3.3, Eq. (8): the frequency-aware objective is defined as a weighted sum of low- and high-frequency losses, but the weighting schedule and its interaction with the flow-matching velocity field are not derived from first principles; this leaves open whether the reported 1.91 FID is robust to alternative weightings or simply tuned for the ImageNet splits.
Authors: The weighting schedule is chosen empirically to compensate for the faster convergence of low-frequency components under flow matching. While we did not supply a first-principles derivation, we will include a sensitivity study in the appendix that reports FID scores across a range of alternative weighting schedules. This analysis will demonstrate robustness and will explicitly document the interaction between the weights and the velocity-field prediction. The reported 1.91 FID corresponds to the schedule described in the paper; the new experiments will show performance under nearby schedules. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces FREPix as an explicit design that decomposes pixel-space flow matching into separate low- and high-frequency transport paths, a factorized network, and a frequency-aware objective. These choices are motivated by the stated limitation of prior frequency-homogeneous pixel-space methods and are presented as independent architectural decisions rather than quantities derived from fitted parameters or prior self-citations. The reported ImageNet FID numbers (1.91 at 256×256, 2.38 at 512×512) and low-NFE behavior are framed as empirical outcomes of this construction, with no equations shown that reduce predictions to inputs by definition, no load-bearing self-citations, and no uniqueness theorems invoked to force the approach. The derivation remains self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021
2021
-
[2]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
2022
-
[3]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[4]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024
2024
-
[5]
Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025
2025
-
[6]
Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025
2025
-
[7]
Latent diffusion model without variational autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025
-
[8]
Learnings from scaling visual tokenizers for reconstruction and generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. In International Conference on Machine Learning, pages 22023–22043. PMLR, 2025
2025
-
[9]
On the spectral bias of neural networks
Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019
2019
-
[10]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022
2022
-
[11]
Relay diffusion: Unifying diffusion process across resolutions for image synthesis
Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. In The Twelfth International Conference on Learning Representations, 2024
2024
-
[12]
PixelDiT: Pixel Diffusion Transformers for Image Generation
Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. arXiv preprint arXiv:2511.20645, 2025
2025
-
[13]
Pixnerd: Pixel neural field diffusion
Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025
-
[14]
Pixelflow: Pixel-space generative models with flow
Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025
-
[15]
Statistics of natural image categories
Antonio Torralba and Aude Oliva. Statistics of natural image categories. Network: computation in neural systems, 14(3):391, 2003
2003
-
[16]
Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3435–3444, 2019
2019
-
[17]
Frequency principle: Fourier analysis sheds light on deep neural networks
Zhi-Qin John Xu. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020
2020
-
[18]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[19]
Sliced score matching: A scalable approach to density and score estimation
Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in artificial intelligence, pages 574–584. PMLR, 2020
2020
-
[20]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021
2021
-
[21]
Back to Basics: Let Denoising Generative Models Denoise
Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025
2025
-
[22]
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. Deco: Frequency-decoupled pixel diffusion for end-to-end image generation. arXiv preprint arXiv:2511.19365, 2025
2025
-
[23]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023
2023
-
[24]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023
2023
-
[25]
Stochastic interpolants: A unifying framework for flows and diffusions
Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025
2025
-
[26]
Car-flow: Condition-aware reparameterization aligns source and target for better flow matching
Chen Chen, Pengsheng Guo, Liangchen Song, Jiasen Lu, Rui Qian, Tsu-Jui Fu, Xinze Wang, Wei Liu, Yinfei Yang, and Alex Schwing. Car-flow: Condition-aware reparameterization aligns source and target for better flow matching. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[27]
Mean flows for one-step generative modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[28]
One-step Latent-free Image Generation with Pixel Mean Flows
Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158, 2026
2026
-
[29]
Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion
Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18062–18071, 2025
2025
-
[30]
Wavelet diffusion models are fast and scalable image generators
Hao Phung, Quan Dao, and Anh Tran. Wavelet diffusion models are fast and scalable image generators. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10199–10208, 2023
2023
-
[31]
On estimation of the wavelet variance
Donald P Percival. On estimation of the wavelet variance. Biometrika, 82(3):619–631, 1995
1995
-
[32]
Empirical processes: theory and applications
David Pollard. Empirical processes: theory and applications. 1990
1990
-
[33]
Representation alignment for generation: Training diffusion transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations, 2024
2024
-
[34]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
2018
-
[35]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017
2017
-
[36]
Improved techniques for training gans
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016
2016
-
[37]
Improved precision and recall metric for assessing generative models
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019
2019
-
[38]
Haar wavelets
Ülo Lepik and Helle Hein. Haar wavelets. In Haar wavelets: with applications, pages 7–20. Springer, 2014
2014
-
[39]
Jetformer: An autoregressive generative model of raw images and text
Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. In The Thirteenth International Conference on Learning Representations, 2024
2024
-
[40]
Fractal generative models
Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. Transactions on Machine Learning Research, 2025
2025
-
[41]
Scalable adaptive computation for iterative generation
Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In International Conference on Machine Learning, pages 14569–14589. PMLR, 2023
2023
-
[42]
Understanding diffusion objectives as the elbo with simple data augmentation
Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36:65484–65516, 2023
2023
-
[43]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022
2022
-
[44]
The sizes of compact subsets of hilbert space and continuity of gaussian processes
Richard M Dudley. The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967
1967
-
[45]
Universal donsker classes and metric entropy
RM Dudley. Universal donsker classes and metric entropy. The Annals of Probability, 15(4):1306–1326, 1987
1987
-
[46]
Foundations of machine learning
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018
2018
-
[47]
Dinov2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 2024
2024
-
[48]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014
2014
-
[49]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018
2018
-
[50]
Applying guidance in a limited interval improves sample and distribution quality in diffusion models
Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024
2024
-
[51]
Simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023
2023
Broader impact (excerpt from the paper): This work studies pixel-space image generation and proposes a frequency-heterogeneous formulation of flow matching. By making the roles of low- and high-frequency components explicit in...