Recognition: 3 theorem links · Lean Theorem
Back to Basics: Let Denoising Generative Models Denoise
Pith reviewed 2026-05-11 22:10 UTC · model grok-4.3
The pith
Predicting clean images directly with simple Transformers on raw pixels produces competitive generative models for ImageNet at 256 and 512 resolutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Directly predicting the clean data from noised inputs, rather than predicting noise or a noised quantity, lets simple large-patch Transformers operate effectively as generative models on raw pixels. These JiT networks require no tokenizer, no pre-training, and no auxiliary loss, yet they produce competitive samples on ImageNet at 256 and 512 resolution, where high-dimensional noise prediction tends to fail.
What carries the argument
JiT, or Just image Transformers: large-patch Transformers applied directly to pixels, trained to predict clean data from noised inputs by exploiting the manifold structure of natural images.
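As a concrete anchor for what "large-patch Transformers on raw pixels" means, here is a minimal tokenization sketch; the function name and shapes are illustrative assumptions, with p = 16 or 32 matching the patch sizes the paper reports.

```python
import torch

def patchify(x: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split raw pixels into non-overlapping p x p patch tokens.

    x: (b, c, h, w) image batch with h, w divisible by p.
    Returns (b, num_tokens, c * p * p), ready for a plain Transformer;
    no tokenizer network is involved, only a reshape of pixels.
    """
    b, c, h, w = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)                      # (b, c, h/p, w/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)  # flatten patches to tokens
    return x

# At 256x256 with p=16 this yields 256 tokens of dimension 768; the
# model's output tokens would be reshaped back into a clean image.
```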
If this is right
- Networks with limited capacity can still generate high-resolution images when trained to recover points on the data manifold.
- Generative performance remains competitive without tokenizers or pre-training when the prediction target is the clean image.
- Large patch sizes of 16 and 32 become viable for Transformer-based diffusion on raw pixels.
- A self-contained training paradigm for diffusion models on natural images is possible without auxiliary components.
- Direct clean-image prediction avoids catastrophic failure modes observed when predicting high-dimensional noised quantities.
Where Pith is reading between the lines
- The same direct-prediction strategy could reduce architectural complexity in generative models for other high-dimensional data such as audio or video.
- Training dynamics might change when the network is explicitly encouraged to map back onto the manifold rather than into the ambient noise space.
- Model-size requirements for high-resolution generation could be re-examined under the clean-prediction objective.
- Classical signal-processing denoising ideas may map more directly onto modern diffusion training once the target is restored to clean data.
Load-bearing premise
Natural data lies on a low-dimensional manifold while noised data does not.
What would settle it
A clean-data-predicting large-patch Transformer that produces visibly worse or incoherent samples than a noise-predicting baseline at 512 resolution on ImageNet would falsify the central claim.
read the original abstract
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard denoising diffusion models predict noise or noised quantities rather than clean data, and that directly predicting clean images is fundamentally different because natural data lies on a low-dimensional manifold while noised quantities do not. This allows simple, under-capacity networks to operate in high-dimensional pixel space. The authors introduce JiT (Just image Transformers): large-patch pixel Transformers trained with no tokenizer, no pre-training, and no extra loss, and report competitive ImageNet results at 256 and 512 resolutions with patch sizes 16 and 32, where noise-prediction baselines fail catastrophically.
Significance. If the results hold, the work demonstrates that a back-to-basics clean-data prediction target can enable competitive generative performance with minimal architectural complexity on raw pixels. This provides an empirical existence proof for simple large-patch Transformers as generative models and highlights the modeling choice of prediction target as potentially more important than tokenization or pre-training in high-dimensional settings.
major comments (2)
- [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablations on manifold properties, or comparisons of effective dimensionality at training noise levels are provided to ground this premise; a sketch of one such estimator follows this list.
- [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID and other metrics cannot be confidently attributed to the manifold-based rationale.
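To make the first objection concrete, here is a minimal sketch of the Levina-Bickel MLE intrinsic-dimension estimator the referee alludes to. The function name and the idea of comparing clean images against noised interpolants are illustrative assumptions, not the paper's procedure.

```python
import torch

def mle_intrinsic_dim(x: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Levina-Bickel MLE estimate of intrinsic dimension (a sketch).

    x: (n, d) batch of flattened images (or noised interpolants x_t).
    Returns the mean local dimension estimate over the n points.
    """
    dists = torch.cdist(x, x)                      # (n, n) pairwise Euclidean distances
    knn = dists.topk(k + 1, largest=False).values  # self-distance 0 sits at index 0
    knn = knn[:, 1:]                               # keep the k true nearest neighbors
    # m_hat(x_i) = [ (1/(k-1)) * sum_{j<k} log(T_k / T_j) ]^{-1}
    log_ratios = torch.log(knn[:, -1:] / knn[:, :-1])
    return ((k - 1) / log_ratios.sum(dim=1)).mean()
```

Under the paper's premise, the estimate for clean images should sit well below that for noised samples at moderate noise levels; reporting both would directly ground the manifold claim.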
minor comments (2)
- [§2] The precise mathematical formulation of the clean-data prediction objective (e.g., the training loss and how it differs from standard noise-prediction diffusion) should be stated explicitly with an equation for reproducibility; a hedged reconstruction follows this list.
- [Tables and figures] Ensure quantitative tables report both patch size and resolution explicitly, and include error bars or multiple seeds for the ImageNet 256/512 results to allow direct comparison with noise-prediction baselines.
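For reference, one common flow-matching-style parameterization makes the contrast explicit. This is a hedged reconstruction in our notation, not necessarily the paper's exact schedule or loss weighting:

```latex
% Linear interpolant between a clean image x and Gaussian noise:
x_t = (1 - t)\,x + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\quad t \in [0, 1]
% Clean-data (x-)prediction objective:
\mathcal{L}_{x}(\theta) = \mathbb{E}_{x,\epsilon,t}\,\bigl\| f_\theta(x_t, t) - x \bigr\|_2^2
% Standard noise-prediction objective, for contrast:
\mathcal{L}_{\epsilon}(\theta) = \mathbb{E}_{x,\epsilon,t}\,\bigl\| f_\theta(x_t, t) - \epsilon \bigr\|_2^2
```

Given x_t, either target determines the other affinely, yet the first regresses onto the (assumed) low-dimensional data manifold while the second regresses into the ambient noise space, which is exactly the distinction the paper leans on.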
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablations on manifold properties, or comparisons of effective dimensionality at training noise levels are provided to ground this premise.
Authors: We appreciate this observation. The manifold hypothesis for natural images is a standard assumption in the field, with substantial supporting evidence from prior studies on the low-dimensional structure of image data. Our work builds on this by demonstrating that direct prediction of clean data enables effective modeling in high-dimensional pixel space with simple architectures, in contrast to noise prediction. While we do not provide new intrinsic dimension calculations, the empirical results—particularly the failure of noise prediction at large patch sizes—serve as indirect validation. In the revised manuscript, we will expand the discussion in Section 1 to include references to key literature on image manifolds and clarify the role of this assumption. revision: partial
-
Referee: [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID and other metrics cannot be confidently attributed to the manifold-based rationale.
Authors: We agree that careful isolation of variables strengthens the argument. Our experiments compare clean-data prediction (JiT) against noise-prediction baselines using the exact same Transformer architecture, patch sizes, and training protocol on raw pixels, with the only difference being the prediction target. This setup controls for network capacity and largely for optimization dynamics, as the training procedure is identical. The loss geometry is inherently tied to the choice of target, which is the central modeling decision under investigation. We believe this provides sufficient evidence for the importance of the prediction target. However, we will add a note in the experiments section acknowledging potential confounding factors and discussing why the target choice is the primary variable. revision: partial
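A minimal sketch of the controlled comparison the rebuttal describes, assuming a generic linear interpolant; `model` stands for the shared Transformer, and the boolean switch is the only varied factor:

```python
import torch

def ablation_step(model, x_clean: torch.Tensor, predict_clean: bool) -> torch.Tensor:
    """One training step where only the regression target changes.

    Architecture, data, noise schedule, and loss form are held fixed;
    predict_clean=True gives the JiT-style x-prediction objective,
    predict_clean=False the standard noise-prediction baseline.
    """
    b = x_clean.shape[0]
    t = torch.rand(b, device=x_clean.device).view(b, 1, 1, 1)
    eps = torch.randn_like(x_clean)
    x_t = (1 - t) * x_clean + t * eps            # identical corruption in both arms

    target = x_clean if predict_clean else eps   # the single controlled variable
    pred = model(x_t, t.flatten())
    return ((pred - target) ** 2).mean()
```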
Circularity Check
No significant circularity; empirical demonstration remains self-contained without reductions to fitted inputs or self-citations.
full rationale
The manuscript advances an empirical claim that direct clean-image prediction with large-patch pixel Transformers yields competitive ImageNet results at 256/512 resolution, without tokenizers or pre-training. The manifold assumption is invoked as an explanatory premise for why this modeling choice succeeds where noise prediction fails, but the paper presents no equations, derivations, or parameter fits that reduce the reported performance to the assumption by construction. No self-citation chains, uniqueness theorems, or ansatzes are used to justify core choices; results are benchmark numbers rather than forced predictions. The derivation chain is therefore independent of its inputs and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Natural data lies on a low-dimensional manifold, whereas noised quantities do not.
Lean theorems connected to this paper
-
Foundation.LawOfExistence.defect_zero_iff_one · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces.
-
Foundation.JCostCoshIdentity.jcost_exp_cosh_form · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Predicting clean data is fundamentally different from predicting noise or a noised quantity.
-
Foundation.DiscretenessForcing.continuous_no_isolated_zero_defect · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 40 Pith papers
-
FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution
FluxFlow is a conservative pixel-space flow-matching framework for astronomical super-resolution that incorporates real atmospheric uncertainty and a training-free Wiener correction, outperforming baselines on a new 1...
-
Binomial flows: Denoising and flow matching for discrete ordinal data
Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.
-
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
-
Grokking of Diffusion Models: Case Study on Modular Addition
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
-
Coevolving Representations in Joint Image-Feature Diffusion
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
-
FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking
FARM is a foundation model combining masked autoencoders and diffusion decoders to estimate high-resolution aerial radio maps from a new multi-band low-altitude dataset, claiming superior accuracy and generalization o...
-
Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies
A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.
-
ELF: Embedded Language Flows
ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...
-
FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
A Few-Step Generative Model on Cumulative Flow Maps
Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.
-
High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification
MSDiff maps degraded hyperspectral data to a low-dimensional manifold and uses diffusion to regularize features for more robust classification under complex degradations.
-
CoreFlow: Low-Rank Matrix Generative Models
CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport
VOLT is a probabilistic transport method with a 3D anisotropic network that improves wide-field microscopy volume reconstruction in lateral and axial directions while supplying voxel-wise credibility estimates.
-
Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing
RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving
FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...
-
CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation
LGCD creates pseudo-overlapping user data via LLM reasoning and uses conditional diffusion to generate target-domain user representations for inter-domain sequential recommendation without real overlapping users.
-
ML-based approach to classification and generation of structured light propagation in turbulent media
ML models classify and generate structured light in turbulence using CNNs and diffusion models enhanced by Bregman distance minimization.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
-
FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution
FluxFlow uses conservative pixel-space flow-matching with uncertainty weights and Wiener test-time correction to outperform baselines on photometric and scientific accuracy for ground-to-space super-resolution, valida...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
-
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
-
PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction
PoreDiT generates 1024^3 voxel digital rock models via 3D Swin Transformer binary pore-field prediction, matching prior methods on porosity, permeability, and Euler characteristics while running on consumer hardware.
-
Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification
Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.
-
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
The second NTIRE challenge on day and night raindrop removal for dual-focused images received 17 valid team submissions that demonstrated strong performance on the Raindrop Clarity dataset.
-
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
The NTIRE 2026 challenge reports strong performance from 17 teams on raindrop removal for dual-focused day and night images using an adjusted real-world dataset with 14,139 training images.
Reference graph
Works this paper leans on
-
[1]
Building normalizing flows with stochastic interpolants
Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023
work page 2023
-
[2]
Deep variational information bottleneck
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017
work page 2017
-
[3]
Topology and data
Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009
work page 2009
-
[4]
Semi-Supervised Learning
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006
work page 2006
-
[5]
Neural ordinary differential equations
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018
work page 2018
-
[6]
PixelFlow: Pixel-space generative models with flow
Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. PixelFlow: Pixel-space generative models with flow. arXiv:2504.07963, 2025
-
[7]
On the importance of noise scheduling for diffusion models
Ting Chen. On the importance of noise scheduling for diffusion models. arXiv:2301.10972, 2023
-
[8]
Deconstructing denoising diffusion models for self-supervised learning
Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He. Deconstructing denoising diffusion models for self-supervised learning. In ICLR, 2025
work page 2025
-
[9]
Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007
work page 2007
-
[10]
Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023
work page 2023
-
[11]
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009
work page 2009
-
[12]
Diffusion models beat GANs on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021
work page 2021
-
[13]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021
work page 2021
-
[14]
Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006
work page 2006
-
[15]
Scaling rectified flow Transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. In ICML, 2024
work page 2024
-
[16]
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014
work page 2014
-
[17]
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017
work page 2017
-
[18]
Training Agents Inside of Scalable World Models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv:2509.24527, 2025
work page 2025
-
[19]
Query-key normalization for Transformers
Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. In Findings of EMNLP, 2020
work page 2020
-
[20]
Karl Heun. Neue Methoden zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Z. Math. Phys., 45:23–38, 1900
work page 1900
-
[21]
GANs trained by a two time-scale update rule converge to a local Nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017
work page 2017
-
[22]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021
work page 2021
-
[23]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020
work page 2020
-
[24]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. DDPM GitHub repo, diffusion_utils_2.py, line 155, 2020
work page 2020
-
[25]
simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. ICML, 2023
work page 2023
-
[26]
Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. In CVPR, 2025
work page 2025
-
[27]
What secrets do your manifolds hold? Understanding the local geometry of generative models
Ahmed Imtiaz Humayun, Ibtihel Amara, Cristina Vasconcelos, Deepak Ramachandran, Candice Schumann, Junfeng He, Katherine Heller, Golnoosh Farnadi, Negar Rostamzadeh, and Mohammad Havaei. What secrets do your manifolds hold? Understanding the local geometry of generative models. In ICLR, 2025
work page 2025
-
[28]
Scalable adaptive computation for iterative generation
Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In ICML, 2023
work page 2023
-
[29]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022
work page 2022
-
[30]
Understanding diffusion objectives as the ELBO with simple data augmentation
Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023
work page 2023
-
[31]
Adam: A method for stochastic optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015
work page 2015
-
[32]
Improved precision and recall metric for assessing generative models
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. NeurIPS, 2019
work page 2019
-
[33]
Applying guidance in a limited interval improves sample and distribution quality in diffusion models
Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024
work page 2024
-
[34]
Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end-to-end pixel space generative modeling via self-supervised pre-training. arXiv:2510.12586, 2025
-
[35]
Autoregressive image generation without vector quantization
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024
work page 2024
-
[36]
Fractal generative models
Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. arXiv:2502.17437, 2025
-
[37]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023
work page 2023
-
[38]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023
work page 2023
-
[39]
Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024
work page 2024
-
[40]
SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. In ECCV, 2024
work page 2024
-
[41]
Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv:1312.5663, 2013
work page 2013
-
[42]
Peyman Milanfar and Mauricio Delbracio. Denoising: a powerful building block for imaging, inverse problems and machine learning. Philosophical Transactions A, 383(2299):20240326, 2025
work page 2025
-
[43]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021
work page 2021
-
[44]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021
work page 2021
-
[45]
Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023
work page 2023
-
[46]
Scalable diffusion models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023
work page 2023
-
[47]
Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003
work page 2003
-
[48]
Contractive auto-encoders: Explicit invariance during feature extraction
Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011
work page 2011
-
[49]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022
work page 2022
-
[50]
U-Net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015
work page 2015
-
[51]
Nonlinear dimensionality reduction by locally linear embedding
Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000
work page 2000
-
[52]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022
work page 2022
-
[53]
Improved techniques for training GANs
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. NeurIPS, 29, 2016
work page 2016
-
[54]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve Transformer. arXiv:2002.05202, 2020
work page 2020
-
[55]
Latent diffusion model without variational autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv:2510.15301, 2025
-
[56]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014
work page 2014
-
[57]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015
work page 2015
-
[58]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021
work page 2021
-
[59]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019
work page 2019
-
[60]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021
work page 2021
-
[61]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014
work page 2014
-
[62]
RoFormer: Enhanced Transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
work page 2024
-
[63]
Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000
work page 2000
-
[64]
The information bottleneck method
Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000
work page 2000
-
[65]
JetFormer: an autoregressive generative model of raw images and text
Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: an autoregressive generative model of raw images and text. In ICLR, 2025
work page 2025
-
[66]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017
work page 2017
-
[67]
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011
work page 2011
-
[68]
Extracting and composing robust features with denoising autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008
work page 2008
-
[69]
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010
work page 2010
-
[70]
PixNerd: Pixel neural field diffusion
Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. PixNerd: Pixel neural field diffusion. arXiv:2507.23268, 2025
-
[71]
DDT: Decoupled diffusion Transformer
Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion Transformer. arXiv:2504.05741, 2025
-
[72]
Diffusion model for generative image denoising
Yutong Xie, Minne Yuan, Bin Dong, and Quanzheng Li. Diffusion model for generative image denoising. In ICCV, 2023
work page 2023
-
[73]
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025
work page 2025
-
[74]
Representation alignment for generation: Training diffusion Transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion Transformers is easier than you think. In ICLR, 2025
work page 2025
-
[75]
Root mean square layer normalization
Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019
work page 2019
-
[77]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018
work page 2018
-
[78]
Diffusion Transformers with Representation Autoencoders
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion Transformers with representation autoencoders. arXiv:2510.11690, 2025
work page 2025
-
[79]
From learning models of natural image patches to whole image restoration
Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011
work page 2011