
arxiv: 2511.13720 · v2 · submitted 2025-11-17 · 💻 cs.CV

Back to Basics: Let Denoising Generative Models Denoise

Pith reviewed 2026-05-11 22:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords denoising diffusion · generative models · transformers · image generation · manifold assumption · ImageNet · pixel-level prediction · clean data prediction

The pith

Predicting clean images directly with simple Transformers on raw pixels produces competitive generative models for ImageNet at 256 and 512 resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current denoising diffusion models avoid directly predicting clean images and instead forecast noise or other noised quantities. It argues that this choice ignores the manifold structure of natural data, where clean images occupy a low-dimensional surface while noisy versions fill the full high-dimensional space. By instead training models to map noisy inputs straight back to clean pixels, apparently limited networks can still succeed in pixel space without tokenizers, pre-training, or extra losses. The resulting JiT models, which are plain large-patch Transformers, reach competitive generation quality on ImageNet at both 256 and 512 resolution.

Core claim

Directly predicting the clean data from noised inputs, rather than predicting noise or a noised quantity, lets simple large-patch Transformers operate effectively as generative models on raw pixels. These JiT networks require no tokenizer, no pre-training, and no auxiliary loss yet produce competitive samples on ImageNet at 256 and 512 resolution, where high-dimensional noise prediction tends to fail.
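
To make the difference between prediction targets concrete, here is a minimal NumPy sketch. It assumes a linear interpolation z = t*x + (1 - t)*eps between noise and clean data, a schedule this summary does not specify, and all function names are hypothetical rather than taken from the paper. Under that assumption the two targets can be converted into each other given z and t, so the choice concerns what the network must output, not what information it receives.

```python
import numpy as np

def make_noisy(x, eps, t):
    """Blend clean data x with noise eps; t=1 is clean, t=0 is pure noise (assumed schedule)."""
    return t * x + (1.0 - t) * eps

def eps_from_x(z, x_hat, t):
    """Noise estimate implied by a clean-data estimate x_hat."""
    return (z - t * x_hat) / (1.0 - t)

def x_from_eps(z, eps_hat, t):
    """Clean-data estimate implied by a noise estimate eps_hat."""
    return (z - (1.0 - t) * eps_hat) / t

def x_prediction_loss(x_hat, x):
    """Mean squared error against the clean image (the target the paper advocates)."""
    return float(np.mean((x_hat - x) ** 2))

def eps_prediction_loss(eps_hat, eps):
    """Mean squared error against the noise (the conventional target)."""
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(2, 3, 8, 8))   # stand-in for a batch of clean images
eps = rng.standard_normal(x.shape)              # Gaussian noise of the same shape
t = 0.3
z = make_noisy(x, eps, t)

# A perfect clean-data prediction implies a perfect noise prediction, and vice versa.
print(np.allclose(eps_from_x(z, x, t), eps), np.allclose(x_from_eps(z, eps, t), x))
```

A real training loop would replace the perfect predictions above with a network output and minimise one of the two losses; the sketch only shows how the targets relate.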

What carries the argument

JiT, or Just image Transformers: large-patch Transformers applied directly to pixels that predict clean data from noised versions by exploiting the manifold structure of natural images.
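
As a rough illustration of what "large-patch" means on raw pixels, the sketch below splits an RGB image into non-overlapping patches and reports the per-token dimensionality. The `patchify` helper is hypothetical, and pairing patch size 16 with resolution 256 and patch size 32 with resolution 512 is an assumption made here for the arithmetic; the summary lists the patch sizes and resolutions without stating how they are paired.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh * gw, patch * patch * c)
    tokens = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(gh * gw, patch * patch * c)

for res, patch in [(256, 16), (512, 32)]:   # assumed pairings, for illustration only
    tokens = patchify(np.zeros((res, res, 3), dtype=np.float32), patch)
    print(res, patch, tokens.shape)
# 256 16 (256, 768)
# 512 32 (256, 3072)
```

Each token therefore carries 768 or 3072 raw pixel values, which is the high-dimensional per-token space the summary refers to.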

If this is right

  • Networks with limited capacity can still generate high-resolution images when trained to recover points on the data manifold.
  • Generative performance remains competitive without tokenizers or pre-training when the prediction target is the clean image.
  • Large patch sizes of 16 and 32 become viable for Transformer-based diffusion on raw pixels.
  • A self-contained training paradigm for diffusion models on natural images is possible without auxiliary components.
  • Direct clean-image prediction avoids catastrophic failure modes observed when predicting high-dimensional noised quantities.
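
The capacity argument in the points above can be caricatured with a toy linear model. This is not the paper's experiment: the data are synthetic Gaussian points on a random low-dimensional subspace, the "network" is a least-squares linear map, and truncating that map to a low rank via SVD is only a crude stand-in for limited capacity. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n, t = 64, 4, 20000, 0.5   # ambient dim, subspace dim, samples, noise level (illustrative)

# Toy clean data: Gaussian points on a random d-dimensional linear subspace of R^D.
basis, _ = np.linalg.qr(rng.standard_normal((D, d)))   # (D, d) with orthonormal columns
x = rng.standard_normal((n, d)) @ basis.T               # (n, D) clean samples
eps = rng.standard_normal((n, D))                       # full-dimensional noise
z = t * x + (1.0 - t) * eps                             # noised inputs (assumed linear schedule)

def fit_linear(inputs, targets, rank=None):
    """Least-squares linear map, optionally truncated to `rank` via SVD."""
    w, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    if rank is not None:
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        w = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return w

def mean_sq_error(inputs, targets, w):
    """Average squared error per sample, summed over the D coordinates."""
    return float(np.mean(np.sum((inputs @ w - targets) ** 2, axis=1)))

results = {}
for name, target in [("x", x), ("eps", eps)]:
    full = mean_sq_error(z, target, fit_linear(z, target))
    low = mean_sq_error(z, target, fit_linear(z, target, rank=d))
    results[name] = (full, low)
    print(f"{name}-prediction: full-rank error {full:.2f}, rank-{d} error {low:.2f}")
```

In this toy setup the rank-d map predicts x as well as the full-rank map, because the clean targets span only d directions, while the same truncation destroys most of the accuracy of the noise predictor. It is a linear sketch of the intuition only and says nothing about how Transformers behave on ImageNet.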

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-prediction strategy could reduce architectural complexity in generative models for other high-dimensional data such as audio or video.
  • Training dynamics might change when the network is explicitly encouraged to map back onto the manifold rather than into the ambient noise space.
  • Model-size requirements for high-resolution generation could be re-examined under the clean-prediction objective.
  • Classical signal-processing denoising ideas may map more directly onto modern diffusion training once the target is restored to clean data.

Load-bearing premise

Natural data lies on a low-dimensional manifold while noised data does not.
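
The premise can be illustrated, though not tested on real images, with a small synthetic check. The sketch below counts how many principal components are needed to explain 99% of the variance for points on a low-dimensional linear subspace and for the same points after Gaussian noise is mixed in. The data, the 99% threshold, the linear mixing rule, and the helper name are all assumptions chosen for illustration.

```python
import numpy as np

def components_for_variance(samples, fraction=0.99):
    """Number of principal components needed to explain `fraction` of total variance."""
    centered = samples - samples.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index whose cumulative share reaches the threshold, converted to a count.
    return int(np.searchsorted(cumulative, fraction) + 1)

rng = np.random.default_rng(0)
D, d, n, t = 64, 4, 5000, 0.5   # ambient dim, subspace dim, samples, noise level (illustrative)

basis, _ = np.linalg.qr(rng.standard_normal((D, d)))          # random d-dim subspace of R^D
clean = rng.standard_normal((n, d)) @ basis.T                 # points on the subspace
noised = t * clean + (1.0 - t) * rng.standard_normal((n, D))  # noise fills all D directions

clean_dim = components_for_variance(clean)
noised_dim = components_for_variance(noised)
print(clean_dim, noised_dim)
```

The clean samples need only d components, while the noised samples need nearly all D. Natural images are not linear subspaces, so a PCA count like this is at best a coarse proxy for the intrinsic-dimension estimates the referee report asks for.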

What would settle it

A clean-data-predicting large-patch Transformer that produces visibly worse or incoherent samples than a noise-predicting baseline at 512 resolution on ImageNet would falsify the central claim.

Original abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard denoising diffusion models predict noise or noised quantities rather than clean data, and that directly predicting clean images is fundamentally different because natural data lies on a low-dimensional manifold while noised quantities do not. This allows simple, under-capacity networks to operate in high-dimensional pixel space. The authors introduce JiT (Just image Transformers): large-patch pixel Transformers trained with no tokenizer, no pre-training, and no extra loss, and report competitive ImageNet results at 256 and 512 resolutions with patch sizes 16 and 32, where noise-prediction baselines fail catastrophically.

Significance. If the results hold, the work demonstrates that a back-to-basics clean-data prediction target can enable competitive generative performance with minimal architectural complexity on raw pixels. This provides an empirical existence proof for simple large-patch Transformers as generative models and highlights the modeling choice of prediction target as potentially more important than tokenization or pre-training in high-dimensional settings.

major comments (2)
  1. [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.
  2. [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.
minor comments (2)
  1. [§2] The precise mathematical formulation of the clean-data prediction objective (e.g., the training loss and how it differs from standard noise-prediction diffusion) should be stated explicitly with an equation for reproducibility.
  2. [Tables and figures] Ensure quantitative tables report both patch size and resolution explicitly and include error bars or multiple seeds for the ImageNet 256/512 results to allow direct comparison with noise-prediction baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.

    Authors: We appreciate this observation. The manifold hypothesis for natural images is a standard assumption in the field, with substantial supporting evidence from prior studies on the low-dimensional structure of image data. Our work builds on this by demonstrating that direct prediction of clean data enables effective modeling in high-dimensional pixel space with simple architectures, in contrast to noise prediction. While we do not provide new intrinsic dimension calculations, the empirical results—particularly the failure of noise prediction at large patch sizes—serve as indirect validation. In the revised manuscript, we will expand the discussion in Section 1 to include references to key literature on image manifolds and clarify the role of this assumption. revision: partial

  2. Referee: [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.

    Authors: We agree that careful isolation of variables strengthens the argument. Our experiments compare clean-data prediction (JiT) against noise-prediction baselines using the exact same Transformer architecture, patch sizes, and training protocol on raw pixels, with the only difference being the prediction target. This setup controls for network capacity and largely for optimization dynamics, as the training procedure is identical. The loss geometry is inherently tied to the choice of target, which is the central modeling decision under investigation. We believe this provides sufficient evidence for the importance of the prediction target. However, we will add a note in the experiments section acknowledging potential confounding factors and discussing why the target choice is the primary variable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical demonstration remains self-contained without reductions to fitted inputs or self-citations.

full rationale

The manuscript advances an empirical claim that direct clean-image prediction with large-patch pixel Transformers yields competitive ImageNet results at 256/512 resolution, without tokenizers or pre-training. The manifold assumption is invoked as an explanatory premise for why this modeling choice succeeds where noise prediction fails, but the paper presents no equations, derivations, or parameter fits that reduce the reported performance to the assumption by construction. No self-citation chains, uniqueness theorems, or ansatzes are used to justify core choices; results are benchmark numbers rather than forced predictions. The derivation chain is therefore independent of its inputs and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about data manifolds; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Natural data lies on a low-dimensional manifold, whereas noised quantities do not.
    Invoked in the abstract to justify why predicting clean data is fundamentally different and advantageous.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces.

  • Foundation.JCostCoshIdentity jcost_exp_cosh_form echoes

    Predicting clean data is fundamentally different from predicting noise or a noised quantity.

  • Foundation.DiscretenessForcing continuous_no_isolated_zero_defect echoes

    simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    SMFP introduces a one-step generative policy class using MeanFlow to map noise to actions, providing a tractable entropy surrogate for unified off-policy mirror descent training that outperforms Gaussian and generativ...

  2. Let EEG Models Learn EEG

    cs.CV 2026-05 unverdicted novelty 7.0

    JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising me...

  3. CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

  4. Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

    cs.GR 2026-05 unverdicted novelty 7.0

    Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.

  5. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 7.0

    FluxFlow is a conservative pixel-space flow-matching framework for astronomical super-resolution that incorporates real atmospheric uncertainty and a training-free Wiener correction, outperforming baselines on a new 1...

  6. Binomial flows: Denoising and flow matching for discrete ordinal data

    cs.LG 2026-05 unverdicted novelty 7.0

    Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.

  7. Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

  8. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  9. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  10. FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking

    eess.SP 2026-04 unverdicted novelty 7.0

    FARM is a foundation model combining masked autoencoders and diffusion decoders to estimate high-resolution aerial radio maps from a new multi-band low-altitude dataset, claiming superior accuracy and generalization o...

  11. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  12. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 unverdicted novelty 7.0

    Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

  13. Latent Generative Solvers for Generalizable Long-Term Physics Simulation

    cs.AI 2026-02 unverdicted novelty 7.0

    LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.

  14. PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...

  15. RiT: Vanilla Diffusion Transformers Suffice in Representation Space

    cs.CV 2026-05 conditional novelty 6.0

    A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

  16. Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

  17. Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

    cs.CV 2026-05 unverdicted novelty 6.0

    HDFM adds a continuous heat-dissipation (blur) process to flow matching, aligns an interpolated path to fix ill-posed inverse heat dissipation, and uses x-prediction to ease high-dimensional regression, yielding bette...

  18. WavFlow: Audio Generation in Waveform Space

    cs.SD 2026-05 conditional novelty 6.0

    WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.

  19. Improved Baselines with Representation Autoencoders

    cs.CV 2026-05 conditional novelty 6.0

    RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

  20. Registers Matter for Pixel-Space Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

  21. HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

  22. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  23. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  24. Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.

  25. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  26. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  27. Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

  28. FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

  29. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  30. A Few-Step Generative Model on Cumulative Flow Maps

    cs.LG 2026-05 unverdicted novelty 6.0

    Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.

  31. High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification

    cs.CV 2026-04 unverdicted novelty 6.0

    MSDiff maps degraded hyperspectral data to a low-dimensional manifold and uses diffusion to regularize features for more robust classification under complex degradations.

  32. CoreFlow: Low-Rank Matrix Generative Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.

  33. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

  34. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  35. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  36. VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport

    eess.IV 2026-04 unverdicted novelty 6.0

    VOLT is a probabilistic transport method with a 3D anisotropic network that improves wide-field microscopy volume reconstruction in lateral and axial directions while supplying voxel-wise credibility estimates.

  37. Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing

    cs.LG 2026-04 unverdicted novelty 6.0

    RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.

  38. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  39. FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...

  40. CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

    cs.CV 2026-04 unverdicted novelty 6.0

    CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

  41. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  42. From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    LGCD creates pseudo-overlapping user data via LLM reasoning and uses conditional diffusion to generate target-domain user representations for inter-domain sequential recommendation without real overlapping users.

  43. FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

    cs.CL 2026-04 unverdicted novelty 6.0

    FlowLM converts diffusion LMs to flow matching via fine-tuning, achieving few-step generation that rivals or beats 2000-step diffusion and saturates faster than training flow models from scratch.

  44. ML-based approach to classification and generation of structured light propagation in turbulent media

    physics.optics 2026-04 unverdicted novelty 6.0

    ML models classify and generate structured light in turbulence using CNNs and diffusion models enhanced by Bregman distance minimization.

  45. What Does Flow Matching Bring To TD Learning?

    cs.LG 2026-03 conditional novelty 6.0

    Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.

  46. Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

    cs.RO 2026-02 unverdicted novelty 6.0

    The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x perform...

  47. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 conditional novelty 6.0

    Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.

  48. ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    cs.CV 2026-02 unverdicted novelty 6.0

    ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.

  49. Protein Autoregressive Modeling via Multiscale Structure Generation

    cs.LG 2026-02 unverdicted novelty 6.0

    PAR is a multi-scale autoregressive transformer framework for protein backbone generation that uses coarse-to-fine prediction, noisy context learning, and flow-based decoding to achieve high-quality unconditional and ...

  50. Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

    cs.LG 2026-02 conditional novelty 6.0

    An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.

  51. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  52. Sampling-Free Diffusion Transformers for Low-Complexity MIMO Channel Estimation

    eess.SP 2026-02 unverdicted novelty 6.0

    A diffusion transformer directly maps noisy MIMO channel observations to clean estimates in a single pass by exploiting angular sparsity, achieving better accuracy and much lower complexity than iterative diffusion baselines.

  53. PixelDiT: Pixel Diffusion Transformers for Image Generation

    cs.CV 2025-11 conditional novelty 6.0

    PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, outperforming prior pixel-space models.

  54. Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

    cs.LG 2026-05 unverdicted novelty 5.0

    Stochastic MeanFlow Policies enable one-step generative control in off-policy mirror descent by mapping noise through a MeanFlow transform, yielding tractable entropy and improved MuJoCo performance over Gaussian and ...

  55. PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.

  56. FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

  57. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  58. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  59. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  60. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 5.0

    FluxFlow uses conservative pixel-space flow-matching with uncertainty weights and Wiener test-time correction to outperform baselines on photometric and scientific accuracy for ground-to-space super-resolution, valida...

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 64 Pith papers · 6 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023

  2. [2]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017

  3. [3]

    Topology and data

    Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009

  4. [4]

    Semi-Supervised Learning

    Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006

  5. [5]

    Neural ordinary differential equations

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018

  6. [6]

    PixelFlow: Pixel-space generative models with flow

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. PixelFlow: Pixel-space generative models with flow. arXiv:2504.07963, 2025

  7. [7]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv:2301.10972, 2023

  8. [8]

    Deconstructing denoising diffusion models for self-supervised learning

    Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He. Deconstructing denoising diffusion models for self-supervised learning. In ICLR, 2025

  9. [9]

    Image denoising by sparse 3-D transform-domain collaborative filtering

    Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007

  10. [10]

    Inversion by direct iteration: An alternative to denoising diffusion for image restoration

    Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023

  11. [11]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009

  12. [12]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  14. [14]

    Image denoising via sparse and redundant representations over learned dictionaries

    Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006

  15. [15]

    Scaling rectified flow Transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. In ICML, 2024

  16. [16]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  17. [17]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017

  18. [18]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv:2509.24527, 2025

  19. [19]

    Query-key normalization for Transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. In Findings of EMNLP, 2020

  20. [20]

    Neue methoden zur approximativen integration der differentialgleichungen einer unabhängigen veränderlichen

    Karl Heun. Neue methoden zur approximativen integration der differentialgleichungen einer unabhängigen veränderlichen. Z. Math. Phys, 45:23–38, 1900

  21. [21]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017

  22. [22]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021

  23. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  24. [24]

    DDPM github repo

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. DDPM github repo. L155, diffusion_utils_2.py, 2020

  25. [25]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. ICML, 2023

  26. [26]

    Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

    Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. In CVPR, 2025

  27. [27]

    What secrets do your manifolds hold? understanding the local geometry of generative models

    Ahmed Imtiaz Humayun, Ibtihel Amara, Cristina Vasconcelos, Deepak Ramachandran, Candice Schumann, Junfeng He, Katherine Heller, Golnoosh Farnadi, Negar Rostamzadeh, and Mohammad Havaei. What secrets do your manifolds hold? understanding the local geometry of generative models. In ICLR, 2025

  28. [28]

    Scalable adaptive computation for iterative generation

    Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In ICML, 2023

  29. [29]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022

  30. [30]

    Understanding diffusion objectives as the ELBO with simple data augmentation

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023

  31. [31]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  32. [32]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. NeurIPS, 2019

  33. [33]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024

  34. [34]

    Advancing end-to-end pixel space generative modeling via self-supervised pre-training

    Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end-to-end pixel space generative modeling via self-supervised pre-training. arXiv:2510.12586, 2025

  35. [35]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024

  36. [36]

    Fractal generative models

    Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. arXiv:2502.17437, 2025

  37. [37]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023

  38. [38]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  39. [39]

    Deep generative models through the lens of the manifold hypothesis: A survey and new connections

    Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024

  40. [40]

    SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. In ECCV, 2024

  41. [41]

    k-Sparse Autoencoders

    Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv:1312.5663, 2013

  42. [42]

    Denoising: a powerful building block for imaging, inverse problems and machine learning

    Peyman Milanfar and Mauricio Delbracio. Denoising: a powerful building block for imaging, inverse problems and machine learning. Philosophical Transactions A, 383(2299):20240326, 2025

  43. [43]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  44. [44]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021

  45. [45]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023

  46. [46]

    Scalable diffusion models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023

  47. [47]

    Image denoising using scale mixtures of Gaussians in the wavelet domain

    Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003

  48. [48]

    Contractive auto-encoders: Explicit invariance during feature extraction

    Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011

  49. [49]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  50. [50]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015

  51. [51]

    Nonlinear dimensionality reduction by locally linear embedding

    Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000

  52. [52]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022

  53. [53]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. NeurIPS, 29, 2016

  54. [54]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve Transformer. arXiv:2002.05202, 2020

  55. [55]

    Latent diffusion model without variational autoencoder

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv:2510.15301, 2025

  56. [56]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014

  57. [57]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015

  58. [58]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021

  59. [59]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019

  60. [60]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  61. [61]

    Dropout: a simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014

  62. [62]

    RoFormer: Enhanced Transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  63. [63]

    A global geometric framework for nonlinear dimensionality reduction

    Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000

  64. [64]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000

  65. [65]

    JetFormer: an autoregressive generative model of raw images and text

    Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: an autoregressive generative model of raw images and text. In ICLR, 2025

  66. [66]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017

  67. [67]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  68. [68]

    Extracting and composing robust features with denoising autoencoders

    Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008

  69. [69]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010

  70. [70]

    PixNerd: Pixel neural field diffusion

    Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. PixNerd: Pixel neural field diffusion. arXiv:2507.23268, 2025

  71. [71]

    DDT: Decoupled diffusion Transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion Transformer. arXiv:2504.05741, 2025

  72. [72]

    Diffusion model for generative image denoising

    Yutong Xie, Minne Yuan, Bin Dong, and Quanzheng Li. Diffusion model for generative image denoising. In ICCV, 2023

  73. [73]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025

  74. [74]

    Representation alignment for generation: Training diffusion Transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion Transformers is easier than you think. In ICLR, 2025

  75. [75]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019

  76. [77]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  77. [78]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion Transformers with representation autoencoders. arXiv:2510.11690, 2025

  78. [79]

    From learning models of natural image patches to whole image restoration

    Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011