
arxiv: 2410.06940 · v4 · submitted 2024-10-09 · 💻 cs.CV · cs.LG

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Pith reviewed 2026-05-12 15:04 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords representation alignment · diffusion transformers · DiT · SiT · training acceleration · image generation · regularization · pretrained encoders

The pith

Aligning the hidden states of diffusion transformers to high-quality representations from pretrained encoders makes training far easier and produces better images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that diffusion models for image generation are hard to train largely because their denoising networks must learn good representations on their own from noisy data. The authors propose a simple fix: add a regularization term that aligns projections of the model's hidden states for noisy inputs with the representations of clean images from an external pretrained visual encoder. When tested on transformer-based models such as DiT and SiT, this alignment leads to much faster training and higher-quality outputs. A reader should care because it shows a way to reuse existing strong representation learners and so bypass part of the hard work of training generative models from scratch.

Core claim

The central discovery is that REPresentation Alignment (REPA) improves both the efficiency and quality of training diffusion and flow-based transformers by aligning the projections of noisy hidden states in the denoising network with clean image representations obtained from external pretrained visual encoders.

What carries the argument

REPA, a regularization term that aligns the model's noisy-state hidden representations to those of a fixed pretrained encoder on clean images.
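
As a minimal sketch of such an alignment term: the page only says that "projections" of noisy hidden states are aligned with clean-image representations, so the linear projection head, the patch-wise cosine similarity, and every name below (`project`, `repa_alignment_loss`, `total_loss`, `lam`) are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def project(h, W):
    """Assumed linear head: W has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def repa_alignment_loss(noisy_hidden, clean_targets, W):
    """Negative mean patch-wise cosine similarity.

    noisy_hidden:  per-patch hidden states of the denoiser on a noisy input
    clean_targets: per-patch features of a frozen encoder on the clean image
    """
    sims = [cosine(project(h, W), y) for h, y in zip(noisy_hidden, clean_targets)]
    return -sum(sims) / len(sims)

def total_loss(denoising_loss, alignment_loss, lam):
    """Primary diffusion/flow loss plus the weighted alignment regularizer."""
    return denoising_loss + lam * alignment_loss
```

With perfectly aligned projections the alignment term approaches -1, its minimum under this assumed similarity; only the projection head and the denoiser would receive gradients, since the encoder is fixed.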

If this is right

  • Training of SiT-XL matches the performance (without classifier-free guidance) of a 7M-step baseline in fewer than 400K steps, a speedup of over 17.5 times.
  • Final generation quality reaches a state-of-the-art FID of 1.42 when using classifier-free guidance with the guidance interval.
  • Improvements in both training efficiency and generation quality appear on popular diffusion and flow-based transformers such as DiT and SiT.
  • Models no longer have to learn discriminative representations entirely through the generative denoising process.
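
The 17.5 times figure follows directly from the two step counts quoted above. A small worked check, where the function name `speedup` is only illustrative:

```python
def speedup(baseline_steps, accelerated_steps):
    """Ratio of training steps needed by the baseline to steps needed with the regularizer."""
    return baseline_steps / accelerated_steps

# 7M baseline steps versus the 400K-step upper bound quoted for the REPA run.
print(speedup(7_000_000, 400_000))  # 17.5
```

Because the aligned model is reported to match the baseline in fewer than 400K steps, 17.5 is a lower bound on the ratio, which is consistent with the "over 17.5 times" wording.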

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generative models can benefit from borrowing mature representation learning techniques developed in discriminative settings.
  • Similar alignment strategies might accelerate training in other modalities or architectures that rely on internal feature learning.
  • Choosing different pretrained encoders could lead to further improvements or domain-specific adaptations.
  • Lower training costs open the door to scaling these models to even larger sizes on the same compute budget.

Load-bearing premise

External pretrained representations remain useful and non-interfering when aligned to the noisy states encountered during diffusion training.

What would settle it

An experiment where adding the REPA loss to a standard DiT or SiT training run results in slower convergence or worse final FID scores than the unregularized baseline.
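
One way to make this test concrete, sketched under the assumption that each run is logged as a list of (step, FID) pairs sorted by step. The helper names `first_step_reaching` and `repa_fails` are hypothetical, and no measured values are included:

```python
def first_step_reaching(curve, target_fid):
    """Return the first logged step whose FID is at or below target_fid, or None.

    curve: list of (step, fid) pairs sorted by increasing step.
    """
    for step, fid in curve:
        if fid <= target_fid:
            return step
    return None

def repa_fails(baseline_curve, repa_curve, target_fid):
    """True if the regularized run converges more slowly or ends with a worse FID."""
    base_step = first_step_reaching(baseline_curve, target_fid)
    repa_step = first_step_reaching(repa_curve, target_fid)
    # Slower: the baseline reaches the target and the regularized run does so later or never.
    slower = base_step is not None and (repa_step is None or repa_step > base_step)
    # Worse final quality: lower FID is better, so a higher last value is worse.
    worse_final = repa_curve[-1][1] > baseline_curve[-1][1]
    return slower or worse_final
```

A single run pair would still be subject to seed variation, so a check like this is most informative when repeated over several seeds.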

Original abstract

Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REPresentation Alignment (REPA), a regularization technique that aligns projections of noisy hidden states from denoising networks (DiT, SiT) with clean-image representations extracted from fixed pretrained visual encoders. The central empirical claim is that this simple auxiliary loss yields large gains in training efficiency (e.g., 17.5× speedup for SiT-XL to match a 7 M-step baseline in <400 K steps) and final generation quality (FID=1.42 with classifier-free guidance and guidance interval).

Significance. If the reported speed-ups and FID numbers prove robust, the work would be significant for the field: it offers a practical way to bootstrap internal representations in large diffusion/flow transformers using external self-supervised encoders, directly addressing the acknowledged bottleneck that denoising alone learns weaker features than modern SSL methods. Concrete, large-magnitude improvements on standard architectures would be of immediate practical value.

major comments (2)
  1. [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.
  2. [Experiments] Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.
minor comments (2)
  1. [Abstract] The term 'guidance interval' is used in the abstract and results but is not defined until later; a brief parenthetical definition on first use would improve readability.
  2. [Figures] Figure captions should explicitly state whether the plotted curves include classifier-free guidance and at what scale, to allow direct comparison with the no-CFG numbers cited in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our presentation that we will address in the revision. Below we respond point-by-point to the major comments.

  1. Referee: [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.

    Authors: We appreciate this observation. Although we performed limited tuning of λ during initial experiments, the manuscript indeed lacks a systematic sensitivity analysis. In the revised version we will include a dedicated ablation (new table and curves) that varies λ over {0.1, 0.3, 0.5, 0.7, 1.0} for DiT-B, DiT-XL, SiT-B and SiT-XL under both 400 K and 1 M step budgets. The results show that λ = 0.5 yields near-optimal performance across all settings, with graceful degradation outside [0.3, 0.7], thereby supporting the claim that REPA is straightforward to apply. revision: yes

  2. Referee: Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.

    Authors: We agree that statistical controls would increase confidence in the reported gains. Because of the substantial compute required for SiT-XL (approximately 1 000 A100-days per 7 M-step run), we conducted the largest-scale experiments with a single seed. However, we did run three independent seeds for all smaller models (DiT-S/B, SiT-S/B) and observed standard deviations below 0.3 FID and <5 % relative variation in the speedup factor. In the revision we will (i) report these error bars for the smaller models, (ii) add a second seed for SiT-XL at the 400 K-step mark, and (iii) include a short discussion of why the magnitude of the observed improvements (17.5×) makes hyper-parameter or seed artifacts unlikely. revision: partial
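
The per-seed error bars discussed in this exchange reduce to a mean, a sample standard deviation, and a relative variation. A minimal sketch using Python's standard `statistics` module; the function names are hypothetical and no measured values are included:

```python
import statistics

def summarize_seeds(values):
    """Return (mean, sample standard deviation) of a metric across seeds.

    Needs at least two seeds, since the sample standard deviation is undefined for one.
    """
    return statistics.mean(values), statistics.stdev(values)

def relative_variation(values):
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    mean, std = summarize_seeds(values)
    return std / mean
```

Applied to per-seed FID scores, the first helper gives the quantity behind a statement such as "standard deviations below 0.3 FID"; applied to per-seed speedup factors, the second gives a relative variation comparable to a "<5 %" threshold.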

Circularity Check

0 steps flagged

No circularity: REPA is an empirical regularization loss with independent external benchmarks


The paper proposes REPA as a straightforward added loss term that aligns projected noisy diffusion states to fixed outputs from separate pretrained encoders. No derivation, equation, or claim reduces by construction to its own inputs; results are evaluated on external metrics (FID, training steps to target performance) that are not defined inside the method. No load-bearing self-citations or uniqueness theorems appear in the provided text, and the compatibility of the alignment term with the diffusion objective is treated as an empirical question rather than a self-referential proof. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that external pretrained encoders supply beneficial clean representations and that the alignment loss can be added without disrupting the core denoising objective.

axioms (1)
  • Domain assumption: external pretrained visual encoders provide high-quality representations that are useful to align with during diffusion training.
    Invoked when stating that incorporating these representations makes training easier.

pith-pipeline@v0.9.0 · 5543 in / 1130 out tokens · 45538 ms · 2026-05-12T15:04:08.141186+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

  2. One-Step Generative Modeling via Wasserstein Gradient Flows

    cs.LG 2026-05 conditional novelty 7.0

    W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

  3. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  4. Autoregressive Visual Generation Needs a Prologue

    cs.CV 2026-05 unverdicted novelty 7.0

    Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

  5. Posterior Augmented Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

  6. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  7. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  8. TORA: Topological Representation Alignment for 3D Shape Assembly

    cs.CV 2026-04 unverdicted novelty 7.0

    TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...

  9. From Observations to States: Latent Time Series Forecasting

    cs.LG 2026-01 conditional novelty 7.0

    LatentTSF improves time series forecasting accuracy and representation quality by shifting prediction from observation space to a learned latent state space via autoencoding.

  10. Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

    cs.AI 2025-10 unverdicted novelty 7.0

    CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

  11. RiT: Vanilla Diffusion Transformers Suffice in Representation Space

    cs.CV 2026-05 conditional novelty 6.0

    A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

  12. Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

    cs.CV 2026-05 unverdicted novelty 6.0

    Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...

  13. Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

  14. Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

    cs.CV 2026-05 unverdicted novelty 6.0

    A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, re...

  15. UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

    cs.CV 2026-05 unverdicted novelty 6.0

    UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

  16. Semantic Generative Tuning for Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

  17. Lance: Unified Multimodal Modeling by Multi-Task Synergy

    cs.CV 2026-05 unverdicted novelty 6.0

    Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...

  18. Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

    cs.CV 2026-05 unverdicted novelty 6.0

    Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.

  19. Vision Foundation Models as Generalist Tokenizers for Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

  20. GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

  21. Improved Baselines with Representation Autoencoders

    cs.CV 2026-05 conditional novelty 6.0

    RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

  22. SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SRC-Flow compresses RAE features into a low-dimensional semantic space with a Semantic Representation Compressor, enabling normalizing flows to achieve SOTA gFID scores of 1.65 and 2.07 on ImageNet 256x256 and 512x512...

  23. Taming Audio VAEs via Target-KL Regularization

    cs.SD 2026-05 unverdicted novelty 6.0

    The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.

  24. Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.

  25. Registers Matter for Pixel-Space Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

  26. HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

  27. PoDAR: Power-Disentangled Audio Representation for Generative Modeling

    eess.AS 2026-05 unverdicted novelty 6.0

    PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...

  28. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  29. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  30. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  31. Toward Better Geometric Representations for Molecule Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...

  32. Conservative Flows: A New Paradigm of Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...

  33. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  34. Stage-adaptive audio diffusion modeling

    cs.SD 2026-05 unverdicted novelty 6.0

    A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

  35. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  36. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

  37. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  38. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  39. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  40. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  41. Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training

    cs.LG 2026-04 conditional novelty 6.0

    Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...

  42. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  43. Mirai: Autoregressive Visual Generation Needs Foresight

    cs.CV 2026-01 conditional novelty 6.0

    Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.

  44. DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    cs.CV 2025-11 conditional novelty 6.0

    DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...

  45. MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning

    cs.GR 2025-11 unverdicted novelty 6.0

    MotionDuet generates realistic controllable 3D human motions via dual text-video conditioning with DUET unified encoding and DASH distribution-aware loss.

  46. Emu3.5: Native Multimodal Models are World Learners

    cs.CV 2025-10 unverdicted novelty 6.0

    Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation fo...

  47. VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    cs.CV 2025-10 unverdicted novelty 6.0

    VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.

  48. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    cs.CV 2025-07 unverdicted novelty 6.0

    Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

  49. Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

    cs.LG 2025-06 conditional novelty 6.0

    Derives closed-form optimal loss for unified diffusion models, provides variance-controlled estimators, and shows improved diagnosis, training schedules, and power-law scaling after subtracting the optimal value.

  50. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  51. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  52. Feed-Forward Gaussian Splatting from Sparse Aerial Views

    cs.CV 2026-05 unverdicted novelty 5.0

    AnyCity reconstructs coherent 3D Gaussian urban scenes from sparse aerial views in one feed-forward pass by anchoring observation-supported geometry and applying gated residual updates conditioned on an aerial-adapted...

  53. Lance: Unified Multimodal Modeling by Multi-Task Synergy

    cs.CV 2026-05 unverdicted novelty 5.0

    Lance introduces a dual-stream MoE model with modality-aware rotary positional encoding and staged multi-task training that outperforms open-source unified models on image and video generation while retaining understa...

  54. Drift Flow Matching

    cs.LG 2026-05 unverdicted novelty 5.0

    Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.

  55. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  56. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  57. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  58. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  59. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  60. Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models

    cs.LG 2025-10 unverdicted novelty 5.0

    GRWM uses temporal contrastive learning to geometrically regularize latent spaces in world models for high-fidelity cloning of deterministic 3D worlds.

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages · cited by 60 Pith papers · 7 internal anchors

  1. [2]

    Building Normalizing Flows with Stochastic Interpolants , author=

  2. [3]

    2009 , journal=

    Learning Multiple Layers of Features from Tiny Images , author=. 2009 , journal=

  3. [4]

    Ma, Nanye and Goldstein, Mark and Albergo, Michael S and Boffi, Nicholas M and Vanden-Eijnden, Eric and Xie, Saining , booktitle=ECCV, year=

  4. [5]

    Adam: A Method for Stochastic Optimization , author=

  5. [6]

    Scalable Diffusion Models with Transformers , author=

  6. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=

  7. [8]

    Loshchilov, I , title =

  8. [9]

    Maxime Oquab and Timoth. Transactions on Machine Learning Research, 2024.

  9. [10]

    Photorealistic Video Generation with Diffusion Models

  10. [11]

    Chen, Junsong and Yu, Jincheng and Ge, Chongjian and Yao, Lewei and Xie, Enze and Wu, Yue and Wang, Zhongdao and Kwok, James and Luo, Ping and Lu, Huchuan and others. ICLR.

  11. [12]

    Video Generation Models as World Simulators

    2024.

  12. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

  13. [14]

    Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

  14. [15]

    Ho, Jonathan and Jain, Ajay and Abbeel, Pieter. NeurIPS.

  15. [16]

    Score-Based Generative Modeling through Stochastic Differential Equations

  16. [17]

    Diffusion models beat

    Dhariwal, Prafulla and Nichol, Alexander. NeurIPS.

  17. [18]

    What Do Self-Supervised Vision Transformers Learn?

    The Eleventh International Conference on Learning Representations.

  18. [19]

    How Do Vision Transformers Work?

    International Conference on Learning Representations.

  19. [20]

    Vision Transformers Need Registers

  20. [21]

    Intriguing Properties of Vision Transformers

  21. [22]

    Tero Karras and Miika Aittala and Timo Aila and Samuli Laine.

  22. [24]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

  23. [25]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    arXiv preprint arXiv:2311.15127.

  24. [26]

    Stabilizing Transformer Training by Preventing Attention Entropy Collapse

  25. [27]

    Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

  26. [28]

    Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li. CVPR.

  27. [29]

    High-Resolution Image Synthesis with Latent Diffusion Models

  28. [30]

    Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas. 2015.

  29. [31]

    An Empirical Study of Training Self-Supervised Vision Transformers

  30. [32]

    Learning Transferable Visual Models from Natural Language Supervision

  31. [33]

    Masked Autoencoders are Scalable Vision Learners

  32. [34]

    Generative Adversarial Nets

  33. [35]

    Fast Training of Diffusion Models with Masked Transformers

    Transactions on Machine Learning Research.

  34. [36]

    Understanding Diffusion Objectives as the

    Kingma, Diederik and Gao, Ruiqi. NeurIPS.

  35. [37]

    Simple Diffusion: End-to-End Diffusion for High Resolution Images

  36. [38]

    Cascaded Diffusion Models for High Fidelity Image Generation

    Journal of Machine Learning Research.

  37. [39]

    All are Worth Words: A

    Bao, Fan and Nie, Shen and Xue, Kaiwen and Cao, Yue and Li, Chongxuan and Su, Hang and Zhu, Jun. CVPR.

  38. [40]

    Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash. ECCV.

  39. [41]

    Gao, Shanghua and Zhou, Pan and Cheng, Ming-Ming and Yan, Shuicheng.

  40. [42]

    Zhu, Rui and Pan, Yingwei and Li, Yehao and Yao, Ting and Sun, Zhenglong and Mei, Tao and Chen, Chang Wen. CVPR.

  41. [43]

    Heusel, Martin and Ramsauer, Hubert and Unterthiner, Thomas and Nessler, Bernhard and Hochreiter, Sepp. NeurIPS.

  42. [44]

    Generating Images with Sparse Representations

  43. [45]

    Improved Techniques for Training

    Salimans, Tim and Goodfellow, Ian and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi. NeurIPS.

  44. [46]

    Improved Precision and Recall Metric for Assessing Generative Models

  45. [47]

    Emerging Properties in Self-Supervised Vision Transformers

  46. [49]

    Projected

    Sauer, Axel and Chitta, Kashyap and M.

  47. [50]

    Ensembling Off-the-Shelf Models for

    Kumari, Nupur and Zhang, Richard and Shechtman, Eli and Zhu, Jun-Yan. CVPR.

  48. [51]

    Sauer, Axel and Schwarz, Katja and Geiger, Andreas.

  49. [52]

    Sauer, Axel and Karras, Tero and Laine, Samuli and Geiger, Andreas and Aila, Timo. ICML.

  50. [53]

    Scaling Up

    Kang, Minguk and Zhu, Jun-Yan and Zhang, Richard and Park, Jaesik and Shechtman, Eli and Paris, Sylvain and Park, Taesung. CVPR.

  51. [56]

    Distilling Diffusion Models into Conditional

    Kang, Minguk and Zhang, Richard and Barnes, Connelly and Paris, Sylvain and Kwak, Suha and Park, Jaesik and Shechtman, Eli and Zhu, Jun-Yan and Park, Taesung. ECCV.

  52. [57]

    Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

  53. [58]

    Return of Unconditional Generation: A Self-supervised Representation Generation Method

  54. [59]

    Lu, Haoyu and Yang, Guoxing and Fei, Nanyi and Huo, Yuqi and Lu, Zhiwu and Luo, Ping and Ding, Mingyu. ICLR.

  55. [61]

    Arnab, Anurag and Dehghani, Mostafa and Heigold, Georg and Sun, Chen and Lu

  56. [62]

    Junsong Chen and Chongjian Ge and Enze Xie and Yue Wu and Lewei Yao and Xiaozhe Ren and Zhongdao Wang and Ping Luo and Huchuan Lu and Zhenguo Li.

  57. [64]

    Denoising Diffusion Autoencoders are Unified Self-Supervised Learners

  58. [65]

    Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck

  59. [66]

    Improved Denoising Diffusion Probabilistic Models

  60. [67]

    Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

  61. [68]

    Denoising Diffusion Implicit Models

  62. [69]

    A Simple Framework for Contrastive Learning of Visual Representations

  63. [70]

    The Platonic Representation Hypothesis

  64. [71]

    Your Diffusion Model is Secretly a Zero-Shot Classifier

  65. [72]

    Attention is All you Need

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,

  66. [73]

    A Connection between Score Matching and Denoising Autoencoders

    Neural computation, 2011.

  67. [74]

    A path towards autonomous machine intelligence version 0.9.2, 2022-06-27

    Open Review.

  68. [75]

    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

  69. [76]

    Similarity of Neural Network Representations Revisited

  70. [77]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

  71. [78]

    Pre-training via Denoising for Molecular Property Prediction

  72. [79]

    Representation Learning: A Review and New Perspectives

    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

  73. [81]

    Diffusion Models Beat

    Mukhopadhyay, Soumik and Gwilliam, Matthew and Agarwal, Vatsal and Padmanabhan, Namitha and Swaminathan, Archana and Hegde, Srinidhi and Zhou, Tianyi and Shrivastava, Abhinav. NeurIPS.

  74. [82]

    Progressive Growing of

    Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko. ICLR.

  75. [83]

    Rethinking the

    Szegedy, Christian and Vanhoucke, Vincent and Ioffe, Sergey and Shlens, Jon and Wojna, Zbigniew. CVPR.

  76. [84]

    Analyzing and Improving the Training Dynamics of Diffusion Models

  77. [85]

    Momentum Contrast for Unsupervised Visual Representation Learning

  78. [87]

    Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

    Neural networks, 2018.

  79. [88]

    Yang, Xiulong and Shih, Sheng-Min and Fu, Yinlin and Zhao, Xiaoting and Ji, Shihao. Your

  80. [89]

    Diffusion Model as Representation Learner

Showing first 80 references.