hub Mixed citations

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin · 2024 · cs.CV · arXiv 2410.06940

Mixed citation behavior. Most common role is background (43%).

60 Pith papers citing it

Background 43% of classified citations

open full Pith review browse 60 citing papers arXiv PDF

abstract

Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 6 method 4 extension 1

citation-polarity summary

background 9 baseline 6 use method 4 extend 1 unclear 1

representative citing papers

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

Autoregressive Visual Generation Needs a Prologue

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

Posterior Augmented Flow Matching

cs.CV · 2026-05-01 · unverdicted · novelty 7.0

PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.

TORA: Topological Representation Alignment for 3D Shape Assembly

cs.CV · 2026-04-05 · unverdicted · novelty 7.0

TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks with zero inference overhead.

From Observations to States: Latent Time Series Forecasting

cs.LG · 2026-01-30 · conditional · novelty 7.0

LatentTSF improves time series forecasting accuracy and representation quality by shifting prediction from observation space to a learned latent state space via autoencoding.

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

cs.AI · 2025-10-03 · unverdicted · novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, realism, and aesthetics.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

Taming Audio VAEs via Target-KL Regularization

cs.SD · 2026-05-16 · unverdicted · novelty 6.0

The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.

Registers Matter for Pixel-Space Diffusion Transformers

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

PoDAR: Power-Disentangled Audio Representation for Generative Modeling

eess.AS · 2026-05-11 · unverdicted · novelty 6.0

PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when applied to Stable Audio VAE with F5-TTS.

citing papers explorer

Showing 50 of 60 citing papers.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 1 · 2 links · internal anchor
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
What Cohort INRs Encode and Where to Freeze Them cs.LG · 2026-05-08 · unverdicted · none · ref 70 · internal anchor
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
Autoregressive Visual Generation Needs a Prologue cs.CV · 2026-05-07 · unverdicted · none · ref 60 · internal anchor
Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.
Posterior Augmented Flow Matching cs.CV · 2026-05-01 · unverdicted · none · ref 35 · internal anchor
PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 89 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image cs.CV · 2026-04-06 · unverdicted · none · ref 64 · internal anchor
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.
TORA: Topological Representation Alignment for 3D Shape Assembly cs.CV · 2026-04-05 · unverdicted · none · ref 60 · internal anchor
TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks with zero inference overhead.
From Observations to States: Latent Time Series Forecasting cs.LG · 2026-01-30 · conditional · none · ref 17 · internal anchor
LatentTSF improves time series forecasting accuracy and representation quality by shifting prediction from observation space to a learned latent state space via autoencoding.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner cs.AI · 2025-10-03 · unverdicted · none · ref 43 · internal anchor
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space cs.CV · 2026-05-21 · conditional · none · ref 41 · internal anchor
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis cs.CV · 2026-05-20 · unverdicted · none · ref 14 · internal anchor
Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics cs.CV · 2026-05-20 · unverdicted · none · ref 32 · internal anchor
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, realism, and aesthetics.
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register cs.CV · 2026-05-19 · unverdicted · none · ref 39 · internal anchor
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 82 · internal anchor
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 143 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling cs.CV · 2026-05-18 · unverdicted · none · ref 41 · internal anchor
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 93 · internal anchor
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 87 · internal anchor
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Improved Baselines with Representation Autoencoders cs.CV · 2026-05-18 · conditional · none · ref 64 · internal anchor
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
Taming Audio VAEs via Target-KL Regularization cs.SD · 2026-05-16 · unverdicted · none · ref 49 · internal anchor
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 44 · internal anchor
sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.
Registers Matter for Pixel-Space Diffusion Transformers cs.CV · 2026-05-15 · unverdicted · none · ref 44 · internal anchor
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion cs.CV · 2026-05-15 · unverdicted · none · ref 28 · internal anchor
HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
PoDAR: Power-Disentangled Audio Representation for Generative Modeling eess.AS · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when applied to Stable Audio VAE with F5-TTS.
The two clocks and the innovation window: When and how generative models learn rules cs.LG · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 100 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models cs.CV · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
Toward Better Geometric Representations for Molecule Generative Models cs.LG · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% stability on GEOM-DRUG.
Conservative Flows: A New Paradigm of Generative Models cs.LG · 2026-05-07 · unverdicted · none · ref 29 · internal anchor
Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow model and showing gains on Swiss-roll, ImageNet-256 and Oxford Flowers-102.
Taming Outlier Tokens in Diffusion Transformers cs.CV · 2026-05-06 · unverdicted · none · ref 37 · internal anchor
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
Stage-adaptive audio diffusion modeling cs.SD · 2026-05-06 · unverdicted · none · ref 21 · internal anchor
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 59 · internal anchor
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CV · 2026-04-27 · unverdicted · none · ref 51 · 2 links · internal anchor
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
Normalizing Flows with Iterative Denoising cs.CV · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation cs.CV · 2026-04-20 · unverdicted · none · ref 26 · internal anchor
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 62 · internal anchor
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 79 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training cs.LG · 2026-04-08 · conditional · none · ref 37 · internal anchor
Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform sampling.
PixelGen: Improving Pixel Diffusion with Perceptual Supervision cs.CV · 2026-02-02 · accept · none · ref 24 · internal anchor
PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
Mirai: Autoregressive Visual Generation Needs Foresight cs.CV · 2026-01-21 · conditional · none · ref 49 · internal anchor
Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 68 · internal anchor
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning cs.GR · 2025-11-22 · unverdicted · none · ref 8 · internal anchor
MotionDuet generates realistic controllable 3D human motions via dual text-video conditioning with DUET unified encoding and DASH distribution-aware loss.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 120 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models cs.CV · 2025-10-21 · unverdicted · none · ref 24 · internal anchor
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV · 2025-07-10 · unverdicted · none · ref 88 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value cs.LG · 2025-06-16 · conditional · none · ref 53 · internal anchor
Derives closed-form optimal loss for unified diffusion models, provides variance-controlled estimators, and shows improved diagnosis, training schedules, and power-law scaling after subtracting the optimal value.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 93 · internal anchor
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning cs.CV · 2024-12-18 · unverdicted · none · ref 273 · internal anchor
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Feed-Forward Gaussian Splatting from Sparse Aerial Views cs.CV · 2026-05-19 · unverdicted · none · ref 40 · internal anchor
AnyCity reconstructs coherent 3D Gaussian urban scenes from sparse aerial views in one feed-forward pass by anchoring observation-supported geometry and applying gated residual updates conditioned on an aerial-adapted video diffusion prior.
Drift Flow Matching cs.LG · 2026-05-17 · unverdicted · none · ref 52 · internal anchor
Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer