arxiv: 2501.00103 · v1 · submitted 2024-12-30 · 💻 cs.CV


LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi


Pith reviewed 2026-05-11 10:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords latent diffusion · video generation · transformer · video VAE · real-time generation · spatiotemporal attention · text-to-video · image-to-video


The pith

LTX-Video generates 5 seconds of 768x512 video at 24 fps in 2 seconds on an Nvidia H100 GPU by running its transformer in a 1:192 compressed latent space and letting the VAE decoder perform the final denoising step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that video latent diffusion models can reach faster-than-real-time speeds when the Video-VAE and the denoising transformer are designed as one optimized system instead of independent modules. Relocating patchifying from the transformer's input to the VAE's input lets the VAE reach a high 1:192 compression ratio, so the transformer can run full spatiotemporal self-attention across the entire video in a small latent space. Because such high compression limits fine detail, the VAE decoder handles both latent-to-pixel conversion and the last denoising step, producing the clean result in pixel space without a separate upsampler. A sympathetic reader would care because, if the claim holds, it removes a major speed bottleneck in current video generators and makes high-quality output practical on a single high-end GPU.

Core claim

LTX-Video is a transformer-based latent diffusion model whose Video-VAE achieves 1:192 spatiotemporal compression with 32x32x8 pixels per token and whose decoder performs both upsampling and the final denoising step directly in pixel space, allowing efficient full self-attention while supporting simultaneous text-to-video and image-to-video training and delivering faster-than-real-time generation.

What carries the argument

The Video-VAE decoder that executes both latent-to-pixel conversion and the final denoising step after the transformer operates in the 1:192 compressed latent space.
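
The division of labour described above can be sketched as control flow. This is a toy illustration only, not the paper's implementation: the rectified-flow-style interpolation, the noise levels, and the names `sample`, `toy_denoise_step`, and `toy_decode` are all assumptions, and both stand-ins are oracles that already know the clean latent, where a real system would use learned networks.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]


def sample(
    denoise_step: Callable[[Latent, float, float], Latent],
    decode: Callable[[Latent, float], List[float]],
    num_tokens: int,
    noise_levels: Sequence[float],
    seed: int = 0,
) -> List[float]:
    """Denoise in latent space down to noise_levels[-1], then let the decoder finish."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(num_tokens)]  # start from pure noise
    for t_cur, t_next in zip(noise_levels[:-1], noise_levels[1:]):
        z = denoise_step(z, t_cur, t_next)  # latent-space steps (the transformer's role)
    # The latent is still noisy here: the last noise level is deliberately above zero.
    return decode(z, noise_levels[-1])  # decoder: final denoising + latent-to-pixel


# --- Toy stand-ins (hypothetical oracles, not learned networks) ---
CLEAN = 0.5   # the clean latent value every token should reach
UPSAMPLE = 4  # toy number of output values per latent token


def toy_denoise_step(z: Latent, t_cur: float, t_next: float) -> Latent:
    # Assumes z_t = (1 - t) * clean + t * noise, so the offset from clean scales with t.
    return [CLEAN + (t_next / t_cur) * (v - CLEAN) for v in z]


def toy_decode(z: Latent, t: float) -> List[float]:
    clean = toy_denoise_step(z, t, 0.0)                 # final denoising step
    return [v for v in clean for _ in range(UPSAMPLE)]  # latent-to-pixel expansion


pixels = sample(toy_denoise_step, toy_decode, num_tokens=3,
                noise_levels=[1.0, 0.6, 0.3, 0.1])
```

The point of the sketch is the last line of `sample`: the decoder receives a latent that still carries noise, together with its noise level, so no separate final latent step or upsampler appears in the loop.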

If this is right

The model produces 5 seconds of 24 fps video at 768x512 resolution in 2 seconds on an Nvidia H100 GPU.
Full spatiotemporal self-attention becomes computationally feasible because the latent space is small enough for the transformer to attend over the whole video at once.
Text-to-video and image-to-video generation are trained jointly in the same model without separate fine-tuning paths.
No separate upsampling network is needed at inference time because the decoder already completes denoising in pixel space.
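
To make the attention-cost argument concrete, the sketch below counts tokens for the headline configuration. It assumes exactly 5 × 24 = 120 frames and ignores any special handling of boundary frames; the 16x16x4 comparison grid is a hypothetical baseline chosen for illustration, not a figure from the paper, and `token_grid` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TokenGrid:
    width: int
    height: int
    frames: int

    @property
    def tokens(self) -> int:
        return self.width * self.height * self.frames


def token_grid(width_px: int, height_px: int, num_frames: int,
               px_per_token: Tuple[int, int, int] = (32, 32, 8)) -> TokenGrid:
    """Number of latent tokens when each token covers px_per_token pixels (w, h, t)."""
    pw, ph, pt = px_per_token
    if width_px % pw or height_px % ph or num_frames % pt:
        raise ValueError("video dimensions must be divisible by the per-token footprint")
    return TokenGrid(width_px // pw, height_px // ph, num_frames // pt)


# Headline configuration: 768x512, 5 s at 24 fps, 32x32x8 pixels per token.
grid = token_grid(768, 512, 5 * 24)
attention_pairs = grid.tokens ** 2  # full self-attention compares every token pair

# Hypothetical less aggressive footprint, for comparison only.
baseline = token_grid(768, 512, 5 * 24, px_per_token=(16, 16, 4))
pair_ratio = baseline.tokens ** 2 // attention_pairs

print(grid, grid.tokens, attention_pairs, baseline.tokens, pair_ratio)
```

Under these assumptions the grid is 24 × 16 × 15 = 5,760 tokens. The hypothetical baseline has 8 times as many tokens and therefore 64 times as many attention pairs, which is the sense in which the compressed latent space makes full attention affordable.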

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoder-denoising integration might extend to longer video clips, although with full self-attention the transformer's cost grows quadratically in the token count, so at a fixed compression ratio compute would still rise faster than clip length.
Real-time interactive video tools could become possible on high-end consumer hardware once the approach is adapted to lower-precision or smaller GPUs.
The high compression plus decoder finishing step may transfer to other modalities such as audio or 3D generation where separate upsampling modules currently add latency.

Load-bearing premise

The VAE decoder can perform the final denoising step at the same time as latent-to-pixel conversion without losing fine details or requiring an extra upsampling module.

What would settle it

Run the model on an H100 GPU to generate 5 seconds of 24 fps video at 768x512 resolution and measure whether the wall-clock time exceeds 2 seconds or visible detail quality falls below that of comparable models that keep denoising and upsampling separate.
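
A minimal timing harness for that test might look like the sketch below. The `generate` callable is a placeholder for whatever end-to-end pipeline is being measured, and the helper name `benchmark` is illustrative. For GPU workloads the callable is assumed to block until the device has finished (for example by calling `torch.cuda.synchronize()` before returning), otherwise the wall-clock numbers would be too optimistic.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(generate: Callable[[], object], video_seconds: float,
              runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time repeated calls to `generate` and report the real-time factor."""
    for _ in range(warmup):
        generate()  # warm-up runs are excluded from the statistics
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()  # assumed to return only once the video is fully produced
        times.append(time.perf_counter() - start)
    mean = statistics.mean(times)
    return {
        "mean_s": mean,
        "stdev_s": statistics.stdev(times) if runs > 1 else 0.0,
        # > 1.0 means faster than real time (e.g. 5 s of video in 2 s gives 2.5)
        "realtime_factor": video_seconds / mean,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the real pipeline call.
    print(benchmark(lambda: time.sleep(0.01), video_seconds=5.0, runs=3))
```

Reporting the standard deviation alongside the mean gives the multiple-run statistics that a single headline number cannot.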


We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LTX-Video reaches faster-than-real-time video generation by moving patchifying into the VAE and letting the decoder finish denoising in pixel space, but the abstract alone offers no evidence of how that choice affects quality.


The main point is that this paper shows a tighter coupling between the video VAE and the diffusion transformer than usual. The authors move the patchifying step from the transformer input to the VAE input, reach 1:192 spatiotemporal compression with each token covering 32x32x8 pixels, and assign the decoder the job of both converting latents to pixels and performing the final denoising pass. That setup lets the transformer run full attention cheaply. The authors report 2 seconds to generate 5 seconds of 768x512 24 fps video on an H100, which is faster than real time, and they claim to outperform prior models of similar size. The code and weights are public, which is the strongest practical contribution here.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LTX-Video, a transformer-based latent diffusion model for video generation that integrates the Video-VAE and denoising transformer. It employs a high-compression VAE (1:192 ratio, with each latent token covering 32×32×8 pixels) by relocating patchification to the VAE input, enabling efficient full spatiotemporal self-attention in the transformer. The VAE decoder is assigned the dual role of latent-to-pixel conversion and final denoising directly in pixel space to recover fine details without extra upsampling modules. The model supports simultaneous text-to-video and image-to-video training and reports faster-than-real-time performance: 5 seconds of 24 fps video at 768×512 resolution generated in 2 seconds on an Nvidia H100 GPU, outperforming prior models of similar scale. Source code and pretrained models are released publicly.

Significance. If the central claims hold, the work offers a meaningful advance in realtime video generation by demonstrating that aggressive latent compression combined with decoder-level final denoising can deliver substantial speedups without separate upsampling stages. The public release of code and models is a clear strength that supports reproducibility and further research. The approach could influence efficient video diffusion architectures if the quality preservation at 1:192 compression is validated.

major comments (2)

[§3.2, VAE Decoder Design] The claim that the decoder jointly performs latent-to-pixel conversion and the final denoising step is load-bearing for both the efficiency and quality assertions, yet the manuscript supplies no explicit loss formulation, training objective, or ablation study demonstrating that a single decoder pass suffices to replace additional latent denoising iterations while preserving spatiotemporal details at 768×512/24 fps. Without these details, it is unclear whether the reported 2-second inference time is achieved without hidden quality degradation or extra post-processing.
[§5, Experiments and Results] The quantitative performance claims (2-second generation time, outperforming all similar-scale models) are presented without error bars, multiple-run statistics, or full comparison tables including standard metrics such as FVD or temporal consistency scores against baselines run on identical hardware. This weakens verification of the central faster-than-real-time claim.

minor comments (2)

[Abstract and §3.1] The abstract and §3.1 use the compression ratio 1:192 without an accompanying equation or explicit token-dimension breakdown (e.g., relating 32×32×8 to the input resolution), which would aid clarity.
[Figures in §5] Figure captions and axis labels in the results section could more explicitly state the exact resolution, frame count, and hardware used for each timing measurement.
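
The token-dimension breakdown requested in the first minor comment can be sketched as follows. This is an inference from the stated numbers, not an equation taken from the paper: it assumes the 1:192 ratio compares raw input values (with C_in = 3 RGB channels) against latent channel values per token, and the symbols C_in and C_lat are introduced here for illustration.

```latex
\[
\text{input values per token} = 32 \times 32 \times 8 \times C_{\text{in}} = 8192\, C_{\text{in}},
\qquad
\text{compression} = \frac{8192\, C_{\text{in}}}{C_{\text{lat}}}
\]
\[
C_{\text{in}} = 3
\;\Rightarrow\;
\frac{24576}{C_{\text{lat}}} = 192
\;\Rightarrow\;
C_{\text{lat}} = 128
\]
```

Under that reading, a 1:192 ratio with a 32×32×8 footprint would correspond to 128 latent channels per token; the abstract itself does not state the channel count.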

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the manuscript's clarity and rigor without altering the core claims.


Referee: [§3.2, VAE Decoder Design] The claim that the decoder jointly performs latent-to-pixel conversion and the final denoising step is load-bearing for both the efficiency and quality assertions, yet the manuscript supplies no explicit loss formulation, training objective, or ablation study demonstrating that a single decoder pass suffices to replace additional latent denoising iterations while preserving spatiotemporal details at 768×512/24 fps. Without these details, it is unclear whether the reported 2-second inference time is achieved without hidden quality degradation or extra post-processing.

Authors: We acknowledge that Section 3.2 describes the decoder's dual role at a high level but does not provide an explicit loss formulation or ablation. In the revised manuscript we will add the precise training objective (including how the VAE reconstruction loss interacts with the latent diffusion process) and an ablation comparing single-pass decoder output against additional latent-space denoising iterations. The ablation will report spatiotemporal fidelity metrics at 768×512 resolution to confirm that the single decoder pass recovers fine details without quality degradation or extra post-processing steps. This will directly support the efficiency and quality claims.
revision: yes

Referee: [§5, Experiments and Results] The quantitative performance claims (2-second generation time, outperforming all similar-scale models) are presented without error bars, multiple-run statistics, or full comparison tables including standard metrics such as FVD or temporal consistency scores against baselines run on identical hardware. This weakens verification of the central faster-than-real-time claim.

Authors: We agree that the current presentation lacks statistical detail. In the revision we will expand the results section to include error bars and standard deviations from multiple runs (different random seeds) for the reported inference times. The comparison tables will be augmented with FVD and temporal consistency metrics (e.g., optical-flow-based consistency and video-level CLIP similarity) for both our model and the baselines. All timing and quality measurements will be re-run on the same Nvidia H100 hardware to ensure direct comparability. The 2-second figure is the measured wall-clock time for the full end-to-end pipeline; we will document the exact measurement protocol.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces an architectural design choice for a Video-VAE with 1:192 compression and assigns the decoder a dual role in latent-to-pixel conversion plus final denoising. Performance claims (e.g., 5s video in 2s on H100) are presented as measured empirical outcomes on hardware, not as outputs of any equations or predictions that reduce to fitted parameters or self-referential definitions within the paper. No load-bearing derivations, self-citations, or ansatzes are shown that would make central results tautological by construction. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard latent diffusion assumptions plus two architecture-specific choices: that full spatiotemporal attention remains tractable at 1:192 compression and that the VAE decoder can absorb the final denoising without introducing artifacts.

axioms (2)

domain assumption: Full spatiotemporal self-attention in the transformer is essential for temporal consistency at high resolution.
Invoked in the abstract to justify operating in the compressed latent space.
ad hoc to paper: The VAE decoder can jointly perform upsampling and final denoising without quality degradation.
Central to the claim that no separate upsampling module is needed.

pith-pipeline@v0.9.0 · 5633 in / 1365 out tokens · 49808 ms · 2026-05-11T10:30:01.060354+00:00 · methodology


Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
cs.CV 2026-04 unverdicted novelty 8.0

ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
Asymmetric Flow Models
cs.CV 2026-05 unverdicted novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
cs.CV 2026-05 unverdicted novelty 7.0

GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
Relative Score Policy Optimization for Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
cs.CV 2026-05 unverdicted novelty 7.0

WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
cs.RO 2026-05 unverdicted novelty 7.0

Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
cs.CV 2026-04 unverdicted novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
cs.CV 2026-04 unverdicted novelty 7.0

GenLCA enables scalable training of a 3D diffusion model for photorealistic, animatable full-body avatars by tokenizing large-scale real-world videos with a pretrained reconstructor and applying visibility-aware diffu...
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
cs.GR 2026-04 unverdicted novelty 7.0

MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV 2026-03 unverdicted novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
cs.CV 2026-03 unverdicted novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
Qwen-Image-VAE-2.0 Technical Report
cs.CV 2026-05 unverdicted novelty 6.0

Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
cs.CV 2026-05 conditional novelty 6.0

WorldJen is a multi-dimensional video generation benchmark using VLM-graded Likert questionnaires on joint prompts, validated to match human three-tier rankings.
Leveraging Verifier-Based Reinforcement Learning in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models
cs.CV 2026-04 unverdicted novelty 6.0

ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips with an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism...
HDR Video Generation via Latent Alignment with Logarithmic Encoding
cs.CV 2026-04 unverdicted novelty 6.0

HDR video generation is achieved by logarithmically encoding HDR imagery to align with pretrained generative model latents, enabling minimal fine-tuning and degradation-based inference of missing content.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
Latent-Compressed Variational Autoencoder for Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
cs.CV 2026-04 unverdicted novelty 6.0

VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
cs.CV 2026-04 unverdicted novelty 6.0

A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
LPM 1.0: Video-based Character Performance Model
cs.CV 2026-04 unverdicted novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity...
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
cs.CV 2026-04 unverdicted novelty 6.0

Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer ...
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
cs.LG 2026-03 unverdicted novelty 6.0

WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.
LongLive: Real-time Interactive Long Video Generation
cs.CV 2025-09 conditional novelty 6.0

LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV 2025-06 unverdicted novelty 6.0

Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models
cs.RO 2026-05 unverdicted novelty 5.0

No tested generative video model achieves low rollout error, non-divergent temporal error, and real-time inference simultaneously for zero-shot predictive display in CARLA-simulated teleoperation.
Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication
eess.SP 2026-04 unverdicted novelty 5.0

A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with ...
Controllable Video Object Insertion via Multiview Priors
cs.CV 2026-04 unverdicted novelty 5.0

A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
Motif-Video 2B: Technical Report
cs.CV 2026-04 unverdicted novelty 5.0

Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
LTX-2: Efficient Joint Audio-Visual Foundation Model
cs.CV 2026-01 conditional novelty 5.0

LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
World Simulation with Video Foundation Models for Physical AI
cs.CV 2025-10 unverdicted novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 50 Pith papers · 10 internal anchors

[1]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024. URL https://openai. com/research/video-generation-models-as-world-simulators, 3, 2024

work page 2024
[2]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review arXiv 2024
[3]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Open-sora-plan.GitHub, 2024

PKU-Yuan Lab and Tuzhan AI. Open-sora-plan.GitHub, 2024. https://doi.org/10.5281/ zenodo.10948109

work page 2024
[5]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

work page arXiv 2024
[6]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models, 2024

work page 2024
[7]

Pixel-space post-training of latent diffusion models

Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, and Jialiang Wang. Pixel-space post-training of latent diffusion models. arXiv preprint arXiv:2409.17565, 2024

work page arXiv 2024
[8]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review arXiv 2023
[9]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023
[10]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

work page 2024
[11]

Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023

work page 2023
[12]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[13]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Principal component analysis for special types of data

Ian T Jolliffe. Principal component analysis for special types of data. Springer, 2002

work page 2002
[15]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[16]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

work page 2018
[17]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[19]

Fit: Flexible vision transformer for diffusion model

Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model. arXiv preprint arXiv:2402.12376, 2024

work page arXiv 2024
[20]

Large diffusion transformer

Alpha VLLM. Large diffusion transformer. GitHub, 2024

work page 2024
[21]

Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[22]

Round and round we go! what makes rotary positional encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. Round and round we go! what makes rotary positional encodings useful? arXiv preprint arXiv:2410.06205, 2024

work page arXiv 2024
[23]

Scaling vision transformers to gigapixel images via hierarchical self-supervised learning

Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16144–16155, 2022

work page 2022
[24]

Layer normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ArXiv e-prints, pages arXiv–1607, 2016

work page 2016
[25]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

work page 2022
[27]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

work page 2023
[29]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[30]

Black Forest Labs. Flux.1. GitHub, 2024. https://github.com/black-forest-labs/flux

work page 2024
[31]

Auraflow v0.1, an open exploration of large rectified flow models

fofr. Auraflow v0.1, an open exploration of large rectified flow models. GitHub, 2024. https://github.com/fofr/cog-aura-flow

work page 2024
[32]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Conditional image-to-video generation with latent flow diffusion models

Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18444–18455, 2023

work page 2023
[34]

I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models

Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023

work page arXiv 2023
[35]

Open-sora

HPC-AI Tech. Open-sora. GitHub, 2024. https://github.com/hpcaitech/Open-Sora

work page 2024
[36]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Signature verification using a ’siamese’ time delay neural network

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a ’siamese’ time delay neural network. In Advances in Neural Information Processing Systems (NeurIPS), volume 6, pages 737–744, 1993

work page 1993