super hub Canonical reference

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Jiayan Teng, Jiazheng Xu, Ming Ding, Shiyu Huang, Wendi Zheng, Zhuoyi Yang · 2024 · cs.CV · arXiv 2408.06072

Canonical reference. 76% of citing Pith papers cite this work as background.

239 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 239 citing papers more from Jiayan Teng arXiv PDF

abstract

We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration, different shape videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of both 3D Causal VAE, Video caption model and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 61 method 9 baseline 7 dataset 1

citation-polarity summary

background 59 use method 9 baseline 7 unclear 2 use dataset 1

claims ledger

abstract We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and v

authors

Jiayan Teng Jiazheng Xu Ming Ding Shiyu Huang Wendi Zheng Zhuoyi Yang

co-cited works

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

cs.CV · 2026-05-14 · conditional · novelty 7.0

EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.

citing papers explorer

Showing 50 of 54 citing papers after filters.

LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents cs.CV · 2025-12-19 · unverdicted · none · ref 57 · internal anchor
LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.
Setting the Stage: Text-Driven Scene-Consistent Image Generation cs.CV · 2025-12-14 · conditional · none · ref 44 · internal anchor
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
VABench: A Comprehensive Benchmark for Audio-Video Generation cs.CV · 2025-12-10 · unverdicted · none · ref 56 · internal anchor
VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.
GimbalDiffusion: Gravity-Aware Camera Control for Video Generation cs.CV · 2025-12-09 · conditional · none · ref 41 · internal anchor
GimbalDiffusion adds gravity-referenced absolute camera control and null-pitch conditioning to text-to-video diffusion models, trained on full-sphere panoramic data, to support extreme trajectories and reduce prompt entanglement.
VideoCoF: Unified Video Editing with Temporal Reasoner cs.CV · 2025-12-08 · unverdicted · none · ref 42 · internal anchor
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer cs.CV · 2025-11-28 · unverdicted · none · ref 56 · internal anchor
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
ASTRA: Let Arbitrary Subjects Transform in Video Editing cs.CV · 2025-10-01 · unverdicted · none · ref 26 · internal anchor
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing cs.CV · 2025-09-27 · unverdicted · none · ref 10 · internal anchor
Vid-Freeze immunizes images by adding perturbations that target attention dynamics in I2V models to enforce temporal freezing and suppress motion synthesis.
CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion cs.CV · 2025-09-24 · unverdicted · none · ref 7 · internal anchor
CamPVG is the first diffusion-based framework for generating geometrically consistent panoramic videos from camera pose inputs using a panoramic Plücker embedding and spherical epipolar attention module.
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation cs.CV · 2025-05-24 · unverdicted · none · ref 3 · internal anchor
SVG2 accelerates DiT video generation via semantic-aware token permutation with k-means, achieving up to 2.3x speedup and PSNR of 30 while fixing position-based clustering and scattered-token waste.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models cs.RO · 2025-05-19 · unverdicted · none · ref 8 · internal anchor
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
Beyond the Frame: Generating 360 Panoramic Videos from Perspective Videos cs.CV · 2025-04-10 · unverdicted · none · ref 58 · internal anchor
A generative model produces realistic and coherent 360 panoramic videos from in-the-wild perspective videos via curated online data and geometry-motion aware operations.
Stitch-a-Demo: Video Demonstrations from Multistep Descriptions cs.CV · 2025-03-18 · unverdicted · none · ref 73 · internal anchor
Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.
History-Guided Video Diffusion cs.LG · 2025-02-10 · unverdicted · none · ref 64 · internal anchor
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
DriveLaW:Unifying Planning and Video Generation in a Latent Driving World cs.CV · 2025-12-29 · unverdicted · none · ref 78 · internal anchor
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure cs.CV · 2025-12-25 · unverdicted · none · ref 45 · internal anchor
GeCo is a new geometry-based metric that produces dense maps of motion and structure inconsistencies in video generation by fusing residual motion and depth priors.
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection cs.CV · 2025-12-23 · unverdicted · none · ref 42 · internal anchor
KeyTailor improves video virtual try-on realism by using instruction-guided keyframes to enhance garment details and background integrity in DiT models without major architectural changes.
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning cs.CV · 2025-12-17 · unverdicted · none · ref 81 · internal anchor
Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification cs.CV · 2025-12-10 · unverdicted · none · ref 91 · internal anchor
VHOI densifies sparse trajectories into color-encoded HOI mask sequences and conditions a fine-tuned video diffusion model on them to produce controllable human-object interaction videos, including full navigation sequences.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation cs.CV · 2025-12-04 · conditional · none · ref 83 · internal anchor
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models cs.CV · 2025-12-01 · conditional · none · ref 55 · internal anchor
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
IGen: Scalable Data Generation for Robot Learning from Open-World Images cs.RO · 2025-12-01 · unverdicted · none · ref 64 · internal anchor
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
SURF: Signature-Retained Fast Video Generation cs.GR · 2025-11-25 · unverdicted · none · ref 43 · internal anchor
SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on cs.CV · 2025-11-24 · unverdicted · none · ref 58 · internal anchor
A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models cs.CV · 2025-11-01 · unverdicted · none · ref 95 · internal anchor
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 26 · internal anchor
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 67 · internal anchor
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training cs.RO · 2025-09-29 · unverdicted · none · ref 19 · internal anchor
World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation cs.RO · 2025-08-07 · unverdicted · none · ref 28 · internal anchor
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV · 2025-07-10 · unverdicted · none · ref 84 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching cs.CV · 2025-06-30 · unverdicted · none · ref 25 · internal anchor
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
Test-Time Training Done Right cs.LG · 2025-05-29 · conditional · none · ref 53 · internal anchor
Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning cs.LG · 2025-05-22 · conditional · none · ref 52 · internal anchor
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback cs.CV · 2025-04-24 · unverdicted · none · ref 90 · internal anchor
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute cs.CV · 2025-04-23 · unverdicted · none · ref 54 · internal anchor
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
SkyReels-V2: Infinite-length Film Generative Model cs.CV · 2025-04-17 · unverdicted · none · ref 17 · internal anchor
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 59 · internal anchor
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Long-Context Autoregressive Video Modeling with Next-Frame Prediction cs.CV · 2025-03-25 · unverdicted · none · ref 11 · internal anchor
FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 97 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Improving Video Generation with Human Feedback cs.CV · 2025-01-23 · unverdicted · none · ref 79 · internal anchor
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models cs.CV · 2025-11-23 · unverdicted · none · ref 56 · internal anchor
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution cs.CV · 2025-09-28 · unverdicted · none · ref 18 · internal anchor
OASIS reduces redundancy in diffusion models for real-world video super-resolution via attention specialization routing and progressive training, delivering state-of-the-art quality with 6.2x faster inference than prior one-step baselines.
Matrix-game 2.0: An open-source real-time and streaming interactive world model cs.CV · 2025-08-18 · unverdicted · none · ref 50 · internal anchor
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.
Geometry-aware 4D Video Generation for Robot Manipulation cs.CV · 2025-07-01 · unverdicted · none · ref 14 · internal anchor
A geometry-aware 4D video generation model trained with cross-view pointmap alignment to produce spatio-temporally consistent future videos from novel viewpoints for robot manipulation.
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation cs.CV · 2025-06-30 · unverdicted · none · ref 96 · internal anchor
SynMotion combines disentangled semantic embeddings, parameter-efficient motion adapters, and alternate subject-motion training on a new SPV dataset to improve motion customization in text-to-video and image-to-video generation.
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment cs.RO · 2025-04-22 · unverdicted · none · ref 71 · internal anchor
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 91 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 136 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CV · 2025-06-10 · unverdicted · none · ref 30 · internal anchor
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 64 · internal anchor
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer