CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Jiayan Teng, Jiazheng Xu, Ming Ding, Shiyu Huang, Wendi Zheng, Zhuoyi Yang · 2024 · cs.CV · arXiv 2408.06072

Canonical reference. 299 Pith papers cite this work, and 76% of classified citations cite it as background.

abstract

We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels. Previous video generation models often had limited movement and short durations, and found it difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, improving both compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration videos of different shapes characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, which contributes greatly to generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video caption model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.

citation-role summary

background: 61 · method: 9 · baseline: 7 · dataset: 1

citation-polarity summary

background: 59 · use method: 9 · baseline: 7 · unclear: 2 · use dataset: 1
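
As a quick check on the 76% background figure, the sketch below computes each label's share of the total from the counts in the two summaries above. It assumes the headline percentage is derived from one of these tallies; the page does not say how the figure is computed. The helper name `shares` is hypothetical and used only for illustration.

```python
# Counts copied from the citation-role and citation-polarity summaries above.
role_counts = {"background": 61, "method": 9, "baseline": 7, "dataset": 1}
polarity_counts = {
    "background": 59,
    "use method": 9,
    "baseline": 7,
    "unclear": 2,
    "use dataset": 1,
}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(sum(role_counts.values()), sum(polarity_counts.values()))  # 78 78
print(role_shares["background"])      # 78
print(polarity_shares["background"])  # 76
```

Both tallies cover 78 classified citations. Under this assumption, the 76% figure matches the polarity counts (59 of 78) rather than the role counts (61 of 78).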

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance as the largest dance video dataset and OmniDance framework for unified text-music multimodal dance video generation achieving SOTA on TI2V, MI2V, and MTI2V tasks.

OmniTryOn: Video Try-On Anything at Once!

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Ultra-Fast Neural Video Compression

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

DCVC-UF uses chunk-based joint encoding and parallel frame-specific decoding to deliver ultra-fast neural video compression while claiming new state-of-the-art rate-distortion performance.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

citing papers explorer

Showing 50 of 299 citing papers.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation cs.CV · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 81 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption cs.CV · 2026-07-01 · unverdicted · none · ref 45 · internal anchor
ISPA reduces KV cache size by up to 50% in AR video models by transitioning layers to local attention and applying instance-specific least-squares weight modulation to compensate for lost history.
MemLearner: Learning to Query Context memory for Video World Models cs.CV · 2026-06-30 · unverdicted · none · ref 62 · internal anchor
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data cs.CV · 2026-06-29 · unverdicted · none · ref 47 · internal anchor
Introduces CIPE-Dance as the largest dance video dataset and OmniDance framework for unified text-music multimodal dance video generation achieving SOTA on TI2V, MI2V, and MTI2V tasks.
OmniTryOn: Video Try-On Anything at Once! cs.CV · 2026-06-07 · unverdicted · none · ref 57 · internal anchor
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control cs.LG · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
Ultra-Fast Neural Video Compression cs.CV · 2026-06-03 · unverdicted · none · ref 69 · internal anchor
DCVC-UF uses chunk-based joint encoding and parallel frame-specific decoding to deliver ultra-fast neural video compression while claiming new state-of-the-art rate-distortion performance.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 7 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
From Zero to Hero: Training-Free Custom Concept Spawning in World Models cs.CV · 2026-06-01 · unverdicted · none · ref 36 · internal anchor
SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization cs.CV · 2026-06-01 · unverdicted · none · ref 25 · 2 links · internal anchor
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation cs.CV · 2026-06-01 · unverdicted · none · ref 52 · internal anchor
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation cs.CV · 2026-05-31 · unverdicted · none · ref 60 · internal anchor
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models cs.CV · 2026-05-30 · unverdicted · none · ref 88 · internal anchor
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction cs.CV · 2026-05-29 · unverdicted · none · ref 105 · internal anchor
C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution cs.CV · 2026-05-28 · unverdicted · none · ref 40 · internal anchor
Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective cs.CV · 2026-05-28 · unverdicted · none · ref 124 · internal anchor
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
What-If World: A Causal Benchmark for General World Models in Embodied Scenarios cs.CV · 2026-05-26 · unverdicted · none · ref 73 · internal anchor
What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation cs.CV · 2026-05-25 · unverdicted · none · ref 29 · internal anchor
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
DeltaCam: Differential Intrinsic Camera Modeling for Video Generation cs.CV · 2026-05-24 · unverdicted · none · ref 40 · internal anchor
DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.
World Models as Group Actions cs.CV · 2026-05-23 · unverdicted · none · ref 11 · internal anchor
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 22 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models cs.CV · 2026-05-22 · unverdicted · none · ref 51 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration cs.CV · 2026-05-21 · unverdicted · none · ref 55 · internal anchor
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration cs.CV · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance cs.CV · 2026-05-20 · unverdicted · none · ref 94 · internal anchor
iTryOn is a diffusion-based framework that adds spatial 3D hand guidance and semantic action-aware embeddings to handle complex garment deformations during human-clothing interactions in videos.
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning cs.CV · 2026-05-20 · unverdicted · none · ref 35 · internal anchor
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls cs.CV · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
StreamingEffect: Real-Time Human-Centric Video Effect Generation cs.CV · 2026-05-16 · unverdicted · none · ref 70 · internal anchor
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV · 2026-05-15 · unverdicted · none · ref 27 · internal anchor
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation cs.RO · 2026-05-15 · unverdicted · none · ref 45 · internal anchor
WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus zero
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation cs.CV · 2026-05-14 · conditional · none · ref 21 · internal anchor
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
MechVerse: Evaluating Physical Motion Consistency in Video Generation Models cs.CV · 2026-05-14 · unverdicted · none · ref 47 · internal anchor
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
Probing into Camera Control of Video Models cs.CV · 2026-05-14 · unverdicted · none · ref 53 · internal anchor
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention cs.CV · 2026-05-14 · unverdicted · none · ref 39 · internal anchor
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 54 · internal anchor
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 84 · internal anchor
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics cs.CV · 2026-05-12 · unverdicted · none · ref 40 · 2 links · internal anchor
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 99 · internal anchor
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV · 2026-05-09 · unverdicted · none · ref 34 · 2 links · internal anchor
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos cs.CV · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation cs.CV · 2026-05-07 · unverdicted · none · ref 45 · internal anchor
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping cs.CV · 2026-05-06 · unverdicted · none · ref 28 · internal anchor
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CV · 2026-05-05 · unverdicted · none · ref 36 · internal anchor
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 6 · 3 links · internal anchor
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 54 · 2 links · internal anchor
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 17 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal cs.CV · 2026-04-30 · unverdicted · none · ref 24 · internal anchor
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer cs.CV · 2026-04-27 · unverdicted · none · ref 42 · internal anchor
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.