CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3
Recognition: 2 theorem links
The pith
CogVideoX generates coherent 10-second text-to-video clips at 16 fps and 768x1360 resolution using a diffusion transformer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CogVideoX is a diffusion transformer trained to generate 10-second continuous videos at 16 frames per second and 768 by 1360 pixels that remain aligned with the input text prompt. The model achieves this through a 3D causal VAE for joint spatiotemporal compression, an expert transformer equipped with adaptive LayerNorm layers to deepen text-video interaction, progressive training schedules, multi-resolution frame packing, and a dedicated text-video data preprocessing and captioning pipeline. These elements together yield state-of-the-art results on machine benchmarks and human evaluations for motion quality, duration, and semantic fidelity.
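A minimal PyTorch sketch of what the 3D causal VAE stage does to a clip before diffusion: a causal 3D convolution pads only toward the past in time, and stacking a few strided layers compresses the video jointly in time and space. The roughly 4x temporal and 8x spatial ratios, channel counts, and three-layer encoder are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D conv padded only to the past along time, so frame t never sees frame t+1."""
    def __init__(self, c_in, c_out, kernel=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        self.kt, self.kh, self.kw = kernel
        self.conv = nn.Conv3d(c_in, c_out, kernel, stride=stride, padding=0)

    def forward(self, x):                       # x: [B, C, T, H, W]
        ph, pw = self.kh // 2, self.kw // 2
        # symmetric spatial padding, left-only (causal) temporal padding
        x = F.pad(x, (pw, pw, ph, ph, self.kt - 1, 0))
        return self.conv(x)

video = torch.randn(1, 3, 17, 256, 256)          # 17 RGB frames at 256x256
encoder = nn.Sequential(                          # toy encoder: ~4x time, 8x space
    CausalConv3d(3, 32, stride=(1, 2, 2)),
    CausalConv3d(32, 64, stride=(2, 2, 2)),
    CausalConv3d(64, 16, stride=(2, 2, 2)),
)
latents = encoder(video)
print(latents.shape)   # compressed latent grid the diffusion transformer denoises
```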
What carries the argument
The expert transformer with expert adaptive LayerNorm, which performs deep cross-modal fusion between text embeddings and video latents inside the diffusion denoising process.
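A hedged sketch of the expert adaptive LayerNorm idea as stated here: text and video tokens sit in one sequence and attend to each other with full self-attention, but each modality receives its own timestep-conditioned scale and shift. The hidden size, the two-linear "expert" split, and the single-block layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ExpertAdaLNBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # one modulation "expert" per modality, driven by the timestep embedding
        self.mod_text = nn.Linear(dim, 2 * dim)
        self.mod_video = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tok, video_tok, t_emb):
        s_t, b_t = self.mod_text(t_emb).chunk(2, dim=-1)   # scale/shift for text
        s_v, b_v = self.mod_video(t_emb).chunk(2, dim=-1)  # scale/shift for video
        t = self.norm(text_tok) * (1 + s_t[:, None]) + b_t[:, None]
        v = self.norm(video_tok) * (1 + s_v[:, None]) + b_v[:, None]
        x = torch.cat([t, v], dim=1)                        # one joint sequence
        out, _ = self.attn(x, x, x)                          # full self-attention
        return out[:, : text_tok.size(1)], out[:, text_tok.size(1):]

block = ExpertAdaLNBlock()
text = torch.randn(2, 16, 512)     # [batch, text tokens, dim]
video = torch.randn(2, 128, 512)   # [batch, video latent tokens, dim]
t_emb = torch.randn(2, 512)        # diffusion timestep embedding
text_out, video_out = block(text, video, t_emb)
```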
Load-bearing premise
The reported gains in video length, motion, and text alignment result from the specific combination of 3D VAE, expert adaptive LayerNorm, progressive training, and data pipeline rather than from model scale or data volume alone.
What would settle it
A controlled ablation that trains an otherwise identical diffusion transformer on the same dataset and scale but removes the 3D VAE and expert adaptive LayerNorm, then measures whether it matches the original model's machine metrics and human preference scores.
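For concreteness, a sketch of the ablation grid such a study would enumerate. The three toggles and the metric names follow this review; holding parameter count and training data fixed across the eight runs is the controlled part. Training and scoring are omitted because they depend on infrastructure not described here.

```python
from itertools import product

ABLATIONS = {
    "3d_causal_vae":     [True, False],   # vs. a per-frame 2D VAE at matched latent size
    "expert_adaln":      [True, False],   # vs. standard cross-attention conditioning
    "progressive_train": [True, False],   # vs. one fixed-resolution training schedule
}
METRICS = ("FVD", "CLIP-T", "VBench", "human preference")

# Eight runs, identical parameter budget and data; only the toggled components
# differ. Each configuration would be trained and then scored on METRICS.
for combo in product(*ABLATIONS.values()):
    config = dict(zip(ABLATIONS, combo))
    print(config)
```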
Original abstract
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration, different shape videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of both 3D Causal VAE, Video caption model and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CogVideoX, a diffusion-transformer text-to-video model that generates 10-second videos at 16 fps and 768×1360 resolution. It introduces a 3D causal VAE for spatio-temporal compression, an expert transformer using adaptive LayerNorm for text-video fusion, progressive training with multi-resolution frame packing, and a custom text-video data pipeline. The authors claim these components enable coherent long-duration videos with significant motion and strong text alignment, achieving state-of-the-art results on machine metrics and human evaluations, with public release of the 3D VAE, caption model, and main model weights.
Significance. If the performance claims are substantiated, the work advances text-to-video generation by extending duration and motion coherence while maintaining alignment at high resolution. The public release of model weights and components is a clear strength that supports reproducibility and downstream research. The engineering focus on 3D compression, modality fusion, and training schedule could inform subsequent diffusion-transformer video models.
major comments (2)
- [Experiments] Experiments section: The manuscript asserts SOTA performance across machine metrics and human evaluations but provides no quantitative baselines, ablation studies, or error analysis. To support the central claim that the 3D VAE, expert transformer with adaptive LayerNorm, and progressive training (rather than model scale or dataset size) are responsible for the gains in duration, motion, and alignment, controlled ablations that hold total capacity and data fixed while toggling each component are required.
- [§3] §3 (Method), expert transformer description: The adaptive LayerNorm mechanism for deep text-video fusion is presented as a key innovation, yet the text does not include a direct comparison (e.g., parameter count, attention maps, or ablation against standard cross-attention) to prior video diffusion transformers, leaving the incremental contribution unclear.
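One cheap way to start on that comparison is a back-of-the-envelope parameter count. The sketch below assumes a DiT-style modulation head (a linear layer emitting six scale/shift/gate vectors) duplicated once per modality, versus a single cross-attention module with four d-by-d projections; the factor of six and the block layout are assumptions, not the paper's actual design.

```python
def adaln_expert_params(d, n_modalities=2, n_chunks=6):
    # each expert: Linear(d -> n_chunks * d) for shift/scale/gate vectors (plus bias)
    return n_modalities * (d * n_chunks * d + n_chunks * d)

def cross_attention_params(d):
    # query, key, value, and output projections, each d x d (biases ignored)
    return 4 * d * d

for d in (1024, 2048, 3072):
    print(f"d={d}: adaLN experts {adaln_expert_params(d):,} vs cross-attn {cross_attention_params(d):,}")
```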
minor comments (2)
- [Abstract] Abstract and §4: The phrase 'multiple machine metrics' is used without naming the specific metrics (e.g., FVD, CLIP-T, VBench) or reporting numerical values and comparisons in the provided text.
- The data-processing pipeline is described at a high level; additional details on captioning model architecture, filtering criteria, and dataset statistics would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the opportunity to improve our manuscript. We address each major comment below and outline the revisions we will make.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The manuscript asserts SOTA performance across machine metrics and human evaluations but provides no quantitative baselines, ablation studies, or error analysis. To support the central claim that the 3D VAE, expert transformer with adaptive LayerNorm, and progressive training (rather than model scale or dataset size) are responsible for the gains in duration, motion, and alignment, controlled ablations that hold total capacity and data fixed while toggling each component are required.
Authors: We appreciate this feedback. While the manuscript does include comparisons to existing methods showing SOTA results on various metrics and human studies, we agree that additional ablations and error analysis would further strengthen the paper. In the revised manuscript, we will incorporate quantitative baselines, ablation studies on the proposed components (holding capacity and data as fixed as possible), and error analysis to better support our claims about the contributions of the 3D VAE, expert transformer, and progressive training. revision: yes
-
Referee: [§3] §3 (Method), expert transformer description: The adaptive LayerNorm mechanism for deep text-video fusion is presented as a key innovation, yet the text does not include a direct comparison (e.g., parameter count, attention maps, or ablation against standard cross-attention) to prior video diffusion transformers, leaving the incremental contribution unclear.
Authors: We thank the referee for this suggestion. To clarify the incremental contribution of the expert adaptive LayerNorm, we will add a direct comparison in the revised §3, including parameter counts relative to standard cross-attention in prior models, and where possible, ablation results or attention visualizations demonstrating improved text-video fusion. revision: yes
Circularity Check
No circularity: empirical engineering claims rest on measured outcomes, not self-referential derivations
full rationale
The paper proposes concrete architectural and training choices (3D causal VAE for spatio-temporal compression, expert transformer with adaptive LayerNorm, progressive multi-resolution training, and a custom data pipeline) and reports their empirical effects on video duration, motion coherence, and text alignment. These are presented as engineering innovations validated by machine metrics, human evaluations, and public model release. No equations, first-principles derivations, or predictions appear in the manuscript that reduce any central claim to a fitted parameter, self-defined quantity, or self-citation chain. The performance results are therefore not tautological; they remain open to external verification or refutation via ablations and scaling studies.
Axiom & Free-Parameter Ledger
free parameters (1)
- Model scale, learning rates, and training schedule
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
Linked passage: "CogVideoX demonstrates state-of-the-art performance... generating 10-second continuous videos... 16 fps and 768x1360 resolution"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes
ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
-
OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
-
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
HumanScore: Benchmarking Human Motions in Generated Videos
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...
-
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
UENR-600K is a 600,000-frame synthetic dataset for nighttime video deraining that uses 3D rain particle simulation in Unreal Engine to enable better generalization to real scenes.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
-
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
-
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...