Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen · 2024 · cs.CV · arXiv 2402.17177

Canonical reference. 85% of citing Pith papers cite this work as background.

57 Pith papers citing it

abstract

Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.

citation-role summary

background: 12 · baseline: 1

citation-polarity summary

background: 11 · baseline: 1 · unclear: 1

representative citing papers

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egocentric motion recovery.

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.

$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degradation than image-level baselines.

MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

Controllable Generative Video Compression

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.

DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized methods and 23.4% over GPT-4o on T-IC13.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

cs.RO · 2026-03-18 · conditional · novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

Chameleon: Benchmarking Detection and Backtracking on Commercial-Grade AI-Generated Videos

cs.CV · 2025-03-09 · unverdicted · novelty 7.0

Chameleon is a new benchmark of commercial-grade AI videos for detection and forensic backtracking, showing existing methods struggle with high-fidelity spatiotemporally consistent content.

A Good Talk Does not Look Like a Summary, It Teaches You! Measuring Takeaways from Paper-to-Video Talks

cs.MM · 2026-06-26 · unverdicted · novelty 6.0

EffectivePresentationScorer evaluates paper-to-video talks for instructional quality by checking clear explanation of ideas, prerequisite concepts, and links to contributions, finding that current systems cover topics but fail to teach.

Class-frequency Guided Noise Schedule for Diffusion Models

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Proposes CFRG noise schedule for diffusion models that assigns larger noises to low-frequency classes to improve generation on imbalanced datasets.

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

cs.RO · 2026-03-18 · conditional · none · ref 19 · internal anchor

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

VideoPhy: Evaluating Physical Commonsense for Video Generation

cs.CV · 2024-06-05 · conditional · none · ref 64 · internal anchor

VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.

Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

cs.CV · 2026-04-26 · conditional · none · ref 22 · internal anchor

Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.
