SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
hub Canonical reference
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Canonical reference. 85% of citing Pith papers cite this work as background.
abstract
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egocentric motion recovery.
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degradation than image-level baselines.
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized methods and 23.4% over GPT-4o on T-IC13.
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
Chameleon is a new benchmark of commercial-grade AI videos for detection and forensic backtracking, showing existing methods struggle with high-fidelity spatiotemporally consistent content.
EffectivePresentationScorer evaluates paper-to-video talks for instructional quality by checking clear explanation of ideas, prerequisite concepts, and links to contributions, finding that current systems cover topics but fail to teach.
Proposes CFRG noise schedule for diffusion models that assigns larger noises to low-frequency classes to improve generation on imbalanced datasets.
GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
citing papers explorer
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
VideoPhy: Evaluating Physical Commonsense for Video Generation
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
-
Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.