arXiv preprint arXiv:2506.15675 (2025)

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al · 2025 · arXiv 2506.15675

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

dataset 2

citation-polarity summary

baseline 1 use dataset 1

representative citing papers

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

cs.CV · 2025-12-16 · unverdicted · novelty 6.0

WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

Emu3.5: Native Multimodal Models are World Learners

cs.CV · 2025-10-30 · unverdicted · novelty 6.0

Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.

Matrix-game 2.0: An open-source real-time and streaming interactive world model

cs.CV · 2025-08-18 · unverdicted · novelty 5.0

Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.

citing papers explorer

Showing 7 of 7 citing papers.

Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 11
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer cs.CV · 2026-04-27 · unverdicted · none · ref 20
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations cs.CV · 2026-04-09 · unverdicted · none · ref 32
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling cs.CV · 2025-12-16 · unverdicted · none · ref 36
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 53
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer cs.CV · 2026-05-14 · unverdicted · none · ref 90
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
Matrix-game 2.0: An open-source real-time and streaming interactive world model cs.CV · 2025-08-18 · unverdicted · none · ref 24
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.

arXiv preprint arXiv:2506.15675 (2025)

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer