{"total":13,"items":[{"citing_arxiv_id":"2606.30292","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model","primary_cat":"cs.LG","submitted_at":"2026-06-29T13:35:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A preview system demonstrates real-time controllable world modeling at 14-15 FPS on RTX 4090 by adapting open video backbones with action pathways for keyboard/mouse control and multimodal features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02553","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:50:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02436","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry-Aware Implicit Memory for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-06-01T16:08:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01164","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends","primary_cat":"cs.CV","submitted_at":"2026-05-31T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00793","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MBench: A Comprehensive Benchmark on Memory Capability for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-30T16:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25874","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation","primary_cat":"cs.CV","submitted_at":"2026-05-25T14:01:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human 
judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18601","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:12:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Incantation is the first video world model to use per-frame natural language conditioning for simultaneous multi-entity control and concept-level cross-entity transfer in interactive video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15178","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11550","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The DAWN of World-Action Interactive Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:30:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other 
recursively, yielding strong planning and safety results on autonomous driving benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"combining video pretraining and trajectory distillation to support end-to-end planning. UniFuture [34] and HERMES [64] enforce 4D geometric constraints. Recent methods move beyond visual forecasting toward policy-aware simulation: Uni-World VLA [36] interleaves future frame prediction and trajectory planning to form a closed-loop interaction. For enhanced complexity, SGDrive [24] and Infinite-World [50] introduce hierarchical cognition and memory to scale simulations to long horizons. However, most prior DWMs still treat world prediction as a passive backdrop for planning. 5 Conclusion We introduced World-Action Interactive Models (WAIMs), a perspective in which future world states and actions are inferred as coupled variables rather than produced by decoupled pipelines."},{"citing_arxiv_id":"2606.02586","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fewer, Better Frames: A Compute-Normalized Proof of Concept for Coherence-First World-Model Rendering with Model-Guided FSR4 Frame Generation","primary_cat":"cs.GR","submitted_at":"2026-05-11T16:42:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Coherence-first rendering with 15 FPS anchors plus FSR4 upsampling to 30 FPS preserves scene geometry and identity longer than native 30 FPS generation across tested forest, sword, desert, and snow scenes, with LPIPS favoring the coherence branch.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18564","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World 
Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that responds to interactive control signals has evolved rapidly. Existing models incorporate various signals like camera controls [13,36,52,65] and action controls [7,10,12,35] to simulate future states. Recent studies have explored several essential properties [16] of interactive video world models, such as physical consistency [37,47,49,72], and long-horizon coherence [53,56,62], alongside efficient real-time generation [17,61,66,75] to enable practical deployment. With these properties, world models can serve as powerful simulators for downstream tasks like game generation [39,55], embodied AI [6,27], and autonomous driving [33,58]. 
Game video world models [44,64] control the environment and simulate player observations based on provided actions."},{"citing_arxiv_id":"2604.07209","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-08T15:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing by lifting depth to point clouds and using rendered proxy videos. This is seen in methods such as Gen3C [69], MVGenMaster [13], TrajectoryCrafter [60], and others [7, 21, 25, 48, 64, 67, 90, 97, 101, 102]. Furthermore, several training-free methods have been proposed to achieve flexible camera control [36, 38, 54, 91]. For open-ended generation and dynamic scene exploration, methods like Infinite-World [87], and CameraCtrl II [31], LingBot-World [78], Google Genie 3 [5], World Labs RTFM [86], Matrix-game 2.0 [32] target unbounded horizons. 
However, these prior methods fundamentally suffer from spatial persistence degradation due to a lack of effective memory mechanisms and explicit geometric guidance, a synthetic-to-real gap in visual statistics caused by an over-reliance on synthetic training"},{"citing_arxiv_id":"2604.04707","ref_index":138,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenWorldLib: A Unified Codebase and Definition of Advanced World Models","primary_cat":"cs.CV","submitted_at":"2026-04-06T14:19:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Rlvr-world: Training world models with reinforcement learning. Advances in Neural Information Processing Systems, 2025. [137] Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, and Mingsheng Long. Visual generation unlocks human-like reasoning through multimodal world models. arXiv preprint arXiv:2601.19834, 2026. [138] Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026. [139] Eric Xing, Mingkai Deng, Jinyu Hou, and Zhiting Hu. Critiques of world models."}],"limit":50,"offset":0}