{"total":26,"items":[{"citing_arxiv_id":"2605.22718","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldKV: Efficient World Memory with World Retrieval and Compression","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:55:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19957","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:10:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18601","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:12:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Incantation is the first video world model to use per-frame natural language conditioning for simultaneous multi-entity control and concept-level cross-entity transfer in interactive video 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13111","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:23:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16395","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:43:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09965","ref_index":67,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse","primary_cat":"cs.CV","submitted_at":"2026-05-11T04:16:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live 
inside game multiverses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[65] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025. [66] Yingchen He, Christian D Weilbach, Martyna E Wojciechowska, Yuxuan Zhang, and Frank Wood. Plaicraft: Large-scale time-aligned vision-speech-action dataset for embodied ai. arXiv preprint arXiv:2505.12707, 2025. [67] Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, and Hao Zhang. Gamearena: Evaluating LLM reasoning through live computer games. In The Thirteenth International Conference on Learning Representations, 2025. [68] Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang."},{"citing_arxiv_id":"2605.01896","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models","primary_cat":"cs.CV","submitted_at":"2026-05-03T14:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"World models enable agents to predict environmental dynamics and plan actions [11,65]. 
Recent video diffusion models [1,13,22,42,53,65] trained on large-scale datasets with structured inputs (e.g., actions and camera movements) have emerged as promising world simulators [32,51,52], with applications in autonomous driving [10,15,36], embodied intelligence [7,61], and interactive game engines [12,29,41,57]. However, existing models operate solely on 2D RGB pixels, while the physical world is inherently 3D. This limitation poses significant challenges for 3D-aware modeling and applications requiring accurate depth estimation and multi-modal information. † Project lead. ⋆ Corresponding author. arXiv:2605.01896v1 [cs.CV] 3 May 2026 2 J. Xiao et al."},{"citing_arxiv_id":"2604.21686","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldMark: A Unified Benchmark Suite for Interactive Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T13:50:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18215","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-20T13:00:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To 
enable explicit viewpoint control, recent efforts inject camera poses into pre-trained video diffusion models through conditioning modules, utilizing representations such as camera extrinsic and intrinsic parameters [49], dense Plücker ray embeddings [1,15], or learnable pose tokens [14,29,34,40,41]. Some works leverage synthetic game-engine data to train models conditioned on discrete action commands [2,10,17,42,47,56], simplifying the interface at the cost of fine-grained pose accuracy. Others [5,36,57,58] incorporate explicit 3D constraints by warping reference frames via estimated depth and target poses to guide generation. While these control signals, combined with the strong generative priors of foundation models, enable responsive user interaction and coherent exploration within short"},{"citing_arxiv_id":"2604.15911","ref_index":290,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Video Diffusion Models: Advancements and Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13036","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lyra 2.0: Explorable Generative 3D Worlds","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to 
correct temporal drift.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For instance, MotionCtrl [114] flattens per-frame camera pose matrices into vectors and injects them into intermediate feature representations of a pre-trained video diffusion model. Subsequent works [3, 4, 26, 120] adopt dense ray-based encodings using Plücker coordinates [10, 93], enabling pixel-wise camera conditioning and improved viewpoint control. Following the success of Genie 3 [5], an increasingly popular line of work [28, 50, 70, 100, 138] formulates camera control as an action-conditioning problem, where viewpoint changes are driven by discrete control signals such as keyboard inputs. To further improve geometric faithfulness, recent approaches [48, 81, 117, 129, 130, 139] introduce more structured 3D guidance signals beyond per-frame pose conditioning. These methods condition generation on renderings of estimated"},{"citing_arxiv_id":"2604.08995","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory","primary_cat":"cs.CV","submitted_at":"2026-04-10T06:00:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"isolate which design choices drive long-horizon stability versus latency. 
Accordingly, a growing body of public studies [10, 15, 16] targets controllable and interactive world models, yet simultaneously satisfying long-horizon memory consistency, high-resolution fidelity, and true real-time interaction remains rare in open works. Prior works such as Matrix-Game 2.0 [15] and HY-Gamecraft-2 [34] achieve real-time streaming interactive generation via causal autoregressive few-step diffusion, but lack memory mechanisms for stable minute-long consistency; Lingbot-World [38] improves long-horizon geometric consistency by scaling context length of the diffusion model, yet simultaneously maintaining memory capability and robust real-time streaming deployment remains challenging."},{"citing_arxiv_id":"2604.07348","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoRight: Motion Control Done Right","primary_cat":"cs.CV","submitted_at":"2026-04-08T17:59:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"You, et al. Alltracker: Efficient dense point tracking at high resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253-5262, 2025. 3, 6, 7 [31] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024. 8 [32] X. He, C. Peng, Z. Liu, B. Wang, Y. Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y. Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025. 2 [33] M. Heusel, H. 
Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium."},{"citing_arxiv_id":"2604.07209","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-08T15:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gen3C [69], MVGenMaster [13], TrajectoryCrafter [60], and others [7, 21, 25, 48, 64, 67, 90, 97, 101, 102]. Furthermore, several training-free methods have been proposed to achieve flexible camera control [36, 38, 54, 91]. For open-ended generation and dynamic scene exploration, methods like Infinite-World [87], and CameraCtrl II [31], LingBot-World [78], Google Genie 3 [5], World Labs RTFM [86], Matrix-game 2.0 [32] target unbounded horizons. 
However, these prior methods fundamentally suffer from spatial persistence degradation due to a lack of effective memory mechanisms and explicit geometric guidance, a synthetic-to-real gap in visual statistics caused by an over-reliance on synthetic training data, and insufficient control precision reflecting a deficiency in underlying spatial geometric reason-"},{"citing_arxiv_id":"2604.06425","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Computers","primary_cat":"cs.LG","submitted_at":"2026-04-07T20:01:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"All full-resolution pages are in Appendix E; below we keep clickable thumbnails at the original location for quick navigation. 11 CLIGen Visualization Thumbnails Click any thumbnail to jump to its full-resolution page in Appendix CLIGen (General) Visualizations The terminal displays a series of ANSI escape code formatted texts with changing background and foreground colors, executing commands like `\\\\u001b[48;2;255;128;128;38;2;0;0;0m` which set the background to a shade of pink and text to black, and printing numbered lists with colors. The output includes specific numbers, such as \\\"1\\\", \\\"5\\\", \\\"7\\\", and \\\"9\\\", in different colors, creating a visually dynamic and colorful display, but the exact username, hostname, and path are not specified in the provided terminal session content. 
The user types the command `CREATE TABLE posts(ID INTEGER)`, with the terminal displaying the command in a dark background with colored syntax highlighting, including green and yellow text, and the cursor moving character-by-character as the user types, with some corrections and backspacing along the way. The output shows the command being executed, with keywords like `CREATE` and `TABLE` in distinct colors, and the filename `posts` appearing in the command line. Samples A At the `root@localhost:~#` prompt, the user types the `date` command, which displays the current date and time in a plain text format as \\\"2021.10.11.22:47:43KST\\\", then begins typing the `cat` command. The terminal displaying progress bars, package names like `pillow`, `notebook`, and `tzlocal`, and version changes in green and red text. The output shows downloading and installing statuses, including percentages, for packages like `smmap`, `tomli`, and `protobuf`, with the terminal scrolling through the output rapidly. Samples B At the unspecified username@hostname prompt, the terminal displays a partition editor with a disk image file named \\\"sd.img\\\" (128MiB) and the user interacts with it, creating a new Linux partition from free space, with key output content showing partition details in a table format, including \\\"sd.img1\\\" and \\\"sd.img2\\\" with their respective sizes and types, and a new partition \\\"sd.img3\\\" with 55M size and Linux type (83). The terminal shows a mix of black"},{"citing_arxiv_id":"2604.04707","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OpenWorldLib: A Unified Codebase and Definition of Advanced World Models","primary_cat":"cs.CV","submitted_at":"2026-04-06T14:19:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":", single images or image sequences) paired with diverse interaction signals, which include textual instructions, directional movement controls (forward, backward, left, right), and camera rotation commands. As shown in Fig. 
4, we evaluate and analyze the generation performance of various methods. In the context of navigation video generation, early approaches like Matrix-Game-2 [43, 157] offer fast generation speeds but suffer from noticeable color shifting during long-horizon generation. In contrast, recent models such as Lingbot-World [116], Hunyuan-GameCraft [65, 111], and YUME-1.5 [96, 97] successfully support high-quality navigation video generation, with Hunyuan-WorldPlay [110] achieving the best overall visual performance."},{"citing_arxiv_id":"2604.02799","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UNICA: A Unified Neural Framework for Controllable 3D Avatars","primary_cat":"cs.CV","submitted_at":"2026-04-03T07:09:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, these methods operate at the skeletal level and must be combined with separate rigging and physical simulation. World Models. World models are systems designed to understand and predict the evolution of an environment given historical observations [12]. Game simulation is an ongoing research direction for world models, predicting future scenes conditioned on action inputs [7,21,29,81]. Closely related to our idea, GameNGen [64] fits an entire first-person video game with a multi-frame diffusion model conditioned on historical frames and player actions. 
State-of-the-art world models [4,17,39,43,60] have demonstrated the ability to synthesize videos across diverse domains, with some approaches including third-person human motion."},{"citing_arxiv_id":"2603.28489","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"layouts, and ego trajectories can be injected to constrain scene geometry and motion, as demonstrated in driving-oriented models such as MagicDrive-V2 [40]. In interactive world models, action signals can be represented as discrete tokens, latent actions, or control embeddings, and integrated into generation to obtain action-conditioned rollouts, as in Genie [41], Matrix-Game 2.0 [42], and Cosmos-Predict [43]. Audio conditions are typically encoded by a speech or motion encoder and used to guide temporal dynamics such as lip motion, facial expression, or speech rhythm [44]-[49]. These conditions are injected into the generative backbone through cross-attention, adaptive normalization, or token merging. 
For example, autoregressive frameworks such as iVideoGPT [50]"},{"citing_arxiv_id":"2602.02958","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization","primary_cat":"cs.LG","submitted_at":"2026-02-03T00:54:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.20540","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Open-source World Models","primary_cat":"cs.CV","submitted_at":"2026-01-28T12:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.21714","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AstraNav-World: World Model for Foresight Control and Consistency","primary_cat":"cs.CV","submitted_at":"2025-12-25T15:31:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied 
navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14614","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling","primary_cat":"cs.CV","submitted_at":"2025-12-16T17:22:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22940","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer","primary_cat":"cs.CV","submitted_at":"2025-11-28T07:30:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26782","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models","primary_cat":"cs.LG","submitted_at":"2025-10-30T17:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRWM uses temporal contrastive learning to geometrically regularize latent spaces 
in world models for high-fidelity cloning of deterministic 3D worlds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.02283","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Forcing++: Towards Minute-Scale High-Quality Video Generation","primary_cat":"cs.CV","submitted_at":"2025-10-02T17:55:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24527","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Agents Inside of Scalable World Models","primary_cat":"cs.AI","submitted_at":"2025-09-29T09:42:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}