{"total":15,"items":[{"citing_arxiv_id":"2606.30544","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent Actions from Factorized Transition Effects under Agent Ambiguity","primary_cat":"cs.AI","submitted_at":"2026-06-29T16:45:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OTF decomposes transitions into reusable primitives to form action-like latents in OTF-LAM and OTF-LAM-Dino, enabling zeroshot transfer and competitive policy learning under visual ambiguity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21139","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-06-19T06:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18558","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction","primary_cat":"cs.CV","submitted_at":"2026-06-17T00:19:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09813","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"iMaC: Translating Actions into Motion and Contact Images for Embodied World Models","primary_cat":"cs.RO","submitted_at":"2026-06-08T17:55:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"iMaC introduces image-based action tokens in a dual-branch architecture to improve future state prediction and control in embodied world models over vector-based baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04130","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization","primary_cat":"cs.RO","submitted_at":"2026-06-02T18:40:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28865","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:31:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VAE world model trained on embodied exploration develops latent representations aligned with physical geometry, with 
metrics improving together and collapsing together under high KL regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19242","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhyWorld: Physics-Faithful World Model for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T01:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15725","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiLA: Disentangled Latent Action World Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:22:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16412","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCAR: Self-Supervised Continuous Action Representation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:23:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data 
using regularization and adversarial invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20223","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Why Latent Actions Fail, and How to Prevent It","primary_cat":"cs.CV","submitted_at":"2026-05-13T09:54:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce consistency across noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06298","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"sampled at uniform intervals $\\Delta y$, it can be unambiguously reconstructed if and only if $f$ contains no spatial frequencies exceeding $\\xi_{\\mathrm{Nyquist}} = \\frac{1}{2\\Delta y}$. For a grid of $H$ pixels over $y \\in [-1, 1]$, the sampling interval is $\\Delta y = 2/H$, giving $\\xi_{\\mathrm{Nyquist}} = H/4$ periods per unit length. The standard Fourier feature encoding assigns frequency $\\xi_k = 2^{k-1}$ to band $k$ as in [Tancik et al., 2020]: $\\gamma(y)_k = \\left(\\sin(2^k \\pi y), \\cos(2^k \\pi y)\\right)$, $k \\in \\{0, 1, \\ldots, K-1\\}$, (7) or in full, $\\gamma(y) = \\left(\\sin(2^0 \\pi y), \\cos(2^0 \\pi y), \\ldots, \\sin(2^{K-1} \\pi y), \\cos(2^{K-1} \\pi y)\\right)^{\\top} \\in \\mathbb{R}^{2K}$. (8) 
The no-aliasing condition $\\xi_k < \\xi_{\\mathrm{Nyquist}}$ requires: $2^{k-1} < \\frac{H}{4} \\implies k_{\\max} < \\log_2(H) - 1$. (9) For MiniGrid ($H = 72$), $k_{\\max} < 5.17$; therefore, all $K = 6$ bands remain within the safe manifold. However, for WeatherBench ($H = 32$), $k_{\\max} < 4$, meaning that the highest-frequency bands must"},{"citing_arxiv_id":"2605.01694","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent State Design for World Models under Sufficiency Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-03T03:19:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A trajectory model qualifies as a world model only when its hidden state is sufficient for counterfactual action evaluation rather than behavior cloning on observed trajectories. State-space and Transformer architectures. Mamba-style selective state-space models offer linear-time recurrent compression, while Transformer world models such as IRIS replace recurrence with attention [21, 42]. Their appeal is efficient long-context filtering: the recurrent state update can compress extended histories without the quadratic cost of full attention. 
In latent-state terms, an SSM is attractive when its recurrent state approximates the belief statistic of Proposition 1 better than a finite context window, especially under occlusion, delayed cues, or revisitation."},{"citing_arxiv_id":"2604.22615","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GazeVLA: Learning Human Intention for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-24T14:46:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The first line of work introduces explicit 2D visual primitives, such as keypoints [4,38,60,63], bounding boxes [60], or trajectories [28,65,71], as intermediate supervision signals. The second paradigm treats humans as an alternative embodiment, either by jointly training on human and robot data [6,12,35,72,82] or by aligning behaviors through a shared latent action space [5,7,10,11,14-16,24,46,59,70,73,77]. A third line of research leverages human data to learn general visual representations [48-50] or predictive world 
[Figure 2 panel labels: Background, Coor. Alignment, Actions, Objects] Fig. 2: We curate a large-scale egocentric dataset from diverse sources, containing both hand and gaze annotations with masks indicating validity."},{"citing_arxiv_id":"2604.16592","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VideoWeave [46] 2026 2D ✓ ✗ ✗ ✗ ✗ ✗ ✗ Splice short captioned videos into synthetic long videos to cheaply train better video-language models. Helios [209] 2026 2D ✓ ✗ ✗ ✓ ✗ ✗ ✗ 14B video generation model running real-time on one H100 via context compression and drift-aware training. Marble World Model [169, 98] 2025 3D ✗ ✓ ✗ ✓ ✗ ✗ ✗ Multimodal 3D world generator. Garrido et al. [52] 2026 2D ✗ ✓ ✗ ✗ ✗ ✗ ✗ Learn action-conditioned World Models from unlabeled in-the-wild videos by inferring continuous latent actions via inverse dynamics. Cambrian-S [200] 2025 2D ✗ ✗ ✗ ✓ ✗ ✗ ✗ Define spatial supersensing hierarchy, benchmark it, and use prediction-error surprise to drive memory/attention in long videos. 
ViewRope [192] 2026 2D+depth ✓ ✓ ✗ ✓ ✗ ✗ ✗ Replace pixel-grid positional embed-"},{"citing_arxiv_id":"2602.06949","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos","primary_cat":"cs.RO","submitted_at":"2026-02-06T18:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 7, 10 [23] Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning Adaptable World Models with Latent Actions. In Proc. of the International Conf. on Machine Learning (ICML), 2025. 2, 6, 7, 16 [24] Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning Latent Action World Models In The Wild. arXiv preprint arXiv:2601.05230, 2026. 16 [25] Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World Models Can Leverage Human Videos for"}],"limit":50,"offset":0}