{"total":25,"items":[{"citing_arxiv_id":"2605.23878","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:34:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22164","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics","primary_cat":"cs.LG","submitted_at":"2026-05-21T08:34:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRM trains a small horizon-matched pairwise head on trajectory data to improve terminal-state ranking in latent MPC, raising success from 7% to 97% on TwoRoom and 32.7% to 84% on PLDM without changing the encoder or dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21963","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data","primary_cat":"cs.LG","submitted_at":"2026-05-21T03:50:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CMWM is a recurrent latent world model for forecasting patient trajectories like annual eGFR in CKD, reporting 7.28% lower MAE than a tuned GPT-5.5 
baseline on a 2232-patient cohort with gains from dialogue data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21800","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation","primary_cat":"cs.LG","submitted_at":"2026-05-20T22:58:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15153","ref_index":31,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action","primary_cat":"cs.RO","submitted_at":"2026-05-14T17:50:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent compromise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16412","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCAR: Self-Supervised Continuous Action Representation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:23:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCAR proposes a joint 
inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13013","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-13T05:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09693","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Do multimodal models imagine electric sheep?","primary_cat":"cs.CV","submitted_at":"2026-05-10T18:25:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"states from actions [34, 35] have seen renewed interest in the context of large generative models. Maes et al. [36] present LeWorldModel, a joint-embedding predictive architecture that achieves stable end-to-end training from pixels. Wiedemer et al. [37] demonstrate emergent zero-shot reasoning in video models including maze solving and physical reasoning. Wang et al. 
[38] reveal that reasoning in video diffusion models emerges along denoising steps through a \"chain-of-steps\" process. Wu et al. [39] use video generation models to augment multimodal models with geometric cues. Magne et al. Figure 3: Spatial reasoning puzzles. We use twelve puzzles that require a variety of skills such as shape perception, visualization, and planning."},{"citing_arxiv_id":"2605.07554","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProteinJEPA: Latent prediction complements protein language models","primary_cat":"cs.LG","submitted_at":"2026-05-08T10:30:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07514","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T09:44:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"In Robotics: Science and Systems, 2026. [26] Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. [27] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. 
Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025. [28] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026. [29] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots."},{"citing_arxiv_id":"2605.07390","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T07:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Using adaptive layer normalization (adaLN), $a_t$ generates affine parameters to modulate $s_t$ into an action-conditioned state $s'_t$. This integration enforces kinematic constraints during autoregressive modeling. Cognition-based World Model Reasoning. To deduce future cognition, we instantiate a causal Transformer predictor inspired by LeWorldModel [20]. It processes the historical sequence $[s'_{t-T}, \\ldots, s'_t]$ to infer the subsequent latent state $\\hat{s}_{t+1}$. A cross-attention state resampling decoder then projects this prediction into the cognition space, utilizing the original graph $G^{t}_{ST}$ as queries and $\\hat{s}_{t+1}$ as keys and values to iteratively construct the future cognition $\\hat{G}^{t+1}_{ST}$. 
This reasoning ensures future states preserve"},{"citing_arxiv_id":"2605.07278","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Predictive but Not Plannable: RC-aux for Latent World Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:43:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"cost and contain intermediate predicted states from which the goal is estimated to be reachable within the remaining horizon. Thus, reachability becomes an explicit search signal rather than only a training-time regularizer. We instantiate RC-aux on LeWorldModel (LeWM), a compact reconstruction-free JEPA world model for goal-conditioned planning from pixels [29]. RC-aux preserves the LeWM backbone, allowing us to isolate the effect of correcting the training horizon, shaping finite-horizon reachability geometry, and using the learned reachability signal during search. 
Although our experiments use LeWM, the objective itself is backbone-agnostic: it requires only latent rollouts, goal latents, and finite-horizon"},{"citing_arxiv_id":"2605.06841","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites","primary_cat":"cs.AI","submitted_at":"2026-05-07T18:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06298","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05586","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-07T02:11:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AeroJEPA applies joint-embedding predictive learning to produce scalable, semantically organized latent representations for 3D aerodynamic 
fields that support both field reconstruction and downstream design tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"distribution, namely SIGReg [Balestriero and LeCun, 2025, Maes et al., 2026]. This yields two closely related training objectives depending on whether the decoder is optimized jointly or separately. For coupled end-to-end training, we optimize $L_{\\text{total}} = \\lambda_{\\ell} L_{\\text{lat}} + \\lambda_{r} L_{\\text{rec}} + \\lambda_{s} L_{\\text{sig}}$ (5), whereas the decoupled latent-only stage uses $L_{\\text{latent-only}} = \\lambda_{\\ell} L_{\\text{lat}} + \\lambda_{s} L_{\\text{sig}}$ (6). The latent matching term is $L_{\\text{lat}} = \\lVert \\hat{Z}_t - Z_t \\rVert_2^2$ (7), which aligns the predictor output with the target encoder output token by token. When the decoder is trained jointly, the reconstruction term is $L_{\\text{rec}} = \\mathbb{E}_{q \\in \\Omega} \\left[ \\lVert f^{\\text{dec}}_{\\phi}(\\hat{Z}_t, q) - F(q) \\rVert_2^2 \\right]$ (8), with the supervised channels depending on whether the task is surface-only or volumetric."},{"citing_arxiv_id":"2605.03413","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to Theorize the World from Observation","primary_cat":"cs.LG","submitted_at":"2026-05-05T06:39:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"x, y), with $\\hat{\\tau} \\notin T_{\\text{train}}$. We interpret successful inference in this regime as evidence of Learning-to-Theorize. Evaluation: Program Transferability. We evaluate induced theories at the execution level by testing whether inferred programs act as reusable and transferable compositional explanations. At test time, we consider pairs of phenomena $(x^{(1)}, y^{(1)})$ and $(x^{(2)}, y^{(2)})$ generated by the same latent program $\\tau \\in T_{\\text{test}}$. 
The model first infers a program $\\hat{\\tau}$ from $(x^{(1)}, y^{(1)})$ and then applies it to $x^{(2)}$ to obtain $\\hat{y}^{(2)} = D_{\\theta}(f_{\\hat{\\tau}}(x^{(2)}))$. Performance is measured by an observation-space error $d_{\\text{obs}}(\\hat{y}^{(2)}, y^{(2)})$. This protocol assesses whether the learned theory captures a transferable generative mechanism, rather than merely fitting individual"},{"citing_arxiv_id":"2605.01694","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent State Design for World Models under Sufficiency Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-03T03:19:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Position | Training objective | Year range | Representative methods\nReconstruction-heavy | Pixel reconstruction or generative modeling | 2018-2024 | World Models [22], SimPLe [34]\nToken compression | Discrete-token prediction | 2022-2025 | IRIS [42], GAIA-1 [30], GAIA-2 [50]\nRepresentation prediction | Embedding-space prediction | 2023-2026 | I-JEPA [2], V-JEPA 2 [3], V-JEPA 2.1 [44], LeWorldModel [40]\nReward / value-shaped | Reward and policy-relevant supervision | 2019-2021 | TPC [46], value-aligned latent planning [28]\nValue-equivalent | Bellman-relevant statistics only | 2020-2023 | MuZero [52], EfficientZero [66], TD-MPC2 [26]\nCausal / counterfactual | Intervention-sensitive structural variables | 2026 | Causal-JEPA [45], CausalVAE-WM [14]"},{"citing_arxiv_id":"2605.00080","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Model for Robot Learning: A Comprehensive 
Survey","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27411","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift","primary_cat":"cs.LG","submitted_at":"2026-04-30T04:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24662","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data","primary_cat":"physics.data-an","submitted_at":"2026-04-27T16:24:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent 
space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18058","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity","primary_cat":"cs.LG","submitted_at":"2026-04-20T10:26:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-device wearables.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sub-quadratic hybrid architectures that collapse the parameter cost of long-context modelling; and nascent world models of the body that recast physiological forecasting as a generative prior over kinematic and autonomic state. Wearable foundation models. RelCon [23] trains on one billion accelerometry segments from 87,376 participants using relative contrastive learning. The Large Sensor Model [24] scales this to 40 million hours across six modalities and derives scaling laws for wearable sensing. NormWear [25] handles arbitrary physiological signal configurations through wavelet tokenisation and channel-aware attention across 18 downstream tasks. 
SensorLM [26] bridges sensor data and natural language, while AccelFM [27] distils PPG representations into an"},{"citing_arxiv_id":"2604.11302","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS","primary_cat":"cs.RO","submitted_at":"2026-04-13T11:01:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05157","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents","primary_cat":"cs.AI","submitted_at":"2026-04-06T20:39:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.29496","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Metriplector: From Field Theory to Neural Architecture","primary_cat":"cs.AI","submitted_at":"2026-03-31T09:40:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small parameter 
counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.28489","ref_index":224,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ctrl-World [205], VideoAgent [206], VIPER [207], WorldEval [208], Genie Envisioner [209], World-Gymnast [210], DreamDojo [211] GR-1 [212], VILP [213], UVA [214], RoboEnvision [215], GEVRM [216], EnerVerse [217], LingBot-VA [218], Cosmos Policy [219], Fast-WAM [220], LeWorldModel [221], DreamZero [222] Game & Interactive World Simulation GameGen-X [223], GameFactory [224], MineWorld [225], Matrix-Game [42], [226], GenieRedux-G [227], Hunyuan-GameCraft [228], [229], PlayGen [230], WorldPlay [231], Yume1.5 [129], LingBot-World [232], Cosmos-Predict2.5 [43], Dreamer 4 [233], Genie 3 [21] but also long-horizon and interactive generation. In practice, this means that parallelism, caching, pruning, and quantization should work together rather than be applied separately."}],"limit":50,"offset":0}