Physically Interpretable World Models via Weakly Supervised Representation Learning

Ivan Ruchkin; Mrinall Eashaan Umasudhan; Zhenjiang Mao

arxiv: 2412.12870 · v6 · submitted 2024-12-17 · 💻 cs.LG

Physically Interpretable World Models via Weakly Supervised Representation Learning

Zhenjiang Mao , Mrinall Eashaan Umasudhan , Ivan Ruchkin This is my paper

Pith reviewed 2026-05-23 06:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords world modelsrepresentation learningweak supervisionphysical interpretabilitylatent dynamicscyber-physical systemsimage-based prediction

0 comments

The pith

PIWM learns world models whose latent states correspond to physical variables and evolve according to known dynamics using only weak distribution-based supervision from images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PIWM as a way to give world models physical interpretability by making their latent states match real physical quantities and their changes obey partially known physical equations. It achieves this without ground-truth annotations by applying weak supervision that reflects the natural uncertainty in real sensing. A sympathetic reader would care because standard learned world models produce opaque latents that hinder reliability in safety-critical control tasks. The method combines a vector-quantized visual encoder, a transformer physical encoder, and a dynamics model tied to known equations, and it is tested on three image-based control environments. If the approach works, it shows that partial physical knowledge plus distributional supervision can ground representations without full labels.

Core claim

PIWM defines physical interpretability as latent states that match meaningful physical variables and temporal evolution that follows physically consistent dynamics; it realizes this through weak distribution-based supervision on state uncertainty together with a VQ visual encoder, transformer physical encoder, and learnable dynamics grounded in known equations, yielding accurate long-horizon prediction and recovery of true system parameters on Cart Pole, Lunar Lander, and Donkey Car without ground-truth annotations.

What carries the argument

The PIWM architecture that integrates weak distribution-based supervision with a VQ-based visual encoder, transformer-based physical encoder, and dynamics model constrained by known physical equations to enforce alignment and consistency.

If this is right

Accurate long-horizon prediction becomes possible directly from raw images.
True system parameters can be recovered without explicit annotations.
Physical grounding improves markedly compared with purely data-driven world models.
The same weak-supervision approach works across multiple distinct control domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may reduce the data requirements for training reliable controllers in new physical environments.
Partial knowledge of dynamics equations could be combined with this supervision to handle partially observed or noisy real-world sensors.
If the alignment holds, the resulting models could support verification or safety analysis that treats the latent states as interpretable physical quantities.

Load-bearing premise

Weak distribution-based supervision derived from sensing uncertainty is enough to force latent states to align with actual physical variables and to keep their dynamics physically consistent.

What would settle it

On a held-out image-based control task the trained PIWM model produces long-horizon rollouts whose recovered parameters deviate substantially from the known ground-truth values or whose predicted trajectories violate the known physical constraints.

Figures

Figures reproduced from arXiv: 2412.12870 by Ivan Ruchkin, Mrinall Eashaan Umasudhan, Zhenjiang Mao.

**Figure 2.** Figure 2: Overview of world model architectures. (Left) Standard world model uses an encoder-decoder structure with a [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: Prediction performance of CartPole. The Root Mean [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Learned physical parameters vs. ground truth. For Cart Pole and Lunar Lander, the parameters learned by our [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of 30-step trajectory rollouts under [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Extrinsic–discrete PIWM architecture. The vision encoder [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative comparison of 30-step trajectory roll [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety-critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground-truth physical annotations, PIWM employs weak distribution-based supervision that captures state uncertainty naturally arising from real-world sensing pipelines. The architecture integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PIWM combines VQ encoding, transformers, and physics dynamics under weak distribution supervision to target interpretability in world models, but the abstract supplies no metrics or protocol so the performance claims stay uncheckable.

read the letter

The core idea is to align latent states with physical quantities and enforce consistent dynamics without ground-truth labels, relying instead on distribution-based weak signals that reflect sensing uncertainty. The architecture puts a VQ visual encoder in front, a transformer physical encoder, and a dynamics model that incorporates known equations. That combination for interpretability without annotations is the main new piece relative to standard world-model work. It addresses a practical gap in cyber-physical systems where black-box latents limit safety and generalization. The choice of Cart Pole, Lunar Lander, and Donkey Car as test cases is reasonable because they allow post-hoc checks against known parameters. The weak-supervision route is a direct attempt to avoid expensive annotations while still recovering meaningful variables. The main limitation is that the abstract states results on long-horizon prediction and parameter recovery but reports none of the numbers, ablations, or training details needed to evaluate them. Claims of significant improvement over purely data-driven models therefore cannot be weighed yet. No internal contradictions appear in the high-level description, and the approach does not look circular on its face. This work is aimed at researchers building world models for control and robotics who already care about physical grounding. A reader looking for concrete techniques to add interpretability constraints might extract useful architecture pointers even if the quantitative case needs strengthening. It is coherent enough on its own terms to warrant a full review rather than a desk reject, mainly to see whether the experiments actually support the parameter-recovery and grounding claims once the full text and results are examined.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Physically Interpretable World Models (PIWM), a framework that learns latent representations from high-dimensional images aligned with physical variables and whose dynamics follow partially known physical equations. Physical interpretability is defined via two properties: correspondence of latents to meaningful physical variables and physically consistent temporal evolution. The method uses weak distribution-based supervision (capturing sensing uncertainty) together with a VQ visual encoder, transformer physical encoder, and learnable dynamics model. On Cart Pole, Lunar Lander, and Donkey Car, the paper claims accurate long-horizon prediction, recovery of true system parameters, and improved physical grounding relative to purely data-driven baselines.

Significance. If substantiated, the approach would offer a practical route to interpretable world models for cyber-physical systems without requiring ground-truth state annotations. The reliance on weak distributional supervision and partial physics knowledge is a potentially valuable contribution for real-world sensing pipelines, and successful parameter recovery would strengthen the case for using such models in safety-critical settings.

major comments (1)

[Abstract] Abstract: the claims of accurate long-horizon prediction, true parameter recovery, and significantly improved physical grounding are stated without any quantitative metrics, ablation results, or experimental protocol details, preventing evaluation of whether the central claims hold.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The single major comment concerns the abstract's lack of quantitative support for its claims. We address this directly below and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claims of accurate long-horizon prediction, true parameter recovery, and significantly improved physical grounding are stated without any quantitative metrics, ablation results, or experimental protocol details, preventing evaluation of whether the central claims hold.

Authors: We agree that the abstract would be strengthened by including representative quantitative results. The full manuscript reports these details in Sections 4 and 5 (e.g., long-horizon MSE on Cart Pole/Lunar Lander/Donkey Car, parameter recovery errors with standard deviations, and ablation studies comparing against data-driven baselines). To make the abstract self-contained, we will revise it to incorporate key metrics such as prediction accuracy over 50-step horizons and parameter estimation errors, while retaining the high-level summary. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation chain integrates externally known physical equations into the dynamics model and applies weak distribution-based supervision to align latents, without any reduction of predictions or parameter recovery to the supervision signal by construction. The VQ/transformer architecture and case-study verification against independent ground truth remain independent of the fitted values. No self-citation load-bearing steps, self-definitional quantities, or ansatz smuggling appear in the abstract or described framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the approach assumes standard components (VQ encoder, transformer) and domain knowledge of physical equations.

axioms (1)

domain assumption Partially known physical dynamics can be integrated into a learnable model to constrain latent evolution
Stated in the description of the dynamics model grounded in known physical equations.

pith-pipeline@v0.9.0 · 5747 in / 1207 out tokens · 27510 ms · 2026-05-23T06:58:05.450907+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 9 internal anchors

[1]

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe reinforcement learning via shielding. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

work page 2018
[2]

Martin Asenov, Michael Burke, Daniel Angelov, Todor Davchev, Kartic Subr, and Subramanian Ramamoorthy. 2019. Vid2param: Modeling of dynamics parameters from video.IEEE Robotics and Automation Letters5, 2 (2019), 414–421

work page 2019
[3]

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.arXiv preprint arXiv:1803.01271(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Jonathan Balloch, Zhiyu Lin, Robert Wright, Xiangyu Peng, Mustafa Hussain, Aarun Srinivas, Julia Kim, and Mark O Riedl. 2023. Neuro-symbolic world models for adapting to open world novelty.arXiv preprint arXiv:2301.06294(2023)

work page arXiv 2023
[5]

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schul- man, Jie Tang, and Wojciech Zaremba. 2016. Openai gym.arXiv preprint arXiv:1606.01540(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Brunton, Joshua L

Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. 2016. Sparse Identifi- cation of Nonlinear Dynamics with Control (SINDYc).IFAC-PapersOnLine49, 18 (2016), 710–715. doi:10.1016/j.ifacol.2016.10.249 10th IFAC Symposium on Nonlinear Control Systems NOLCOS 2016

work page doi:10.1016/j.ifacol.2016.10.249 2016
[7]

Boyuan Chen, Kuang Huang, Sunand Raghupathi, Ishaan Chandratreya, Qiang Du, and Hod Lipson. 2022. Automated discovery of fundamental variables hidden in experimental data.Nature Computational Science2, 7 (2022), 433–442

work page 2022
[8]

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders.Advances in neural information processing systems31 (2018)

work page 2018
[9]

Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. 2020. Deep kinematic models for kinemati- cally feasible vehicle trajectory predictions. In2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 10563–10569

work page 2020
[10]

Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. 2018. End-to-end differentiable physics for learning and control. Advances in neural information processing systems31 (2018)

work page 2018
[11]

Franck Djeumou, Cyrus Neary, and Ufuk Topcu. 2023. How to learn and gen- eralize from three minutes of data: Physics-constrained and uncertainty-aware neural stochastic differential equations.arXiv preprint arXiv:2306.06335(2023)

work page arXiv 2023
[12]

Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. 2023. Fo- cus: Object-centric world models for robotics manipulation.arXiv preprint arXiv:2307.02427(2023)

work page arXiv 2023
[13]

David Fridovich-Keil, Andrea Bajcsy, Jaime F Fisac, Sylvia L Herbert, Steven Wang, Anca D Dragan, and Claire J Tomlin. 2020. Confidence-aware motion prediction for real-time collision avoidance1.The International Journal of Robotics Research39, 2-3 (2020), 250–265

work page 2020
[14]

Xiang Fu, Ge Yang, Pulkit Agrawal, and Tommi Jaakkola. 2021. Learning task informed abstractions. InInternational Conference on Machine Learning. PMLR, 3480–3491

work page 2021
[15]

David Ha and Jürgen Schmidhuber. 2018. World models.arXiv preprint arXiv:1803.10122(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. 2020. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[17]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2023. Mas- tering diverse domains through world models.arXiv preprint arXiv:2301.04104 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Osman Hasan and Sofiene Tahar. 2015. Formal verification methods. InEncyclo- pedia of Information Science and Technology, Third Edition. IGI global, 7162–7170

work page 2015
[19]

Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework.ICLR (Poster)3 (2017)

work page 2017
[20]

Kai-Chieh Hsu, Karen Leung, Yuxiao Chen, Jaime F Fisac, and Marco Pavone. 2023. Interpretable Trajectory Prediction for Autonomous Vehicles via Counterfactual Responsibility. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5918–5925

work page 2023
[21]

Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods.ACM Computing Surveys (CSUR)50, 2 (2017), 1–35

work page 2017
[22]

Masha Itkina and Mykel Kochenderfer. 2023. Interpretable self-aware neural networks for robust trajectory prediction. InConference on Robot Learning. PMLR, 606–617

work page 2023
[23]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Rein- forcement learning: A survey.Journal of artificial intelligence research4 (1996), 237–285

work page 1996
[24]

Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der Smagt

work page
[25]

Deep variational bayes filters: Unsupervised learning of state space models from raw data.arXiv preprint arXiv:1605.06432(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[26]

Sydney M Katz, Anthony L Corso, Christopher A Strong, and Mykel J Kochender- fer. 2022. Verification of image-based neural network controllers using generative models.Journal of Aerospace Information Systems19, 9 (2022), 574–584

work page 2022
[27]

Hyunjik Kim and Andriy Mnih. 2018. Disentangling by factorising. InInterna- tional conference on machine learning. PMLR, 2649–2658

work page 2018
[28]

Diederik P Kingma. 2013. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114(2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[29]

George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. 2018. From skills to symbols: Learning symbolic representations for abstract high-level planning.Journal of Artificial Intelligence Research61 (2018), 215–289

work page 2018
[30]

Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, and Lingjie Liu. 2025. Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels.arXiv preprint arXiv:2508.17437(2025)

work page arXiv 2025
[31]

Anson Lei, Bernhard Schölkopf, and Ingmar Posner. 2024. SPARTAN: A sparse transformer learning local causation.arXiv preprint arXiv:2411.06890(2024)

work page arXiv 2024
[32]

Anjian Li, Liting Sun, Wei Zhan, Masayoshi Tomizuka, and Mo Chen. 2021. Prediction-based reachability for collision avoidance in autonomous driving. In 2021 Intl. Conf. on Robotics and Automation (ICRA)

work page 2021
[33]

Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, João F Henriques, and Kevin Ellis. 2024. Visualpredicator: Learning abstract world models with neuro-symbolic predicates for robot planning.arXiv preprint arXiv:2410.23156(2024)

work page arXiv 2024
[34]

TP Lillicrap. 2015. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[35]

Lars Lindemann, Xin Qin, Jyotirmoy V Deshmukh, and George J Pappas. 2023. Conformal prediction for stl runtime verification. InProceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023). 142–153

work page 2023
[36]

Ori Linial, Neta Ravid, Danny Eytan, and Uri Shalit. 2021. Generative ODE modeling with known unknowns. InProceedings of the Conference on Health, Inference, and Learning (CHIL ’21). Association for Computing Machinery, New York, NY, USA, 79–94. doi:10.1145/3450439.3451866

work page doi:10.1145/3450439.3451866 2021
[37]

Juanwu Lu, Wei Zhan, Masayoshi Tomizuka, and Yeping Hu. 2024. Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 4717–4725

work page 2024
[38]

Yanbing Mao, Yuliang Gu, Lui Sha, Huajie Shao, Qixin Wang, and Tarek Ab- delzaher. 2025. Phy-Taylor: Partially Physics-Knowledge-Enhanced Deep Neural Networks via NN Editing.IEEE Transactions on Neural Networks and Learning Systems36, 1 (Jan. 2025), 447–461. doi:10.1109/TNNLS.2023.3325432

work page doi:10.1109/tnnls.2023.3325432 2025
[39]

Zhenjiang Mao, Siqi Dai, Yuang Geng, and Ivan Ruchkin. 2024. Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models.arXiv preprint arXiv:2404.00462(2024)

work page arXiv 2024
[40]

Vincent Micheli, Eloi Alonso, and François Fleuret. 2022. Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588(2022)

work page arXiv 2022
[41]

Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. 2023. Uniworld: Au- tonomous driving pre-training via world models.arXiv preprint arXiv:2308.07234 (2023)

work page arXiv 2023
[42]

Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. 2025. SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels. arXiv:2410.08822 [cs.LG] https://arxiv.org/abs/2410.08822

work page arXiv 2025
[43]

Kensuke Nakamura and Somil Bansal. 2023. Online update of safety assurances using confidence-based predictions. In2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 12765–12771

work page 2023
[44]

Andi Peng, Mycal Tucker, Eoin Kenny, Noga Zaslavsky, Pulkit Agrawal, and Julie A Shah. 2024. Human-guided complexity-controlled abstractions.Advances in Neural Information Processing Systems36 (2024)

work page 2024
[45]

Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin

work page
[46]

Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025)

work page arXiv 2025
[47]

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2.Advances in neural information processing systems32 (2019)

work page 2019
[48]

Ivan Ruchkin, Matthew Cleaveland, Radoslav Ivanov, Pengyuan Lu, Taylor Car- penter, Oleg Sokolsky, and Insup Lee. 2022. Confidence composition for monitors of verification assumptions. In2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 1–12

work page 2022
[49]

Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. 2020. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. Springer, 683–700

work page 2020
[50]

Dylan Sam, Rattana Pukdee, Daniel P Jeong, Yewon Byun, and J Zico Kolter

work page
[51]

Bayesian neural networks with domain knowledge priors.arXiv preprint arXiv:2402.13410(2024)

work page arXiv 2024
[52]

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. 2023. Masked world models for visual control. InConference on Robot Learning. PMLR, 1332–1344. Mao et al

work page 2023
[53]

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting.Advances in neural information processing systems28 (2015)

work page 2015
[54]

Kaustubh Sridhar, Souradeep Dutta, James Weimer, and Insup Lee. 2023. Guaran- teed conformance of neurosymbolic models to natural constraints. InLearning for Dynamics and Control Conference. PMLR, 76–89

work page 2023
[55]

Renukanandan Tumu, Lars Lindemann, Truong Nghiem, and Rahul Mangharam

work page
[56]

In 2023 IEEE Intelligent Vehicles Symposium (IV)

Physics constrained motion prediction with uncertainty quantification. In 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1–8

work page 2023
[57]

Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning.Advances in neural information processing systems30 (2017)

work page 2017
[58]

Ari Viitala, Rinu Boney, Yi Zhao, Alexander Ilin, and Juho Kannala. 2021. Learning to drive (L2D) as a low-cost benchmark for real-world reinforcement learning. In 2021 20th International Conference on Advanced Robotics (ICAR). IEEE, 275–281

work page 2021
[59]

Masaki Waga, Ezequiel Castellano, Sasinee Pruekprasert, Stefan Klikovits, Toru Takisaka, and Ichiro Hasuo. 2022. Dynamic shielding for reinforcement learning in black-box environments. InInternational Symposium on Automated Technology for Verification and Analysis. Springer, 25–41

work page 2022
[60]

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. 2023. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Gold- berg. 2023. Daydreamer: World models for physical robot learning. InConference on robot learning. PMLR, 2226–2240

work page 2023
[62]

Yifan Xue, Michael Ding, and Xinghua Lu. 2019. Supervised vector quantized variational autoencoder for learning interpretable global representations.arXiv preprint arXiv:1909.11124(2019)

work page arXiv 2019
[63]

Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. 2024. Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356(2024)

work page arXiv 2024
[64]

Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang

work page
[65]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

Causalvae: Disentangled representation learning via neural structural causal models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9593–9602

work page
[66]

Dingling Yao, Caroline Muller, and Francesco Locatello. 2024. Marrying causal representation learning with dynamical systems for science.Advances in Neural Information Processing Systems37 (2024), 71705–71736

work page 2024
[67]

Zhe Zhao, Mengshi Qi, and Huadong Ma. 2024. Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation. InEuropean Conference on Computer Vision. Springer, 447–463

work page 2024
[68]

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. 2024. Occworld: Learning a 3d occupancy world model for autonomous driving. InEuropean conference on computer vision. Springer, 55–72

work page 2024
[69]

Weiheng Zhong and Hadi Meidani. 2023. PI-VAE: Physics-Informed Variational Auto-Encoder for stochastic differential equations.Computer Methods in Applied Mechanics and Engineering403 (2023), 115664. doi:10.1016/j.cma.2022.115664

work page doi:10.1016/j.cma.2022.115664 2023
[70]

Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. 2024. Gaussianworld: Gaussian world model for streaming 3d occupancy prediction. arXiv preprint arXiv:2412.10373(2024). Physically Interpretable World Models via Weakly Supervised Representation Learning Appendix Model Architecture and Training Details.This appendix pro- vides detailed imp...

work page arXiv 2024
[71]

The dynamics model learns between 4 and 10 environment- specific parameters using 50 weak supervision samples per time step

Intrinsic variants instead embed the physical portion directly into 𝑧∗ 𝑡 . The dynamics model learns between 4 and 10 environment- specific parameters using 50 weak supervision samples per time step. I. Training Hyperparameters in Text.All experiments use batch size 32, cosine learning-rate decay with 5 warmup epochs, a maxi- mum of 200 epochs with patien...

work page

[1] [1]

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe reinforcement learning via shielding. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

work page 2018

[2] [2]

Martin Asenov, Michael Burke, Daniel Angelov, Todor Davchev, Kartic Subr, and Subramanian Ramamoorthy. 2019. Vid2param: Modeling of dynamics parameters from video.IEEE Robotics and Automation Letters5, 2 (2019), 414–421

work page 2019

[3] [3]

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.arXiv preprint arXiv:1803.01271(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Jonathan Balloch, Zhiyu Lin, Robert Wright, Xiangyu Peng, Mustafa Hussain, Aarun Srinivas, Julia Kim, and Mark O Riedl. 2023. Neuro-symbolic world models for adapting to open world novelty.arXiv preprint arXiv:2301.06294(2023)

work page arXiv 2023

[5] [5]

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schul- man, Jie Tang, and Wojciech Zaremba. 2016. Openai gym.arXiv preprint arXiv:1606.01540(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Brunton, Joshua L

Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. 2016. Sparse Identifi- cation of Nonlinear Dynamics with Control (SINDYc).IFAC-PapersOnLine49, 18 (2016), 710–715. doi:10.1016/j.ifacol.2016.10.249 10th IFAC Symposium on Nonlinear Control Systems NOLCOS 2016

work page doi:10.1016/j.ifacol.2016.10.249 2016

[7] [7]

Boyuan Chen, Kuang Huang, Sunand Raghupathi, Ishaan Chandratreya, Qiang Du, and Hod Lipson. 2022. Automated discovery of fundamental variables hidden in experimental data.Nature Computational Science2, 7 (2022), 433–442

work page 2022

[8] [8]

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders.Advances in neural information processing systems31 (2018)

work page 2018

[9] [9]

Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. 2020. Deep kinematic models for kinemati- cally feasible vehicle trajectory predictions. In2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 10563–10569

work page 2020

[10] [10]

Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. 2018. End-to-end differentiable physics for learning and control. Advances in neural information processing systems31 (2018)

work page 2018

[11] [11]

Franck Djeumou, Cyrus Neary, and Ufuk Topcu. 2023. How to learn and gen- eralize from three minutes of data: Physics-constrained and uncertainty-aware neural stochastic differential equations.arXiv preprint arXiv:2306.06335(2023)

work page arXiv 2023

[12] [12]

Stefano Ferraro, Pietro Mazzaglia, Tim Verbelen, and Bart Dhoedt. 2023. Fo- cus: Object-centric world models for robotics manipulation.arXiv preprint arXiv:2307.02427(2023)

work page arXiv 2023

[13] [13]

David Fridovich-Keil, Andrea Bajcsy, Jaime F Fisac, Sylvia L Herbert, Steven Wang, Anca D Dragan, and Claire J Tomlin. 2020. Confidence-aware motion prediction for real-time collision avoidance1.The International Journal of Robotics Research39, 2-3 (2020), 250–265

work page 2020

[14] [14]

Xiang Fu, Ge Yang, Pulkit Agrawal, and Tommi Jaakkola. 2021. Learning task informed abstractions. InInternational Conference on Machine Learning. PMLR, 3480–3491

work page 2021

[15] [15]

David Ha and Jürgen Schmidhuber. 2018. World models.arXiv preprint arXiv:1803.10122(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. 2020. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[17] [17]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2023. Mas- tering diverse domains through world models.arXiv preprint arXiv:2301.04104 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Osman Hasan and Sofiene Tahar. 2015. Formal verification methods. InEncyclo- pedia of Information Science and Technology, Third Edition. IGI global, 7162–7170

work page 2015

[19] [19]

Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework.ICLR (Poster)3 (2017)

work page 2017

[20] [20]

Kai-Chieh Hsu, Karen Leung, Yuxiao Chen, Jaime F Fisac, and Marco Pavone. 2023. Interpretable Trajectory Prediction for Autonomous Vehicles via Counterfactual Responsibility. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5918–5925

work page 2023

[21] [21]

Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods.ACM Computing Surveys (CSUR)50, 2 (2017), 1–35

work page 2017

[22] [22]

Masha Itkina and Mykel Kochenderfer. 2023. Interpretable self-aware neural networks for robust trajectory prediction. InConference on Robot Learning. PMLR, 606–617

work page 2023

[23] [23]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Rein- forcement learning: A survey.Journal of artificial intelligence research4 (1996), 237–285

work page 1996

[24] [24]

Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der Smagt

work page

[25] [25]

Deep variational bayes filters: Unsupervised learning of state space models from raw data.arXiv preprint arXiv:1605.06432(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[26] [26]

Sydney M Katz, Anthony L Corso, Christopher A Strong, and Mykel J Kochender- fer. 2022. Verification of image-based neural network controllers using generative models.Journal of Aerospace Information Systems19, 9 (2022), 574–584

work page 2022

[27] [27]

Hyunjik Kim and Andriy Mnih. 2018. Disentangling by factorising. InInterna- tional conference on machine learning. PMLR, 2649–2658

work page 2018

[28] [28]

Diederik P Kingma. 2013. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114(2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[29] [29]

George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. 2018. From skills to symbols: Learning symbolic representations for abstract high-level planning.Journal of Artificial Intelligence Research61 (2018), 215–289

work page 2018

[30] [30]

Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, and Lingjie Liu. 2025. Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels.arXiv preprint arXiv:2508.17437(2025)

work page arXiv 2025

[31] [31]

Anson Lei, Bernhard Schölkopf, and Ingmar Posner. 2024. SPARTAN: A sparse transformer learning local causation.arXiv preprint arXiv:2411.06890(2024)

work page arXiv 2024

[32] [32]

Anjian Li, Liting Sun, Wei Zhan, Masayoshi Tomizuka, and Mo Chen. 2021. Prediction-based reachability for collision avoidance in autonomous driving. In 2021 Intl. Conf. on Robotics and Automation (ICRA)

work page 2021

[33] [33]

Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, João F Henriques, and Kevin Ellis. 2024. Visualpredicator: Learning abstract world models with neuro-symbolic predicates for robot planning.arXiv preprint arXiv:2410.23156(2024)

work page arXiv 2024

[34] [34]

TP Lillicrap. 2015. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[35] [35]

Lars Lindemann, Xin Qin, Jyotirmoy V Deshmukh, and George J Pappas. 2023. Conformal prediction for stl runtime verification. InProceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023). 142–153

work page 2023

[36] [36]

Ori Linial, Neta Ravid, Danny Eytan, and Uri Shalit. 2021. Generative ODE modeling with known unknowns. InProceedings of the Conference on Health, Inference, and Learning (CHIL ’21). Association for Computing Machinery, New York, NY, USA, 79–94. doi:10.1145/3450439.3451866

work page doi:10.1145/3450439.3451866 2021

[37] [37]

Juanwu Lu, Wei Zhan, Masayoshi Tomizuka, and Yeping Hu. 2024. Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 4717–4725

work page 2024

[38] [38]

Yanbing Mao, Yuliang Gu, Lui Sha, Huajie Shao, Qixin Wang, and Tarek Ab- delzaher. 2025. Phy-Taylor: Partially Physics-Knowledge-Enhanced Deep Neural Networks via NN Editing.IEEE Transactions on Neural Networks and Learning Systems36, 1 (Jan. 2025), 447–461. doi:10.1109/TNNLS.2023.3325432

work page doi:10.1109/tnnls.2023.3325432 2025

[39] [39]

Zhenjiang Mao, Siqi Dai, Yuang Geng, and Ivan Ruchkin. 2024. Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models.arXiv preprint arXiv:2404.00462(2024)

work page arXiv 2024

[40] [40]

Vincent Micheli, Eloi Alonso, and François Fleuret. 2022. Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588(2022)

work page arXiv 2022

[41] [41]

Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. 2023. Uniworld: Au- tonomous driving pre-training via world models.arXiv preprint arXiv:2308.07234 (2023)

work page arXiv 2023

[42] [42]

Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, and Sven Behnke. 2025. SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels. arXiv:2410.08822 [cs.LG] https://arxiv.org/abs/2410.08822

work page arXiv 2025

[43] [43]

Kensuke Nakamura and Somil Bansal. 2023. Online update of safety assurances using confidence-based predictions. In2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 12765–12771

work page 2023

[44] [44]

Andi Peng, Mycal Tucker, Eoin Kenny, Noga Zaslavsky, Pulkit Agrawal, and Julie A Shah. 2024. Human-guided complexity-controlled abstractions.Advances in Neural Information Processing Systems36 (2024)

work page 2024

[45] [45]

Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin

work page

[46] [46]

Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025)

work page arXiv 2025

[47] [47]

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2.Advances in neural information processing systems32 (2019)

work page 2019

[48] [48]

Ivan Ruchkin, Matthew Cleaveland, Radoslav Ivanov, Pengyuan Lu, Taylor Car- penter, Oleg Sokolsky, and Insup Lee. 2022. Confidence composition for monitors of verification assumptions. In2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 1–12

work page 2022

[49] [49]

Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. 2020. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. Springer, 683–700

work page 2020

[50] [50]

Dylan Sam, Rattana Pukdee, Daniel P Jeong, Yewon Byun, and J Zico Kolter

work page

[51] [51]

Bayesian neural networks with domain knowledge priors.arXiv preprint arXiv:2402.13410(2024)

work page arXiv 2024

[52] [52]

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. 2023. Masked world models for visual control. InConference on Robot Learning. PMLR, 1332–1344. Mao et al

work page 2023

[53] [53]

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting.Advances in neural information processing systems28 (2015)

work page 2015

[54] [54]

Kaustubh Sridhar, Souradeep Dutta, James Weimer, and Insup Lee. 2023. Guaran- teed conformance of neurosymbolic models to natural constraints. InLearning for Dynamics and Control Conference. PMLR, 76–89

work page 2023

[55] [55]

Renukanandan Tumu, Lars Lindemann, Truong Nghiem, and Rahul Mangharam

work page

[56] [56]

In 2023 IEEE Intelligent Vehicles Symposium (IV)

Physics constrained motion prediction with uncertainty quantification. In 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1–8

work page 2023

[57] [57]

Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning.Advances in neural information processing systems30 (2017)

work page 2017

[58] [58]

Ari Viitala, Rinu Boney, Yi Zhao, Alexander Ilin, and Juho Kannala. 2021. Learning to drive (L2D) as a low-cost benchmark for real-world reinforcement learning. In 2021 20th International Conference on Advanced Robotics (ICAR). IEEE, 275–281

work page 2021

[59] [59]

Masaki Waga, Ezequiel Castellano, Sasinee Pruekprasert, Stefan Klikovits, Toru Takisaka, and Ichiro Hasuo. 2022. Dynamic shielding for reinforcement learning in black-box environments. InInternational Symposium on Automated Technology for Verification and Analysis. Springer, 25–41

work page 2022

[60] [60]

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. 2023. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Gold- berg. 2023. Daydreamer: World models for physical robot learning. InConference on robot learning. PMLR, 2226–2240

work page 2023

[62] [62]

Yifan Xue, Michael Ding, and Xinghua Lu. 2019. Supervised vector quantized variational autoencoder for learning interpretable global representations.arXiv preprint arXiv:1909.11124(2019)

work page arXiv 2019

[63] [63]

Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. 2024. Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356(2024)

work page arXiv 2024

[64] [64]

Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang

work page

[65] [65]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

Causalvae: Disentangled representation learning via neural structural causal models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9593–9602

work page

[66] [66]

Dingling Yao, Caroline Muller, and Francesco Locatello. 2024. Marrying causal representation learning with dynamical systems for science.Advances in Neural Information Processing Systems37 (2024), 71705–71736

work page 2024

[67] [67]

Zhe Zhao, Mengshi Qi, and Huadong Ma. 2024. Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation. InEuropean Conference on Computer Vision. Springer, 447–463

work page 2024

[68] [68]

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. 2024. Occworld: Learning a 3d occupancy world model for autonomous driving. InEuropean conference on computer vision. Springer, 55–72

work page 2024

[69] [69]

Weiheng Zhong and Hadi Meidani. 2023. PI-VAE: Physics-Informed Variational Auto-Encoder for stochastic differential equations.Computer Methods in Applied Mechanics and Engineering403 (2023), 115664. doi:10.1016/j.cma.2022.115664

work page doi:10.1016/j.cma.2022.115664 2023

[70] [70]

Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. 2024. Gaussianworld: Gaussian world model for streaming 3d occupancy prediction. arXiv preprint arXiv:2412.10373(2024). Physically Interpretable World Models via Weakly Supervised Representation Learning Appendix Model Architecture and Training Details.This appendix pro- vides detailed imp...

work page arXiv 2024

[71] [71]

The dynamics model learns between 4 and 10 environment- specific parameters using 50 weak supervision samples per time step

Intrinsic variants instead embed the physical portion directly into 𝑧∗ 𝑡 . The dynamics model learns between 4 and 10 environment- specific parameters using 50 weak supervision samples per time step. I. Training Hyperparameters in Text.All experiments use batch size 32, cosine learning-rate decay with 5 warmup epochs, a maxi- mum of 200 epochs with patien...

work page