WorldString: Actionable World Representation

Isabella Liu; Jianglong Ye; Jitao Li; Kunqi Xu; Sifei Liu; Tianshu Tang; Xueyan Zou

arxiv: 2605.18743 · v2 · pith:H4PHB7T6new · submitted 2026-05-18 · 💻 cs.AI

Pith reviewed 2026-05-21 07:43 UTC · model grok-4.3

keywords: world models · object state manifold · digital twin · point cloud · RGB-D video · actionable representation · neural dynamics · physical simulation

The pith

WorldString is a neural architecture that learns the full state manifold of real objects directly from point clouds or RGB-D video to serve as an actionable digital twin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a single neural model that captures how everyday objects change their internal states when acted upon, trained straight from raw 3D sensor data. Current approaches handle object dynamics either through video prediction or scene reconstruction but do not treat the underlying action state as a unified, learnable manifold. If the architecture succeeds, it would supply a differentiable building block that world-model systems could plug into policy learning or dynamics simulation without extra engineering. The authors position the method as a foundational primitive for physical intelligence, analogous to how language models abstract human knowledge. They emphasize that the structure remains fully differentiable so it can later combine with reinforcement learning or neural physics engines.

Core claim

We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models. Its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

What carries the argument

WorldString, the fully differentiable neural network that directly regresses an object's state manifold from raw point-cloud or RGB-D input streams.

If this is right

Objects become first-class primitives inside world models instead of being handled through indirect video generation or separate reconstruction pipelines.
The representation can be trained on either static point clouds or dynamic RGB-D sequences, giving flexibility across sensor types.
Full differentiability lets the model be inserted into larger systems that optimize policies or simulate long-horizon physical interactions.
A shared manifold for object states could reduce the need for task-specific engineering when moving from perception to control.
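To make the differentiability point concrete, here is a minimal sketch of what "optimizing through" an object-state model can look like. It is not WorldString's architecture, which this page does not describe. The decoder is a hypothetical stand-in: a door panel hinged at the origin whose only state is an opening angle `theta`. The names `door_points`, `loss_and_grad`, and `fit_state` are illustrative assumptions.

```python
import math

def door_points(theta, radii):
    """Hypothetical toy 'decoder': 2D points on a door panel hinged at the
    origin, rotated by the state variable theta (radians)."""
    return [(r * math.cos(theta), r * math.sin(theta)) for r in radii]

def loss_and_grad(theta, radii, observed):
    """Mean squared error against observed points, plus its analytic
    derivative with respect to theta."""
    loss, grad = 0.0, 0.0
    for r, (ox, oy) in zip(radii, observed):
        px, py = r * math.cos(theta), r * math.sin(theta)
        dx, dy = px - ox, py - oy
        loss += dx * dx + dy * dy
        # d(px)/d(theta) = -r*sin(theta), d(py)/d(theta) = r*cos(theta)
        grad += 2.0 * (dx * (-r * math.sin(theta)) + dy * (r * math.cos(theta)))
    n = len(radii)
    return loss / n, grad / n

def fit_state(radii, observed, theta=0.0, lr=0.5, steps=200):
    """Recover the state by plain gradient descent through the decoder."""
    for _ in range(steps):
        _, g = loss_and_grad(theta, radii, observed)
        theta -= lr * g
    return theta

# Toy observation: the door is open by 0.6 rad; start the estimate at 0.
radii = [0.5, 1.0]
true_theta = 0.6
observed = door_points(true_theta, radii)
estimated_theta = fit_state(radii, observed)
```

Because the loss is differentiable in the state, the same gradient could in principle be passed upstream to a policy or downstream to a dynamics model, which is the kind of integration the list above anticipates.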

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robotic planners could query the learned manifold to test hypothetical actions before execution, potentially lowering sample complexity in real-world reinforcement learning.
The same architecture might support few-shot adaptation to new object categories by treating unseen instances as points on an already-learned manifold.
Integration with existing physics engines could be tested by replacing rigid-body parameters with the neural state predictions and measuring simulation fidelity on manipulation tasks.

Load-bearing premise

That a single neural network can extract a complete, actionable state manifold for arbitrary objects from raw sensor streams without hand-crafted structure, extra supervision, or separate modules.

What would settle it

A controlled benchmark in which WorldString is trained on object interaction videos and then asked to predict the next state after a novel action; failure would be shown if its predictions are no more accurate than a strong dynamic reconstruction baseline on held-out physical sequences.
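The comparison described above could be scored along the following lines. This is a sketch under stated assumptions: the paper names no metric, so the symmetric Chamfer distance between predicted and ground-truth point clouds is an assumed choice, and `beats_baseline` and the toy data are hypothetical.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def beats_baseline(model_preds, baseline_preds, ground_truth):
    """Average each method's error over held-out sequences and report
    whether the model's error is strictly lower than the baseline's."""
    n = len(ground_truth)
    model_err = sum(chamfer_distance(p, g)
                    for p, g in zip(model_preds, ground_truth)) / n
    base_err = sum(chamfer_distance(p, g)
                   for p, g in zip(baseline_preds, ground_truth)) / n
    return model_err, base_err, model_err < base_err

# Toy held-out sequence with a single next-state point cloud (made-up data).
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
model = [[(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]]
baseline = [[(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]]
model_err, base_err, wins = beats_baseline(model, baseline, truth)
```

Under the failure criterion stated above, the claim would be in trouble whenever `wins` is false on held-out physical sequences with novel actions.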

Original abstract

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Sweetly, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WorldString names a neural architecture for learning object state manifolds from raw point clouds or RGB-D but supplies no architecture, losses, or results to show it works.

The paper proposes WorldString, a neural architecture that models the state manifold of real-world objects directly from point clouds or RGB-D streams and serves as a differentiable building block for physical world models. The abstract frames this as filling a gap between video generation and dynamic scene reconstruction, but the text stays at the level of a high-level proposal without any concrete implementation or evaluation.

Referee Report

2 major / 1 minor

Summary. The paper proposes WorldString, a neural architecture for modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. It is presented as a versatile digital twin and foundational building block for physical world models, with a fully differentiable structure to enable integration with policy learning and neural dynamics. The work contrasts this with existing video generation and dynamic scene reconstruction methods, claiming a unified and principled approach to actionable object representations.

Significance. If the central claims hold and the architecture is shown to extract intrinsic action states (e.g., articulation parameters and affordances) without explicit priors or supervision, the result would be significant for physical world modeling. It could provide a reusable, differentiable primitive that bridges perception and control, potentially improving upon methods that rely on canonical frames or action labels. The emphasis on direct learning from raw sensor streams aligns with goals in robotics and simulation, but the absence of any implementation or validation details leaves the practical impact speculative.

major comments (2)

[Abstract] The core claim that WorldString 'models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams' and does so 'in a unified, principled way' without 'additional structure or supervision' is unsupported. No architecture, encoder/decoder structure, loss function, or representation (e.g., how action states are parameterized) is described, so the assertion that raw streams alone suffice cannot be evaluated.

[Abstract] The statement that 'none explicitly model this basic element in a unified, principled way' is not accompanied by any comparison to prior dynamic reconstruction techniques (such as 4D NeRF variants or object-centric dynamics models). These methods typically introduce explicit structure precisely because raw geometry or appearance sequences underdetermine the manifold; without addressing this, the novelty and necessity of WorldString remain ungrounded.

minor comments (1)

The manuscript would benefit from a dedicated section outlining the network architecture, input/output formats, and training objective to make the proposal concrete and reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We respond to each major comment below and indicate where revisions will be made.

Referee: [Abstract] The core claim that WorldString 'models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams' and does so 'in a unified, principled way' without 'additional structure or supervision' is unsupported. No architecture, encoder/decoder structure, loss function, or representation (e.g., how action states are parameterized) is described, so the assertion that raw streams alone suffice cannot be evaluated.

Authors: We agree that the abstract, as a concise summary, does not detail the architecture. The full manuscript describes the neural architecture for processing point clouds and RGB-D streams, the differentiable components, loss functions for manifold learning, and parameterization of action states. We will revise the abstract to include a brief reference to these elements and ensure the main text makes the technical approach explicit.

revision: yes

Referee: [Abstract] The statement that 'none explicitly model this basic element in a unified, principled way' is not accompanied by any comparison to prior dynamic reconstruction techniques (such as 4D NeRF variants or object-centric dynamics models). These methods typically introduce explicit structure precisely because raw geometry or appearance sequences underdetermine the manifold; without addressing this, the novelty and necessity of WorldString remain ungrounded.

Authors: The manuscript contrasts WorldString with video generation and dynamic scene reconstruction approaches in the introduction. We acknowledge that an explicit comparison to 4D NeRF variants and object-centric models would better ground the novelty claim. We will revise the abstract to include a short sentence highlighting how prior methods rely on explicit structures while WorldString learns the manifold directly from raw data without such priors.

revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; proposal is purely conceptual

The paper offers a high-level proposal for WorldString as a neural architecture that models object state manifolds directly from point clouds or RGB-D streams, but contains no equations, loss functions, architectural specifications, or explicit derivation steps. Absent any mathematical chain or fitted parameters, no reductions to inputs by construction, self-definitional loops, or load-bearing self-citations can be identified. The claims function as design assertions rather than derived predictions, rendering the work self-contained at the conceptual level with no circularity to flag.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger reflects the high-level proposal rather than verified technical content. The central claim rests on the assumption that a unified neural representation of object states is both learnable and useful.

axioms (1)

Domain assumption: Objects are actionable entities whose states are determined by intrinsic properties.
Basis: Stated directly in the abstract as the motivation for modeling state manifolds.

invented entities (1)

WorldString (no independent evidence)
Purpose: Neural architecture for modeling object state manifolds from point clouds or RGB-D video.
Basis: Newly introduced name and concept presented as the core contribution.

pith-pipeline@v0.9.0 · 5711 in / 1306 out tokens · 28658 ms · 2026-05-21T07:43:50.766486+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams... unified residual attention mechanism"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Cross-attention is a relaxation of (3.3): it keeps convex mixing but replaces analytic (α_i, v_i) by learned, state-dependent ones"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Cosmos world foundation model platform for physical ai. Technical report, NVIDIA, 2025. Technical report; available as arXiv:2501.03575

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

ROS 2 Documentation and ROS Wiki, 2026

Urdf (unified robot description format). ROS 2 Documentation and ROS Wiki, 2026. https://docs.ros.org/en/humble/Tutorials/Intermediate/URDF/URDF-Main.html andhttps://wiki.ros.org/urdf/XML/model(accessed: 2026-03-04)

work page 2026

[3] [3]

Akenine-Möller, E

T. Akenine-Möller, E. Haines, N. Hoffman, A. Pesce, M. Iwanicki, and S. Hillaire.Real-Time Rendering. Taylor & Francis, 4th edition, 2018. ISBN 978-1-138-62700-0

work page 2018

[4] [4]

E.Aljalbout, J.Xing, A.Romero, I.Akinola, C.R.Garrett, E.Heiden, A.Gupta, T.Hermans, Y.Narang, D. Fox, D. Scaramuzza, and F. Ramos. The reality gap in robotics: Challenges, solutions, and best practices, 2025. URLhttps://arxiv.org/abs/2510.20808

work page arXiv 2025

[5] [5]

VideoPhy: Evaluating Physical Commonsense for Video Generation

H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover. Videophy: Evaluating physical commonsense for video generation, 2024. URLhttps://arxiv. org/abs/2406.03520

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Bathe.Finite Element Procedures

K.-J. Bathe.Finite Element Procedures. K. J. Bathe, Watertown, MA, second edition edition, 2014

work page 2014

[7] [7]

J. F. Blinn and M. E. Newell. Texture and reflection in computer generated images.Commun. ACM, 19(10):542–547, Oct. 1976. ISSN 0001-0782. doi: 10.1145/360349.360353. URLhttps: //doi.org/10.1145/360349.360353

work page doi:10.1145/360349.360353 1976

[8] [8]

Bloomenthal and C

J. Bloomenthal and C. Bajaj, editors.Introduction to Implicit Surfaces. Morgan Kaufmann, 1997. ISBN 1-55860-233-X

work page 1997

[9] [9]

Bonet and R

J. Bonet and R. D. Wood.Nonlinear Continuum Mechanics for Finite Element Analysis. Cambridge University Press, 2nd edition, 2008

work page 2008

[10] [10]

Botsch, L

M. Botsch, L. Kobbelt, M. Pauly, P. Alliez, and B. Lévy.Polygon Mesh Processing. A K Peters, Natick, 2010

work page 2010

[11] [11]

Bruce, M

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. M. E. Bechtle, F. Behbahani, S. C. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. D. Freitas, S. Singh, and T. Rocktäschel. Genie: Generative interactive environments. In R. Sa...

work page 2024

[12] [12]

Bruce, M

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. M. E. Bechtle, F. Behbahani, S. C. Y. Chan, N. Heess, L. Gonzalez, S.Osindero, S.Ozair, S.Reed, J.Zhang, K.Zolna, J.Clune, N.DeFreitas, S.Singh, andT.Rocktäschel. Genie: Generative interactive environments. InProceedings of...

work page 2024

[13] [13]

X. Chen, Y. Zheng, M. J. Black, O. Hilliges, and A. Geiger. Snarf: Differentiable forward skinning for animating non-rigid neural implicit shapes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021

[14] Y. Chen, Z. Chen, C. Zhang, F. Wang, X. Yang, Y. Wang, Z. Cai, L. Yang, H. Liu, and G. Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[15] J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, and M. Felsberg. RoMa: Robust Dense Feature Matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.

[16] K. Erleben, J. Sporring, K. Henriksen, and H. Dohlmann. Physics-based Animation. Charles River Media, Hingham, Mass., 2005. ISBN 1-58450-380-7.

[17] M. Gross and H. Pfister, editors. Point-Based Graphics. Morgan Kaufmann, 2007. ISBN 978-0-12-370604-1.

[18] D. Ha and J. Schmidhuber. World models, 2018.

[19] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. URL https://arxiv.org/abs/2301.04104

[20] D. Hafner, J. Pasukonis, J. Ba, and T. P. Lillicrap. Mastering diverse control tasks through world models. Nature, 640(8059):647–653, 2025.

[21] S. Huang, J. Wu, Q. Zhou, S. Miao, and M. Long. Vid2world: Crafting video diffusion models to interactive world models, 2025. URL https://arxiv.org/abs/2505.14357

[22] Y.-H. Huang, Y.-T. Sun, Z. Yang, X. Lyu, Y.-P. Cao, and X. Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[23] J. F. Hughes, A. van Dam, M. McGuire, D. F. Sklar, J. D. Foley, S. K. Feiner, and K. Akeley. Computer Graphics: Principles and Practice. Addison-Wesley, 3rd edition, 2014. ISBN 978-0-321-39952-6.

[24] H. Jiang, H.-Y. Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y. Li. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[25] H. Jiang, H.-Y. Hsu, K. Zhang, H.-N. Yu, S. Wang, and Y. Li. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos, 2025. URL https://arxiv.org/abs/2503.17973

[26] N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proc. arXiv:2410.11831, 2024.

[27] K. Karunratanakul, A. Spurr, Z. Fan, O. Hilliges, and S. Tang. A skeleton-driven neural occupancy representation for articulated hands. In International Conference on 3D Vision (3DV), 2021.

[28] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 2023.

[29] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

[30] V. Krishnamurthy and M. Levoy. Fitting smooth surfaces to dense polygon meshes. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, page 313–324, New York, NY, USA, 1996. Association for Computing Machinery. ISBN 0897917464. doi: 10.1145/237170.237270. URL https://doi.org/10.1145/237170.237270

[31] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, J. Shade, and D. Fulk. The digital michelangelo project: 3d scanning of large statues. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, page 131–144, USA, 2000. ACM Press/Addison-... doi: 10.1145/344779.344849

[32] R. Liu, A. Canberk, S. Song, and C. Vondrick. Differentiable robot rendering, 2024. URL https://arxiv.org/abs/2410.13851

[33] R. Liu, A. Canberk, S. Song, and C. Vondrick. Differentiable robot rendering. In P. Agrawal, O. Kroemer, and W. Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 117–129. PMLR, 06–09 Nov 2025. URL https://proceedings.mlr.press/v270/liu25a.html

[34] Y.-L. Liu, C. Gao, A. Meuleman, H.-Y. Tseng, A. Saraf, C. Kim, Y.-Y. Chuang, J. Kopf, and J.-B. Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[35] X. Long, Q. Zhao, K. Zhang, Z. Zhang, D. Wang, Y. Liu, Z. Shu, Y. Lu, S. Wang, X. Wei, W. Li, W. Yin, Y. Yao, J. Pan, Q. Shen, R. Yang, X. Cao, and Q. Dai. A survey: Learning embodied intelligence from physical simulators and world models, 2025. URL https://arxiv.org/abs/2507.00917

[36] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.

[37] G. Lu, B. Jia, P. Li, Y. Chen, Z. Wang, Y. Tang, and S. Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9263–9274, October 2025.

[38] J. Luiten, G. Kopanas, B. Leibe, and D. Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In International Conference on 3D Vision (3DV), 2024.

[39] K. M. Lynch and F. C. Park. Modern Robotics: Mechanics, Planning, and Control. Cambridge University Press, 2017. ISBN 978-1-108-50969-5.

[40] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[41] R. Parent. Computer Animation: Algorithms and Techniques. Morgan Kaufmann, 3rd edition, 2012.

[42] S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[43] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[44] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024.

[45] R. Sakagami, F. S. Lay, A. Dömel, M. J. Schuster, A. Albu-Schäffer, and F. Stulp. Robotic world models—conceptualization, review, and engineering best practices. Frontiers in Robotics and AI, 10, 2023. doi: 10.3389/frobt.2023.1253049. URL https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2023.1253049/full

[47] M. R. Samsami, A. Zholus, J. Rajendran, and S. Chandar. Mastering memory tasks with world models. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=1vDArHJ68h

[48] M. W. Spong, S. Hutchinson, and M. Vidyasagar. Robot Modeling and Control. John Wiley & Sons, 2006.

[49] J. Tang, M. Lev, W. Bi, T. Justus, and M. Nießner. Neural shape deformation priors. In Advances in Neural Information Processing Systems, 2022.

[50] G. Turk and M. Levoy. Zippered polygon meshes from range images. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’94, page 311–318, New York, NY, USA, 1994. Association for Computing Machinery. ISBN 0897916670. doi: 10.1145/192161.192241. URL https://doi.org/10.1145/192161.192241

[51] G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and X. Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[52] J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024.

[53] Q. Xu, J. Liu, S. Yu, Y. Wang, Y. Zhou, J. Zhou, J. Cui, Y.-S. Ong, and H. Zhang. Neuspring: Neural spring fields for reconstruction and simulation of deformable objects from videos. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026. arXiv:2511.08310

[54] W. Xu, H. Fu, H. Dong, Z. Zhou, and C. Chen. Deal: Diffusion evolution adversarial learning for sim-to-real transfer. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id=284GWLFtjU. Poster

[55] X. Yang, Z. Ji, and Y.-K. Lai. Differentiable physics-based system identification for robotic manipulation of elastoplastic materials, 2024. URL https://arxiv.org/abs/2411.00554

[56] H.-X. Yu, H. Duan, C. Herrmann, W. T. Freeman, and J. Wu. Wonderworld: Interactive 3d scene generation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5916–5926, June 2025.

[57] C. Zhang, D. Cherniavskii, A. Tragoudaras, A. Vozikis, T. Nijdam, D. W. E. Prinzhorn, M. Bodracska, N. Sebe, A. Zadaianchuk, and E. Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments, 2025. URL https://arxiv.org/abs/2504.02918

[58] J. Zheng, Z. Zhu, V. Bieri, M. Pollefeys, S. Peng, and I. Armeni. Wildgs-slam: Monocular gaussian splatting slam in dynamic environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11461–11471, June 2025.

[59] O. C. Zienkiewicz, R. L. Taylor, and D. D. Fox. The Finite Element Method for Solid and Structural Mechanics. Elsevier/Butterworth-Heinemann, Amsterdam, 7th edition, 2014.

[60] S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017.