arxiv: 2510.26782 · v3 · submitted 2025-10-30 · 💻 cs.LG · cs.AI · cs.CV

Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models

Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

Pith reviewed 2026-05-18 02:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords world models · latent geometry · temporal contrastive learning · deterministic environments · long-horizon prediction · geometric regularization · autoencoders


The pith

High-fidelity cloning of deterministic 3D worlds is feasible once latent representations capture the geometry of the physical state manifold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to build world models that accurately simulate deterministic settings such as fixed mazes and static robot navigation over long time horizons. Diagnostic experiments show that cloning itself is achievable and that accumulated errors in long predictions stem mainly from poor geometric structure in the latent space rather than weaknesses in the dynamics predictor. The authors apply temporal contrastive learning to regularize the latent space inside standard autoencoders, producing representations that better match the underlying physical manifold. This regularization supplies a stable foundation for the dynamics model without requiring changes to its architecture. The result is a simple pipeline called Geometrically-Regularized World Models that improves fidelity by focusing on representation quality.
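The accumulated error described above is reported in the figure captions as frame-wise MSE between predicted and ground-truth rollouts. A minimal sketch of that metric follows, assuming frames are flattened lists of floats; the function names and the toy data are illustrative and do not come from the paper.

```python
from typing import List, Sequence

Frame = Sequence[float]  # a flattened image; pixel values are plain floats


def frame_mse(pred: Frame, target: Frame) -> float:
    """Mean squared error between one predicted frame and its ground truth."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def framewise_mse(pred_rollout: Sequence[Frame],
                  true_rollout: Sequence[Frame]) -> List[float]:
    """One MSE value per time step, so error growth over the horizon is visible."""
    if len(pred_rollout) != len(true_rollout):
        raise ValueError("rollouts must have the same length")
    return [frame_mse(p, t) for p, t in zip(pred_rollout, true_rollout)]


# Toy data (hypothetical): 2-pixel frames whose prediction drifts by 0.5 per step.
true_rollout = [[0.0, 0.0] for _ in range(4)]
pred_rollout = [[0.5 * t, 0.5 * t] for t in range(4)]
errors = framewise_mse(pred_rollout, true_rollout)
```

Keeping one value per step, rather than a single average over the rollout, is what makes it possible to see whether error stays flat or compounds as the horizon grows.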

Core claim

Through diagnostic experiments we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. Building on this insight we show that applying temporal contrastive learning as geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold; we call this approach Geometrically-Regularized World Models. At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders.

What carries the argument

Geometrically-Regularized World Models (GRWM), a lightweight geometric regularization module based on temporal contrastive learning that is added to standard autoencoders to align their latent space with the physical state manifold.
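The page does not give the exact loss used in GRWM. As a hedged illustration of the general idea, the sketch below uses a generic InfoNCE-style temporal contrastive term: the latent of a temporally nearby frame is the positive, and the other latents in the batch are negatives. The names `temporal_info_nce` and `regularized_loss`, the cosine similarity, the temperature, and the weighting scheme are all assumptions for this example, not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def temporal_info_nce(anchors: Sequence[Vector],
                      positives: Sequence[Vector],
                      temperature: float = 0.1) -> float:
    """InfoNCE over a batch of latents.

    positives[i] is the latent of a frame temporally close to anchors[i];
    every other positives[j] serves as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


def regularized_loss(recon_loss: float,
                     anchors: Sequence[Vector],
                     positives: Sequence[Vector],
                     weight: float = 1.0,
                     temperature: float = 0.1) -> float:
    """Autoencoder reconstruction loss plus a weighted contrastive term (assumed form)."""
    return recon_loss + weight * temporal_info_nce(anchors, positives, temperature)


# Toy latents (hypothetical): temporally adjacent frames map to nearby latents
# in `aligned`, and to the wrong neighbours in `swapped`.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = temporal_info_nce(anchors, aligned)
loss_swapped = temporal_info_nce(anchors, swapped)
```

In this toy case the loss is lower when temporally adjacent frames land close together in latent space, which is the pressure a regularizer of this kind would add on top of reconstruction.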

If this is right

High-fidelity cloning becomes possible for deterministic 3D worlds such as fixed-map mazes and static robot navigation.
The dynamics model itself is not the dominant limit on long-horizon performance once latent geometry is addressed.
Temporal contrastive learning supplies an effective inductive bias that stabilizes world-model predictions.
A lightweight regularization module can be added to existing autoencoder-based world models without redesigning the predictor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contrastive regularization might reduce error accumulation in stochastic or partially observed environments if the contrastive pairs are chosen to reflect observable state transitions.
Improved latent manifolds could allow planning algorithms to generate longer reliable trajectories without frequent replanning.
Applying the method to real-robot sensor data would test whether the learned manifold remains consistent when observation noise and minor stochasticity are present.

Load-bearing premise

The diagnostic experiments correctly identify latent geometry rather than model capacity or training data as the main source of long-horizon error.

What would settle it

A controlled test in which latent geometry is forced to match the physical manifold yet long-horizon prediction error remains high would falsify the claim that geometry is the primary bottleneck.
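Running such a test requires some way to check that latent geometry actually matches the physical manifold. One possible check, sketched here as an assumption rather than as the paper's diagnostic, is the Pearson correlation between pairwise distances in latent space and pairwise distances between ground-truth states. The name `geometry_score` and the toy data are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(points: Sequence[Point]) -> List[float]:
    """Distances for every unordered pair, in a fixed (i < j) order."""
    return [euclidean(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances are constant; correlation is undefined")
    return cov / math.sqrt(var_x * var_y)


def geometry_score(latents: Sequence[Point], states: Sequence[Point]) -> float:
    """Close to 1.0 when latent distances track true state distances."""
    return pearson(pairwise_distances(latents), pairwise_distances(states))


# Toy data (hypothetical): four agent positions along a corridor.
states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
faithful = [(2.0 * x, 2.0 * y) for x, y in states]            # scaled copy of the states
scrambled = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # neighbours torn apart
score_faithful = geometry_score(faithful, states)
score_scrambled = geometry_score(scrambled, states)
```

A model that scores high on a check like this while still showing large long-horizon error would be the counterexample described above.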

Figures

Figures reproduced from arXiv: 2510.26782 by Xinyi Li, Yifan Xu, Yubei Chen, Yukuan Lu, Zaishuo Xia.

**Figure 1.** Representation quality is the primary bottleneck for world model fidelity. Frame-wise MSE on the Maze 3x3 dataset. (Left) An oracle model using ground-truth states (black dotted) achieves near-zero error, establishing a performance upper bound. In contrast, a standard VAE-based world model (blue dashed) accumulates error rapidly. Our GRWM (green solid) significantly closes this gap by learning a more struc…

**Figure 2.** Top-down visualizations of our three closed environments: M3×3-DET, M9×9-DET, and MC-DET. These maps illustrate the overall layout and are for visualization purposes only; they are not provided as input to the agent. The agent’s input is restricted to first-person observations. For a more representative depiction of the agent’s surroundings, high-angle perspective views are also included in Appendix, offer…

**Figure 3.** Rollout Performance. Frame-wise MSE between predicted and ground-truth trajectories on (a) M3x3-DET, (b) M9x9-DET, and (c) MC-DET datasets. The oracle model (black dotted line), which operates on the true underlying states, establishes a lower bound on error. For all three dynamics models, Diffusion Forcing (DF), Video Diffusion (VD), and Standard Diffusion (SD), our GRWM (solid lines) consistently outperfor…

**Figure 4.** Qualitative comparison of medium-horizon rollouts in M9x9-DET. We visualize consecutive frames around frame 100 and frame 400. Our method (GRWM) maintains high similarity to the ground truth throughout, while the baseline VAE-WM gets trapped near the pink wall, indicating that VAE-WM tends to “teleport” between visually similar but distinct locations. The results reveal a critical failure mode in the basel…

**Figure 5.** Qualitative comparison of medium-horizon rollouts in MC-DET. We visualize rollouts from a baseline VAE-based world model (VAE-WM, middle) and our method (GRWM, bottom) against the ground truth (top). The baseline VAE-WM fails to model the complex camera trajectory, diverging significantly and rendering incorrect objects (e.g., trees instead of the stone wall at frame 60). Our method (GRWM) successfully tra…

**Figure 6.** Qualitative comparison of ultra long-horizon rollouts on the Maze 9x9-CE dataset. Frames are sampled every 1000 steps from a 10,000-step rollout. The baseline VAE-WM frequently gets stuck generating the same color states, failing to explore the environment effectively. In contrast, GRWM produces a coherent and diverse trajectory, successfully exploring different regions while preserving long-term temporal …

**Figure 7.** Visualization of latent space structure through clustering analysis. We perform k-means clustering (k = 20) on the latent representations of frames. Each point in the plots corresponds to a frame, positioned according to its true (x, y) coordinates in the environment. The (x, y) coordinates are normalized and lie within [−1, 1]. Points are colored based on their assigned latent cluster ID. The top row (VAE…

**Figure 9.** Ablation study on the impact of latent dimension. GRWM (solid lines) consistently and significantly outperforms the vanilla VAE baseline (dashed lines) across all tested latent dimensions (16, 32, 64, and 128). Notably, our method’s performance is remarkably robust to the choice of latent dimension, while the baseline’s performance is highly sensitive. The benefits of our regularization are independent of…

**Figure 10.** Visualization of generated frames at multiple time points from a single starting state. We show…

**Figure 11.** Visualization of generated frames from the MC-DET sequence.

**Figure 12.** Representative trajectories from the three datasets. Each plot shows a sample trajectory overlaid…

**Figure 13.** High-angle perspective views of the three evaluation environments. These renderings provide an intuitive, three-dimensional understanding of the maze layouts that complements the 2D top-down maps in the main text.
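The clustering analysis described in the Figure 7 caption (k-means on frame latents, with each frame then placed at its true (x, y) position) can be approximated with a short sketch. Everything below is illustrative: the pure-Python `kmeans` uses a deterministic first-k initialization for reproducibility, and `spatial_spread` is a hypothetical numerical stand-in for the visual check, since the caption only describes plots.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def kmeans(points: Sequence[Point], k: int, iters: int = 20) -> List[int]:
    """Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels


def spatial_spread(labels: Sequence[int], xy: Sequence[Point]) -> float:
    """Mean distance from a frame's true (x, y) to its latent cluster's spatial centre."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(xy, labels) if l == c]
        centre = mean_point(members)
        total += sum(math.sqrt(sq_dist(p, centre)) for p in members)
    return total / len(xy)


# Toy data (hypothetical): four frames taken at two distinct locations.
xy = [(-1.0, -1.0), (1.0, 1.0), (-0.9, -1.0), (0.9, 1.0)]
latents_good = list(xy)                                        # latents mirror position
latents_bad = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0), (0.1, 0.0)]  # distant frames collide
labels_good = kmeans(latents_good, k=2)
labels_bad = kmeans(latents_bad, k=2)
spread_good = spatial_spread(labels_good, xy)
spread_bad = spatial_spread(labels_bad, xy)
```

A small spread means each latent cluster covers one compact region of the map; a large spread means one cluster mixes frames from far-apart locations, the pattern the Figure 4 caption calls "teleporting".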

Original abstract

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying temporal contrastive learning principle as a geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper claims latent geometry is the main bottleneck for long-horizon deterministic world models and offers temporal contrastive regularization as the fix, but the supporting diagnostics remain undescribed.


The main thing here is that the authors argue high-fidelity cloning of deterministic 3D worlds is achievable once the latent representation matches the physical state manifold, and they propose temporal contrastive learning as the tool to enforce that geometry. They introduce GRWM, a lightweight regularization module that plugs into standard autoencoders to reshape the latent space before the dynamics model runs. The diagnostic experiments are said to show that geometry, not the predictor itself, drives the long-horizon errors in settings like mazes or static robot navigation. That separation of concerns is a useful framing if it holds up, because it directs effort toward representation rather than just bigger dynamics networks. The approach is simple enough that it could be tested quickly on existing world-model pipelines.

The soft spots are straightforward. Only the abstract is available, so there are no numbers, no description of the metrics for long-horizon fidelity, and no account of how the diagnostics held model capacity, data volume, and optimizer fixed while changing only the latent geometry. Without those controls the claim that geometry is the primary bottleneck cannot be evaluated, and the stress-test concern about missing ablations is on target. Contrastive methods are already common in representation learning, so the novelty sits mainly in the targeted application rather than a new principle.

This is for people working on world models for planning in structured, deterministic environments. A reader who cares about long-horizon fidelity in robotics or agents could extract a practical idea from the geometric regularization step. The paper has a focused claim and a concrete method, so it deserves a serious referee who can check the experimental design and results. I would send it for peer review rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The paper claims that high-fidelity cloning of deterministic 3D worlds is feasible in world models and that the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation rather than the dynamics model itself. It introduces Geometrically-Regularized World Models (GRWM), which apply temporal contrastive learning as a geometric regularization to curate a latent space that better reflects the underlying physical state manifold; this lightweight module can be integrated into standard autoencoders to improve stability and fidelity.

Significance. If the diagnostic experiments hold, the result would be significant for model-based reinforcement learning and planning: it would shift emphasis from dynamics predictors to representation geometry and offer a simple, integrable regularization technique for accurate simulation in fixed deterministic settings such as mazes and robot navigation.

major comments (1)

[Abstract] The central claim rests on diagnostic experiments that 'quantitatively demonstrate' high-fidelity cloning is feasible and that latent geometry, not the dynamics model, is the primary bottleneck. No description is given of the experimental design, the quantitative metrics for long-horizon fidelity, or the controls that isolate latent geometry while holding model capacity, training data volume, and optimizer settings fixed. This is load-bearing for the claim that geometry is the dominant factor.
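The kind of control the referee asks for can be made concrete with a small configuration sketch: a sweep in which only the geometric regularization weight changes while capacity, data volume, optimizer settings, and seed stay fixed. The `RunConfig` fields and every default value below are hypothetical placeholders, not settings taken from the paper.

```python
from dataclasses import asdict, dataclass, replace
from typing import List, Sequence, Set


@dataclass(frozen=True)
class RunConfig:
    # All defaults are placeholder values for illustration only.
    latent_dim: int = 64              # model capacity proxy, held fixed
    train_frames: int = 100_000       # training data volume, held fixed
    optimizer: str = "adam"           # optimizer choice, held fixed
    learning_rate: float = 1e-3       # optimizer setting, held fixed
    seed: int = 0                     # random seed, held fixed
    contrastive_weight: float = 0.0   # the only factor the sweep may change


def geometry_sweep(base: RunConfig, weights: Sequence[float]) -> List[RunConfig]:
    """Copies of `base` that differ only in the regularization weight."""
    return [replace(base, contrastive_weight=w) for w in weights]


def varied_fields(configs: Sequence[RunConfig]) -> Set[str]:
    """Names of the fields that are not identical across all configs."""
    dicts = [asdict(c) for c in configs]
    return {key for key in dicts[0]
            if any(d[key] != dicts[0][key] for d in dicts)}


configs = geometry_sweep(RunConfig(), [0.0, 0.1, 1.0])
changed = varied_fields(configs)
```

Asserting that `changed` contains only `contrastive_weight` before launching runs is a cheap guard against a confound slipping into the comparison.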

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and outline a targeted revision to the abstract.


Referee: [Abstract] The central claim rests on diagnostic experiments that 'quantitatively demonstrate' high-fidelity cloning is feasible and that latent geometry, not the dynamics model, is the primary bottleneck. No description is given of the experimental design, the quantitative metrics for long-horizon fidelity, or the controls that isolate latent geometry while holding model capacity, training data volume, and optimizer settings fixed. This is load-bearing for the claim that geometry is the dominant factor.

Authors: We agree that the abstract does not describe the experimental design, metrics, or controls in detail, which is a valid point since these support our central claim. We will revise the abstract to incorporate a concise summary of the diagnostic experiments. This will include mentioning the use of long-horizon prediction accuracy as the quantitative metric for fidelity and noting that controls were implemented by holding model capacity, training data volume, and optimizer settings fixed while varying the latent geometry regularization. We believe this will better substantiate the claim that latent geometry is the primary bottleneck.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The abstract presents a two-part claim: diagnostic experiments identify latent geometry as the primary bottleneck for long-horizon fidelity, followed by the introduction of temporal contrastive learning as geometric regularization in GRWM. No equations, fitted parameters, or derivation steps are provided that reduce a claimed prediction or result to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the text. The approach treats contrastive learning as an external inductive bias rather than a self-referential fit, and the diagnostic claim is framed as an empirical observation rather than a tautological renaming or forced prediction. The derivation chain therefore remains self-contained against the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces GRWM but provides no explicit free parameters or new entities in the abstract; it relies on standard autoencoder assumptions.

axioms (1)

Domain assumption: temporal contrastive learning serves as an effective inductive bias for shaping latent spaces to match physical manifolds.
Invoked in the description of GRWM.

pith-pipeline@v0.9.0 · 7559 in / 1009 out tokens · 38894 ms · 2026-05-18T02:47:37.441884+00:00 · methodology

