Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

Jiayi Fang

arxiv: 2605.28865 · v1 · pith:S2SD3QU7new · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

Jiayi Fang This is my paper

Pith reviewed 2026-06-30 15:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords world modelsVAEembodied explorationlatent representationsphysical geometrysemantic structureunsupervised learningKL regularization

0 comments

The pith

Physical exploration induces world model latents organized by real geometry without any language input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a VAE-based world model trained solely on random physical interactions develops a latent space whose structure mirrors the geometry of the explored environment. Direction accuracy reaches 0.677 and position RSA reaches 0.192, far above random-encoder baselines, and these semantic measures rise in tandem with prediction accuracy across training checkpoints. A controlled change in KL regularization strength simultaneously disables geometric structure and collapses both prediction and alignment, then restores them when the regularization is relaxed. The shared-driver pattern indicates that physical geometry itself supplies the organizing principle for the emergent representations.

Core claim

Training a VAE-based world model on random embodied exploration leads to a latent space with spatial semantic structure that mirrors physical geometry. This is shown by direction accuracy of 0.677 versus 0.547 for random encoders and position RSA of 0.192 versus 0.029, a 6.6-fold gain. Prediction performance and semantic alignment improve together (Spearman r = -0.61). Standard beta regularization pushes the encoder away from geometric structure and drives both prediction and alignment to near chance; lowering beta restores geometric access and recovers both capabilities at once.

What carries the argument

The beta-regularized VAE encoder whose latent space is shaped by physical interaction trajectories, with lower beta values permitting the emergence of direction and position encodings that align with actual geometry.

If this is right

Prediction performance and semantic alignment with physical geometry improve together over the course of training.
Raising the KL regularization coefficient forces the latent space away from geometric structure and simultaneously destroys both prediction accuracy and semantic alignment.
Lowering the regularization coefficient restores access to geometric structure and recovers both prediction and alignment.
The same physical-interaction data that drives prediction also drives the formation of spatially meaningful representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design choices for embodied agents could prioritize interaction data volume and low regularization over explicit semantic supervision.
The same geometric-organization effect may appear in other world-model architectures trained on interaction trajectories in different environments.
Measuring latent alignment with physical geometry could serve as an early diagnostic for whether a world model will later support spatial reasoning tasks.

Load-bearing premise

The measured gains in direction accuracy and position RSA arise specifically from the geometric structure present in the physical interaction data rather than from the CNN architecture or other fixed training choices.

What would settle it

A trained model that shows no gain in direction accuracy or position RSA over its random-encoder baseline even after beta is lowered to 0.001 would falsify the claim that physical geometry organizes the representations.

Figures

Figures reproduced from arXiv: 2605.28865 by Jiayi Fang.

**Figure 2.** Figure 2: Double knockout in Empty-8x8: β = 0.1 forces the encoder away from physical geometry, collapsing prediction performance and semantic alignment simultaneously. Left: Direction accuracy drops 37.8 points by step 100k. Center: Y-position R2 growth reverses under collapse. Right: Mean pairwise distance reaches zero. β = 0.001 recovers both capabilities together. Replication in a larger environment. We replicat… view at source ↗

**Figure 3.** Figure 3: Double knockout replication in Empty-16x16. Left: Direction accuracy—β = 0.001 rises to 0.490, β = 0.1 stays at chance. Center: Y-position R2—near zero for both, as 100k steps are insufficient to learn position regression across 196 navigable cells; this does not affect the knockout conclusion. Right: Pairwise distance—β = 0.1 collapses by step 25k, β = 0.001 remains stable throughout. The simultaneous fai… view at source ↗

read the original abstract

What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry -- direction accuracy 0.677+-0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192+-0.047 versus 0.029 for random encoders (6.6x improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r=-0.61, p=0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta=0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near-chance by step 50,000 -- exactly as the shared-driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Embodied VAE training produces measurable gains in direction accuracy and position RSA over random encoders, plus a beta ablation that links prediction and alignment, but the controls leave CNN spatial bias and policy sampling as open alternatives.

read the letter

The main result is that a VAE trained on random embodied trajectories ends up with latents that match physical direction at 0.677 accuracy versus 0.547 for a random encoder, and position RSA at 0.192 versus 0.029. They also show that prediction error and this alignment improve together across checkpoints (Spearman r=-0.61), and that raising beta to 0.1 collapses both while lowering it to 0.001 brings them back. That double-knockout and the temporal tracking are the concrete additions.

The beta manipulation is a useful way to test whether the same driver supports both capabilities, and reporting the actual numbers plus the random-encoder baseline gives something falsifiable to look at. It sits on top of existing world-model work without claiming a full rewrite.

The concern that stands out is whether the gains really trace to physical geometry rather than the CNN's built-in translation equivariance or the fact that the data comes from an embodied policy that already samples structured paths. The random-encoder control keeps the CNN but not the training signal, and nothing in the abstract shows a matched run on shuffled frames or static views. If those produce similar alignment, the geometry story weakens. The abstract also leaves dataset size, exact architecture, and statistical details thin, so the support stays moderate until the methods section is checked.

This is for people working on unsupervised representations in robotics and RL. A reader already following world models would pick up the ablation pattern and the correlation numbers. It is coherent enough on its own terms to go to a serious referee, though the review should focus on whether additional non-embodied controls close the gap.

Referee Report

2 major / 2 minor

Summary. The paper claims that a VAE-based world model trained on random embodied exploration trajectories develops latent representations organized by physical geometry without linguistic supervision. Evidence includes direction accuracy of 0.677±0.029 (vs. 0.547 for random encoder), position RSA of 0.192±0.047 (vs. 0.029, 6.6x improvement), co-improvement of prediction performance and semantic alignment across 20 checkpoints (Spearman r=-0.61, p=0.004), and a beta ablation where KL regularization at 0.1 collapses both metrics while reducing beta to 0.001 restores them.

Significance. If the attribution to physical geometry holds, the results would indicate that embodied interaction can induce semantically grounded structure in world models, informing design of embodied agents. The temporal correlation analysis and double-knockout beta experiment are positive elements that tie performance and alignment together.

major comments (2)

[Abstract (beta ablation and controls)] Abstract (controls paragraph): The random-initialized encoder baseline and beta=0.1 ablation do not isolate physical geometry from the CNN's translation-equivariant filters or the structured sampling produced by the embodied policy. A control training the same architecture on non-embodied data (static views or shuffled frames) is required to support the claim that the measured gains in direction accuracy and position RSA arise specifically from physical interaction rather than architecture or data-collection confounds. This directly affects the central assertion that physical world geometry is the organizing principle.
[Abstract (correlation analysis)] Abstract (correlation paragraph): The reported Spearman correlation is consistent with a shared driver but does not distinguish geometry induction from other co-varying factors during training; an additional non-embodied training condition would be needed to strengthen the interpretation.

minor comments (2)

Provide full details on the environment, dataset size, number of trajectories, and exact statistical procedures for computing direction accuracy and position RSA.
Clarify whether the random-encoder baseline uses the same weight initialization distribution as the trained model or a different one.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our controls. We address each point below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [Abstract (beta ablation and controls)] Abstract (controls paragraph): The random-initialized encoder baseline and beta=0.1 ablation do not isolate physical geometry from the CNN's translation-equivariant filters or the structured sampling produced by the embodied policy. A control training the same architecture on non-embodied data (static views or shuffled frames) is required to support the claim that the measured gains in direction accuracy and position RSA arise specifically from physical interaction rather than architecture or data-collection confounds. This directly affects the central assertion that physical world geometry is the organizing principle.

Authors: We agree that a non-embodied control (static views or shuffled frames) would more cleanly isolate the contribution of physical interaction from architecture and data-collection structure. The random encoder already shows that training is required beyond CNN biases, and the beta ablation demonstrates that high regularization collapses both prediction and alignment on the same embodied data. To strengthen the central claim, we will add the requested non-embodied training conditions and report direction accuracy and position RSA for them in the revised manuscript. revision: yes
Referee: [Abstract (correlation analysis)] Abstract (correlation paragraph): The reported Spearman correlation is consistent with a shared driver but does not distinguish geometry induction from other co-varying factors during training; an additional non-embodied training condition would be needed to strengthen the interpretation.

Authors: We acknowledge that the Spearman correlation by itself leaves room for alternative co-varying factors. The beta ablation already functions as an internal control by removing geometric structure while preserving the training data and objective. We will extend the correlation analysis to include the non-embodied training runs, allowing direct comparison of the co-improvement trajectory under conditions that lack physical interaction. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements and ablations are independent of definitional reduction

full rationale

The paper reports results from training a VAE-based world model on embodied trajectories, then measuring emergent properties (direction accuracy, position RSA) against random-encoder baselines and performing beta-regularization ablations. These steps rely on experimental outcomes and statistical correlations rather than any equation or definition that reduces the claimed geometric organization to a fitted input or self-citation by construction. No load-bearing premise is justified solely by prior author work, and the derivation chain consists of observable training dynamics and controls that remain falsifiable outside the reported numbers.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work rests on standard VAE assumptions about latent-space organization and the validity of RSA and accuracy as proxies for semantic structure; beta is treated as a tunable hyperparameter.

free parameters (1)

beta = 0.1 / 0.001
KL regularization coefficient set to 0.1 or 0.001 to produce the knockout and recovery effects.

axioms (1)

domain assumption VAE encoder can capture geometric structure from embodied trajectories when KL weight is low.
Invoked to interpret the beta ablation results.

pith-pipeline@v0.9.1-grok · 5750 in / 1236 out tokens · 44371 ms · 2026-06-30T15:39:45.958307+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 8

2023
[2]

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

VICReg: Variance-invariance-covariance regularization for self-supervised learning

Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. InInternational Conference on Learning Representations (ICLR), 2022

2022
[4]

Barsalou

Lawrence W. Barsalou. Perceptual symbol systems.Behavioral and Brain Sciences, 22(4):577–660, 1999

1999
[5]

RT-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023

2023
[6]

MiniGrid: A minimalist gridworld environment for openai gym, 2018

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. MiniGrid: A minimalist gridworld environment for openai gym, 2018

2018
[7]

Oxford University Press, 2016

Andy Clark.Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2016

2016
[8]

The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

Karl Friston. The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

2010
[9]

Learning latent action world models in the wild

Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230, 2026

work page arXiv 2026
[10]

Bootstrap your own latent: A new approach to self- supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, et al. Bootstrap your own latent: A new approach to self- supervised learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020
[11]

Mastering Diverse Domains through World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1–3):335–346, 1990

Stevan Harnad. The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1–3):335–346, 1990

1990
[13]

Emergent multi-agent communication in the deep learning era

Angeliki Lazaridou and Marco Baroni. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419, 2020

work page arXiv 2006
[14]

Don’t blame the ELBO! a linear V AE perspective on posterior collapse

James Lucas, George Tucker, Roger Grosse, and Mohammad Norouzi. Don’t blame the ELBO! a linear V AE perspective on posterior collapse. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019
[15]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. InInternational Conference on Machine Learning (ICML), 2017

2017
[16]

International Universities Press, 1952

Jean Piaget.The Origins of Intelligence in Children. International Universities Press, 1952

1952
[17]

Harvard University Press, 1999

Michael Tomasello.The Cultural Origins of Human Cognition. Harvard University Press, 1999

1999
[18]

World Action Models: The Next Frontier in Embodied AI

Siyin Wang, Junhao Shi, Zhaoyang Fu, et al. World action models: The next frontier in embodied AI. arXiv preprint arXiv:2605.12090, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Barlow twins: Self-supervised learning via redundancy reduction

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. InInternational Conference on Machine Learning (ICML), 2021. 9

2021

[1] [1]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 8

2023

[2] [2]

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

VICReg: Variance-invariance-covariance regularization for self-supervised learning

Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. InInternational Conference on Learning Representations (ICLR), 2022

2022

[4] [4]

Barsalou

Lawrence W. Barsalou. Perceptual symbol systems.Behavioral and Brain Sciences, 22(4):577–660, 1999

1999

[5] [5]

RT-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023

2023

[6] [6]

MiniGrid: A minimalist gridworld environment for openai gym, 2018

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. MiniGrid: A minimalist gridworld environment for openai gym, 2018

2018

[7] [7]

Oxford University Press, 2016

Andy Clark.Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2016

2016

[8] [8]

The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

Karl Friston. The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

2010

[9] [9]

Learning latent action world models in the wild

Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230, 2026

work page arXiv 2026

[10] [10]

Bootstrap your own latent: A new approach to self- supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, et al. Bootstrap your own latent: A new approach to self- supervised learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020

[11] [11]

Mastering Diverse Domains through World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1–3):335–346, 1990

Stevan Harnad. The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1–3):335–346, 1990

1990

[13] [13]

Emergent multi-agent communication in the deep learning era

Angeliki Lazaridou and Marco Baroni. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419, 2020

work page arXiv 2006

[14] [14]

Don’t blame the ELBO! a linear V AE perspective on posterior collapse

James Lucas, George Tucker, Roger Grosse, and Mohammad Norouzi. Don’t blame the ELBO! a linear V AE perspective on posterior collapse. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019

[15] [15]

Efros, and Trevor Darrell

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. InInternational Conference on Machine Learning (ICML), 2017

2017

[16] [16]

International Universities Press, 1952

Jean Piaget.The Origins of Intelligence in Children. International Universities Press, 1952

1952

[17] [17]

Harvard University Press, 1999

Michael Tomasello.The Cultural Origins of Human Cognition. Harvard University Press, 1999

1999

[18] [18]

World Action Models: The Next Frontier in Embodied AI

Siyin Wang, Junhao Shi, Zhaoyang Fu, et al. World action models: The next frontier in embodied AI. arXiv preprint arXiv:2605.12090, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Barlow twins: Self-supervised learning via redundancy reduction

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. InInternational Conference on Machine Learning (ICML), 2021. 9

2021