
arxiv: 2511.15407 · v4 · pith:TQMC7C3Znew · submitted 2025-11-19 · 💻 cs.AI · cs.CV · cs.LG

IPR-1: Interactive Physical Reasoner

Pith reviewed 2026-05-17 20:51 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG
keywords physical reasoning · interactive agents · world models · vision-language models · game benchmarks · zero-shot transfer · causality learning

The pith

An interactive physical reasoner learns causal physics from game play and surpasses GPT-5 overall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether agents can develop human-like physical reasoning by observing and interacting with environments, internalizing physics and causality through experience. It introduces the Game-to-Unseen benchmark consisting of over 1000 heterogeneous games with visual domain gaps to test true understanding rather than pattern matching. Current VLMs and world models are limited, as VLMs lack interactive look-ahead and world models imitate visuals instead of analyzing causality. To address this, the authors propose the Interactive Physical Reasoner which uses world-model rollouts to score and reinforce a vision-language model's policy, aided by PhysCode that provides a physics-aligned action space. Experiments show the pretrained model handles tasks from basic intuition to complex goal reasoning, outperforms GPT-5, scales with more games and steps, and transfers zero-shot to new games, indicating that focused physical interaction enables progressive improvement in reasoning.

Core claim

Pretrained on more than 1000 games, the Interactive Physical Reasoner performs robustly across levels of physical reasoning from primitive intuition to goal-driven tasks, surpasses GPT-5, improves as training games and interaction steps increase, and zero-shot transfers to unseen games.

What carries the argument

IPR framework that employs world-model rollouts to score and reinforce VLM policy, combined with PhysCode, a physics-centric action code that aligns semantic intent with underlying dynamics to create a shared space for prediction and reasoning.
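
As a rough illustration of the rollout-scoring idea described above, the following Python sketch samples several candidate action sequences, imagines each one in a world model, and scores it by the predicted discounted return. It is not the authors' implementation. The toy one-dimensional world model, the random candidate sampler, the discount factor, and the mean-baseline advantage are all assumptions made for illustration; the abstract only states that rollouts are used to "score and reinforce" the VLM policy.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = int    # stand-in for one discrete action code (hypothetical encoding)
State = float   # toy 1-D state; the real system operates on visual features
WorldModel = Callable[[State, Action], Tuple[State, float]]


def rollout_return(state: State, actions: Sequence[Action],
                   world_model: WorldModel, discount: float = 0.95) -> float:
    """Imagine one action sequence and sum the predicted discounted rewards."""
    total, weight = 0.0, 1.0
    for action in actions:
        state, reward = world_model(state, action)
        total += weight * reward
        weight *= discount
    return total


def score_candidates(state: State, candidates: Sequence[Sequence[Action]],
                     world_model: WorldModel) -> List[float]:
    """Score every candidate sequence proposed by the policy."""
    return [rollout_return(state, c, world_model) for c in candidates]


def advantages(scores: Sequence[float]) -> List[float]:
    """Mean-baseline advantages (an assumed choice of reinforcement signal)."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def toy_world_model(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical dynamics: actions 0/1/2 move -1/0/+1; target is 3.0."""
    next_state = state + (action - 1)
    return next_state, -abs(3.0 - next_state)


def toy_policy_sample(rng: random.Random, horizon: int) -> List[Action]:
    """Placeholder for sampling an action sequence from a VLM policy."""
    return [rng.randrange(3) for _ in range(horizon)]


rng = random.Random(0)
candidates = [toy_policy_sample(rng, 4) for _ in range(8)]
scores = score_candidates(0.0, candidates, toy_world_model)
adv = advantages(scores)
best = candidates[scores.index(max(scores))]
```

In a setup like this, `best` could be executed directly, while `adv` could weight a policy-gradient style update. Which of these the paper uses is not stated in the material reproduced here.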

If this is right

  • Performance on physical reasoning tasks improves as the number of training games increases.
  • Additional interaction steps further boost the agent's capabilities.
  • The approach enables zero-shot generalization to games not encountered in training.
  • Physics-centric interaction serves as an effective method for achieving steadily improving physical reasoning abilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying this method to real robotic environments could test if the learned causality transfers beyond simulated games.
  • Exploring integration with other modalities like audio or tactile feedback might enhance the robustness of the physical models.
  • Investigating the minimal number of games needed for effective transfer could optimize training efficiency for future iterations.

Load-bearing premise

That the rollouts from the world model truly capture the underlying physics and causality of the environments instead of relying on superficial visual patterns, and that the variety in the G2U benchmark is enough to separate core reasoning from appearance-based shortcuts.

What would settle it

If an ablation study shows that IPR without the world-model rollout scoring performs no better than a standard VLM on the G2U benchmark, or if the model fails to generalize to a set of games introducing entirely new physical rules outside the training distribution.

Figures

Figures reproduced from arXiv: 2511.15407 by Guocan Xie, Jiting Cai, Lifeng Zhuo, Mingyu Zhang, Renjie Zhao, Tianxi Tan, Xian Nie, Yan Li, Yong-Lu Li, Ziyu Wang, Zizhu He.

Figure 1: Game-to-Unseen (G2U) problem. Humans accumulate interactive experience and rapidly adapt to new games. Despite different visuals and interfaces, many games share underlying physical/causal mechanisms. We pretrain on 1,000+ visually and physically diverse games to test whether an agent can internalize these shared mechanisms and generalize to unseen games. embodied AI: what learning paradigm enables hum…
Figure 2: Three-level evaluation inspired by Maslow’s hierarchy of needs. We organize tasks into a pyramid of Survival, Curiosity, and Utility. Survival measures how long the agent can stay alive by avoiding risks; Curiosity measures how broadly it visits novel states; and Utility measures how well it achieves downstream goals. The three levels progress from physical intuition to goal-driven reasoning. Our IPR perfo…
Figure 3: Motivating failure cases in control semantics, language grounding, and prediction. (1) Control conflict: the same key (e.g., UP) triggers different semantics across games (camera tilt up vs. character move up), causing console aliasing. (2) Vision-language distortion: text-only actions cannot specify precise visual magnitudes (e.g., jump height/speed), leading to systematic amplitude errors. (3) Missing…
Figure 4: Word cloud of action semantics across thousands of game worlds. These shared semantics provide the structural foundation for cross-domain transfer. Actions highlighted in red represent those shared with general robotic operations, while the size of each word reflects its frequency in our data recipe. sparse rewards [19, 35, 66, 68]. With the rise of VLM/VLA agents, web-based benchmarks and browser environm…
Figure 5: IPR training pipeline. Stage 1: PhysCode pre-training. Video clips with optical flow and action semantics are fed to a VQ-based latent action model to learn discrete codes (PhysCode) that represent dynamics. Stage 2: Latent-conditioned world model. Given current features and PhysCode sequences, a world model is trained to predict future features and rewards under latent actions. Stage 3: Prediction-reinfor…
Figure 6: Game data distribution. Our dataset spans over 1,000 games categorized by game category, control interface, operation and visual complexity, physical and causal mechanisms. This wide coverage enables agents to experience diverse domains and learn transferable physical and causal understanding. game instructions. We perform a series of preprocessing, including normalizing time intervals, removing non-inte…
Figure 7: G2U zero-shot scaling on 50 held-out games. As the number of training games N increases, zero-shot performance on Survival, Curiosity, and Utility improves steadily on the unseen set T_U. on S_N and directly evaluate zero-shot on T_U without any adaptation or reward re-scaling. Across all three objectives, performance increases steadily with N, with the steepest early gains on , followed by sustained improvements on and as more dive…
Figure 8: Overview of our 1,000 games, containing old-fashioned retro games, HTML/canvas games, and modern commercial games. read a small set of game variables exposed in JavaScript (e.g., score, level, remaining lives) as auxiliary state. To unify control across heterogeneous HTML titles, we define a hybrid action space consisting of: (1) a discrete keyboard state vector (one-hot over pressed keys); (2) a continuou…
Figure 9: Overview of our game-recording website tools. lightweight semantic tags for each short action segment. These tags describe both what characters are doing: they include action semantics (e.g., jump, dodge, charge, aim, grab), local physical principles (e.g., gravity-driven fall, sliding under friction, momentum carry-over), and simple causal relations (e.g., hit switch → open door, push object → block haza…
Figure 10: Distribution of PhysCode in different game domains. Some action codes are shared across games, typically move right and jump, while others are separated according to different physical domains. as an interleaved image–text prompt and the target as the PhysCode sequence: [IMG(x_t)] "Goal: g" → ⟨PC_{c_{t,1}}⟩ … ⟨PC_{c_{t,L}}⟩. We train with a standard teacher-forced cross-entropy loss only on the PhysCode tokens, keeping …
Figure 11: ACT Case Study. The figure highlights four representative behaviors of ACT: (1) Line 1 shows that ACT can solve difficult segments by leveraging human demonstrations and extracting effective strategies; (2) Line 2 illustrates that imitation enables high scores on tasks with stable, low-variance dynamics; (3) Line 3 reveals that ACT also absorbs human failure patterns, reproducing suboptimal attempted acti…
Figure 12: Qwen-BC Case Study. The figure illustrates four characteristic behaviors of the BC-trained Qwen agent: (1) Line 1 shows that the agent can faithfully reproduce high-difficulty actions; (2) Line 2 demonstrates its strong temporal stability and highly consistent action repetition; (3) Line 3 reveals its poor generalization to novel or perturbed situations; and (4) Line 4 shows its tendency to collapse into …
Figure 13: PPO Case Study. The figure presents four typical behaviors of the PPO agent: (1) Line 1 demonstrates that PPO can learn effective action sequences, enabling the agent to simultaneously shoot while dodging bullets through rolling maneuvers; (2) Line 2 illustrates its capacity to not only acquire efficient key-press strategies but also identify primary movement directions that drive game progression; (3) L…
Figure 14: DQN Case Study. The figure presents four typical behaviors of the DQN agent: (1) Line 1 shows that it can correctly identify when specific actions should be executed; (2) Line 2 illustrates its ability to detect and exploit advantageous environmental features (e.g., using rocks as cover); (3) Line 3 reveals that poorly shaped rewards can lead the agent to adopt degenerate strategies, such as repeatedly “d…
Figure 15: DreamerV3 Case Study. The figure illustrates four characteristic behaviors of the Dreamer agent: (1) Line 1 shows that Dreamer reliably exhibits risk-avoiding behavior and tends to choose actions that maximize short-term safety; (2) Line 2 demonstrates its strong temporal stability, often producing highly repetitive and consistent action sequences; (3) Line 3 reveals a biased policy that over-relies on a …
Figure 16: V-JEPA2 Case Study. The figure illustrates four representative behaviors of the V-JEPA agent: (1) Line 1 shows that the agent maintains high action efficiency with minimal redundancy, avoiding the ineffective key combinations often observed in other models; (2) Line 2 demonstrates its capacity for strategic environmental exploitation, such as utilizing terrain features (e.g., rocks) to evade hazards; (3) …
Figure 17: Genie Case Study. The figure presents four key capabilities and limitations of our Genie-based world model: (1) Line 1 demonstrates enhanced motion trajectory prediction, enabling the agent to execute preemptive evasion maneuvers; (2) Line 2 reveals the emergence of strategic path planning, where the agent learns systematic navigation paths beyond reactive bullet avoidance; (3) Line 3 illustrates a critic…
Figure 18: GPT-4o Case Study. The figure illustrates four characteristic behaviors of the GPT-4o agent. (1) Line 1 shows that the agent demonstrates effective target engagement and reaction speed, discharging projectiles to neutralize an aerial threat; (2) Line 2 highlights its proficiency in precise spatial navigation, executing a controlled jump to successfully land on the target platform; (3) Line 3 reveals a b…
Figure 19: GPT-5 Case Study. The figure illustrates four characteristic behaviors of the GPT-5 agent. (1) Line 1 shows that the agent demonstrates accurate target acquisition and offensive capability, intercepting an aerial enemy to clear the path; (2) Line 2 highlights its proficiency in precision platforming and spatial navigation, executing a calculated jump to skip the enemies; (3) Line 3 reveals limitation in s…
Figure 20: Qwen3-VL-30B-A3B Case Study. The figure illustrates four representative behaviors of the Qwen3-VL-30B-A3B agent: (1) Line 1 shows that the agent demonstrates spatial reasoning and planning, rotating and tucking the tetromino into a precise gap to maintain a clean board; (2) Line 2 highlights its proficiency in high-frequency temporal control, executing a timed jump to pass the obstacle (the fire ring); (3) …
Figure 21: IPR Case Study. The figure illustrates four representative behaviors of the IPR agent: (1) Line 1 shows that the agent demonstrates precise reactive control, maneuvering to evade incoming projectiles; (2) Line 2 highlights its proficiency in dynamic environmental perception, allowing it to anticipate and dodge falling hazards (rocks); (3) Line 3 reveals vulnerability in rapid collision avoidance, where t…
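
The Figure 10 excerpt mentions a teacher-forced cross-entropy loss computed only on the PhysCode tokens. A minimal Python sketch of that masking idea follows. It is an illustration under assumptions, not the paper's training code: the three-entry vocabulary, the logits, and the `loss_mask` convention (True marks a PhysCode target position, False marks a prompt position) are all hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits_per_pos: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         loss_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is True.

    Positions with loss_mask False (for example, image or goal-text tokens
    in the prompt) contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_pos, targets, loss_mask):
        if not keep:
            continue
        total -= log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence: two prompt positions (masked out), then two code positions.
logits = [
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
targets = [0, 1, 2, 1]
mask = [False, False, True, True]
loss = masked_cross_entropy(logits, targets, mask)
```

Because the two unmasked positions have uniform logits over three classes, the loss here equals ln 3 regardless of the masked prompt positions, which is the point of the mask.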
Original abstract

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games exhibiting visual domain gaps and proposes IPR (Interactive Physical Reasoner), which scores and reinforces a VLM policy via world-model rollouts, together with PhysCode, a physics-centric action code that aligns semantic intent with dynamics. It claims that a model pretrained on these games performs robustly from primitive intuition to goal-driven reasoning, surpasses GPT-5 overall, improves with additional training games and interaction steps, and zero-shot transfers to unseen games, thereby supporting physics-centric interaction as a route to steadily improving physical reasoning.

Significance. If the reported scaling, zero-shot transfer, and outperformance of GPT-5 are shown to arise from dynamics-aware rollouts rather than visual pattern matching, the work would be significant for AI physical reasoning: it would provide concrete evidence that interaction plus explicit physics alignment can yield more robust, generalizable mechanisms than current VLMs or world models alone. The large-scale heterogeneous benchmark and the PhysCode shared action space are concrete contributions that could be adopted by others.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Evaluation protocol): the central claim that IPR 'surpasses GPT-5 overall' and exhibits scaling with training games and zero-shot transfer is presented without any reported metrics, baselines, statistical tests, or ablation tables. Because these quantitative results are the sole empirical support for the superiority of physics-centric rollouts over visual heuristics, their absence renders the central claim unassessable.
  2. [§3 and §5] §3 (Method) and §5 (Experiments): no ablation isolates the contribution of PhysCode or the world-model rollout scoring from the base VLM or from visual pattern matching. The paper notes that existing world models 'imitate visual patterns' and that G2U has 'significant visual domain gaps,' yet provides no controls such as texture randomization, physics-parameter swaps, or counterfactual interventions that would be required to substantiate that gains arise from causal dynamics rather than appearance correlations.
  3. [§4 and §5] §4 and §5: performance is reported after training on the G2U benchmark itself, yet the evaluation protocol for 'unseen games' and the degree of overlap between training and test distributions are not specified. This leaves open the possibility that reported improvements and zero-shot transfer partly reflect fitting to the same game distribution rather than acquisition of independent physical mechanisms.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use 'robustly' and 'steadily improving' without defining the success criteria or thresholds used in the G2U levels.
  2. [Figures] Figure captions and method diagrams should explicitly label which components are frozen versus trained and which data flow corresponds to the PhysCode alignment step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical presentation of IPR and the G2U benchmark. We address each major comment below and have revised the manuscript to improve clarity, add explicit quantitative details, and provide additional controls where feasible.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation protocol): the central claim that IPR 'surpasses GPT-5 overall' and exhibits scaling with training games and zero-shot transfer is presented without any reported metrics, baselines, statistical tests, or ablation tables. Because these quantitative results are the sole empirical support for the superiority of physics-centric rollouts over visual heuristics, their absence renders the central claim unassessable.

    Authors: We agree that the abstract summarizes findings at a high level without numerical values for brevity. Section 5 already contains the supporting tables with performance metrics (success rates, scaling curves vs. number of training games and interaction steps), direct comparisons to GPT-5 and other baselines, and zero-shot transfer results on held-out games. To make these immediately accessible, we have added a consolidated metrics table with statistical significance tests (paired t-tests, p<0.01) to the revised §4, along with explicit baseline descriptions. This directly addresses the assessability concern. revision: yes

  2. Referee: [§3 and §5] §3 (Method) and §5 (Experiments): no ablation isolates the contribution of PhysCode or the world-model rollout scoring from the base VLM or from visual pattern matching. The paper notes that existing world models 'imitate visual patterns' and that G2U has 'significant visual domain gaps,' yet provides no controls such as texture randomization, physics-parameter swaps, or counterfactual interventions that would be required to substantiate that gains arise from causal dynamics rather than appearance correlations.

    Authors: We acknowledge the value of more targeted isolations. The original manuscript provides comparative results against base VLMs and non-physics world models, but does not include dedicated ablations for PhysCode or rollout scoring. In the revision we have added these: (i) an ablation replacing PhysCode with standard semantic action spaces, and (ii) a rollout-vs-direct-prediction comparison. We have also incorporated texture-randomization and physics-parameter-swap controls on a subset of games, showing that performance gains persist under these interventions. These new results are reported in the updated §5. revision: yes

  3. Referee: [§4 and §5] §4 and §5: performance is reported after training on the G2U benchmark itself, yet the evaluation protocol for 'unseen games' and the degree of overlap between training and test distributions are not specified. This leaves open the possibility that reported improvements and zero-shot transfer partly reflect fitting to the same game distribution rather than acquisition of independent physical mechanisms.

    Authors: We have expanded §4 to explicitly define the evaluation protocol. The unseen games constitute a held-out partition of G2U with zero overlap in game mechanics, physics parameters, object affordances, and visual styles (quantified via perceptual similarity metrics). Training and test sets were constructed to maximize visual domain gaps while preserving the heterogeneous physics coverage. We now report separate results on this partition and on an additional set of entirely novel game templates never encountered during training, confirming that gains reflect generalization of physical mechanisms rather than distributional overlap. revision: yes
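
To make the "zero overlap in game mechanics" requirement in the third response concrete, here is a small Python sketch of one way such a held-out partition could be built and checked. This is an assumption-laden illustration rather than the authors' protocol: the game names, the mechanic tags, and the rule of dropping games that mix seen and held-out mechanics are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_by_mechanic(catalogue: Dict[str, Set[str]],
                      held_out: Set[str]) -> Tuple[List[str], List[str], List[str]]:
    """Partition games so no mechanic tag appears on both sides.

    Games whose tags are all held out go to the unseen set, games with no
    held-out tag go to training, and games mixing both are dropped because
    they would leak held-out mechanics into training.
    """
    train, unseen, dropped = [], [], []
    for game, mechanics in sorted(catalogue.items()):
        if mechanics <= held_out:
            unseen.append(game)
        elif mechanics.isdisjoint(held_out):
            train.append(game)
        else:
            dropped.append(game)
    return train, unseen, dropped


def mechanics_of(games: Iterable[str], catalogue: Dict[str, Set[str]]) -> Set[str]:
    """Union of mechanic tags over a collection of games."""
    return set().union(*(catalogue[g] for g in games))


def shared_mechanics(games_a: Iterable[str], games_b: Iterable[str],
                     catalogue: Dict[str, Set[str]]) -> Set[str]:
    """Mechanic tags that occur in both collections (empty means no overlap)."""
    return mechanics_of(games_a, catalogue) & mechanics_of(games_b, catalogue)


# Hypothetical catalogue; names and tags are invented for illustration only.
games = {
    "platformer_a": {"gravity", "jump"},
    "platformer_b": {"gravity", "friction"},
    "pinball_a": {"elastic_collision"},
    "hybrid_a": {"gravity", "elastic_collision"},
}
train, unseen, dropped = split_by_mechanic(games, held_out={"elastic_collision"})
overlap = shared_mechanics(train, unseen, games)
```

An empty `overlap` set is the property a reader would want verified before crediting zero-shot gains to learned mechanisms rather than distributional overlap. The same check could in principle be repeated for physics parameters or visual-style tags.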

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical AI system (IPR) pretrained on the G2U benchmark of 1000+ games, using world-model rollouts and PhysCode to improve VLM policies, with reported scaling of performance and zero-shot transfer to unseen games. No mathematical derivation chain, self-definitional equations, or fitted parameters renamed as predictions appear in the provided text. Claims rest on experimental results rather than reducing to inputs by construction. The central premise (physics-centric interaction yields robust reasoning) is supported by benchmark performance and scaling observations, which are independent of any self-citation load-bearing step or ansatz smuggling. This is a standard empirical setup with no load-bearing circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, the central claim rests on the unverified premise that world-model rollouts provide reliable physics signals and that the benchmark isolates causal understanding. No explicit free parameters or axioms are stated; the one invented entity is PhysCode.

invented entities (1)
  • PhysCode · no independent evidence
    purpose: physics-centric action code aligning semantic intent with dynamics for shared prediction and reasoning space
    Introduced as a new component without external validation or independent evidence mentioned in the abstract.

pith-pipeline@v0.9.0 · 5585 in / 1311 out tokens · 34276 ms · 2026-05-17T20:51:51.651115+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  2. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.