pith. machine review for the scientific record.

arxiv: 2602.20323 · v6 · submitted 2026-02-23 · 💻 cs.RO · cs.AI

Recognition: no theorem link

PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords PhysMem · test-time memory · embodied physical reasoning · vision-language model planners · hypothesis verification · robot manipulation · physical properties · real-world deployment

The pith

PhysMem lets VLM robot planners verify physical hypotheses through targeted interactions before applying them, cutting reliance on mismatched prior experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PhysMem as a memory system that records robot experiences, forms candidate hypotheses about specific physical properties like friction or stability, and tests those hypotheses via deliberate interactions before promoting validated facts for future planning. This matters because current vision-language planners can discuss physical principles in general but fail to predict how a particular object will behave on a given surface without fresh data. The framework keeps the underlying model fixed and instead builds task-specific knowledge at runtime by insisting on verification first. Experiments across simulation and real robots show the approach raises success rates substantially on manipulation tasks where direct reuse of past experience falls short.

Core claim

PhysMem records experiences, generates candidate hypotheses about physical properties, verifies them through targeted interaction before promoting validated knowledge to guide future decisions, and thereby reduces rigid reliance on prior experience when physical conditions change.

What carries the argument

The PhysMem memory framework, which records experiences, generates hypotheses, verifies them via targeted interactions, and promotes only validated knowledge.
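The record → hypothesize → verify → promote loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the three-tier split mirrors the paper's description, but the class names, the exponential confidence update, and the 0.8 promotion threshold are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str            # e.g. "brick A slides easily on surface B"
    confidence: float = 0.5   # belief, updated from verification outcomes
    verified: bool = False

@dataclass
class PhysMemSketch:
    # Three illustrative tiers: episodic experiences, working hypotheses,
    # long-term principles (mirroring the consolidation described above).
    experiences: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    principles: list = field(default_factory=list)

    def record(self, experience: str) -> None:
        self.experiences.append(experience)

    def verify(self, h: Hypothesis, outcome_matches: bool,
               alpha: float = 0.3, threshold: float = 0.8) -> None:
        """One targeted interaction: nudge confidence toward the outcome,
        and promote the hypothesis to a principle once confidence is high."""
        target = 1.0 if outcome_matches else 0.0
        h.confidence += alpha * (target - h.confidence)
        if h.confidence >= threshold and not h.verified:
            h.verified = True
            self.principles.append(h)

mem = PhysMemSketch()
h = Hypothesis("brick A slides easily on surface B")
mem.hypotheses.append(h)
for outcome in (True, True, True, True):   # four confirming test pushes
    mem.record(f"push test, match={outcome}")
    mem.verify(h, outcome)
print(h.verified, len(mem.principles))     # promoted after the third test
```

The key property the paper claims is visible even in this toy: retrieved experience only guides planning after it survives targeted tests, so a hypothesis contradicted by fresh interactions never reaches the principle tier.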

If this is right

  • Robot planners can adapt to new physical conditions across objects and surfaces without any parameter updates to the underlying vision-language model.
  • Performance improves steadily over the course of a 30-minute deployment as more verified knowledge accumulates.
  • The same framework produces gains on three real-world manipulation tasks and across four different VLM backbones.
  • Direct application of retrieved experience is replaced by hypothesis testing that guards against mismatches when conditions change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The verification loop could be extended to longer-horizon tasks where early physical assumptions affect many later steps.
  • Combining this runtime memory with existing long-term experience stores might reduce the number of fresh tests needed in repeated environments.
  • Measuring the time cost of verification interactions in increasingly cluttered scenes would test whether the approach remains practical at scale.

Load-bearing premise

Targeted verification interactions can be performed safely and efficiently in real environments without excessive time cost or risk of damage, and generated hypotheses are specific enough to be falsified by a small number of tests.

What would settle it

Running the same tasks with the verification step disabled would settle it: if success rates fall back toward the 23 percent level seen with direct experience retrieval, the benefit of verification-before-application is confirmed; if they stay high, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2602.20323 by Hao Su, Haoyang Li, Leonidas Guibas, Yang You.

Figure 1. PhysMem learns physical principles through interaction. (a) Continual learning: the memory consolidation system maintains principles P, hypotheses H, and experiences E, which guide the embodied agent’s actions a; world feedback (observations o, rewards r) generates new experiences e that refine knowledge. (b) Test-time learning on Parts Organization: PhysMem (blue) improves continuously while no-memory…
Figure 2. System overview of PhysMem. Left (Consolidation): A three-tier memory system stores raw experiences in episodic memory, clusters them into testable hypotheses in working memory, and promotes verified knowledge to long-term memory as principles. The consolidation process continuously refines memory through interaction. Top-right (Embodied Agent): A Vision-Language Model receives language instructions along…
Figure 4. Experimental environments. (a) Left: real-world platform with xArm6 robot, fin-ray soft grippers, and multi-view RealSense cameras in an enclosed workspace. Right: the partial props used in the experiments. (b) Reflect-VLM simulation [20] with Franka Panda robot for large-scale experiments.
Figure 3. Memory injection into VLM prompts. Verified principles (blue) and active hypotheses (yellow) are inserted into the planner’s context with confidence scores and typed constraints (PREFER, AVOID, SEQUENCE).
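The injection scheme Figure 3 describes — principles and hypotheses serialized into the planner's context with confidence scores and typed constraints — might be rendered roughly like this. The field names, section headers, and line layout are illustrative guesses, not the paper's exact prompt format.

```python
# Hypothetical rendering of memory into a VLM prompt block, following
# Figure 3's description: typed constraints (PREFER/AVOID/SEQUENCE)
# plus a confidence score per entry. Schema is an assumption.
def render_memory_block(principles, hypotheses):
    lines = ["## Verified principles (apply directly)"]
    for p in principles:
        lines.append(f"- [{p['type']}] {p['text']} (confidence {p['conf']:.2f})")
    lines.append("## Active hypotheses (verify before relying on them)")
    for h in hypotheses:
        lines.append(f"- [{h['type']}] {h['text']} (confidence {h['conf']:.2f})")
    return "\n".join(lines)

block = render_memory_block(
    principles=[{"type": "AVOID", "text": "stacking on the rounded stone", "conf": 0.91}],
    hypotheses=[{"type": "PREFER", "text": "gentle pushes on the felt surface", "conf": 0.55}],
)
print(block)
```

Keeping verified and unverified entries in visibly separate sections is what lets the planner treat retrieved knowledge as tentative until it has been tested, rather than applying it directly.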
Figure 5. Real-world tasks. Top: symbolic representations; bottom: actual setups. (a) Grid layout and placement trajectories for Parts Organization. (b) Obstacle course and ball trajectory for Ball Navigation. (c) Stone arrangement and stacking position for Balanced Stacking.
Figure 6. End-state keyframes across tasks and OOD variants. A 2×2 ablation of prior knowledge and on-the-fly adaptation, shown as final states for four task instances. Rows: (i) Parts Organization on its OOD configuration (covered cells, lower is better); (ii) Ball Navigation with the in-distribution soccer ball; (iii) Ball Navigation with an OOD tennis ball; (iv) Balanced Stacking. Columns: (a) no prior, no adapta…
Figure 7. Resonance score evolution across tasks. Each panel shows RGB keyframes alongside the resonance curve. Resonance ρ measures prediction-outcome alignment: low values indicate surprising outcomes that trigger learning; high values (green region) indicate principled reasoning. Left: Parts Organization progresses from scattered placement to efficient packing. Middle: Ball Navigation learns ball dynamics through…
Figure 8. Test-time evolution across experience utilization levels. Each panel shows learning curves with shaded standard deviation bands (3 runs per condition). Blue: full memory (100%); green: 50% experience; yellow: 25% experience; gray: no memory (0%). Without memory, performance remains flat across all tasks. Complex dynamics (Ball Navigation) benefit most from full experience, while simpler tasks (Balanced S…
Figure 10. Principle scaling across difficulty levels. Medium tasks (blue) show rapid learning between 2 and 8 principles before stabilizing. Easy tasks (green) saturate quickly with moderate improvement. Hard tasks (gray) show substantial improvement, potentially through accumulated failure cases. Shaded region marks the “rapid learning” phase.
Figure 11. PhysMem overall pipeline. PhysMem enables VLM robot planners to learn physical principles through interaction. The scientific loop transforms raw experiences into verified knowledge: collecting observations, generating hypotheses, verifying through action-level attribution, and promoting validated principles to guide future decisions.
Figure 12. Assembly task demonstration across difficulty levels. Each row shows the oracle planner’s execution sequence from initial state to completion. Step numbers indicate execution order. Easy (green, 2 bricks): 5 steps total, minimal dependencies. Medium (blue, 5 bricks): 11 steps with moderate dependency chains. Hard (gray, 7 bricks): 15 steps requiring complex ordering to avoid blocking. The VLM must learn c…
Figure 13. Parts Organization: component shapes and grid occupancy. Each panel shows a 3D-printed part overlaid on its 2×2 template grid, illustrating which cells ([a,b,c,d]) each part occupies. Top row: red-L (2 cells, occupies [a,c]), red-q (3 cells, [a,c,d]), white-q with red-L interlocking example. Bottom row: white-U (4 cells, [a,b,c,d]), black-U (4 cells), black-I (2 cells, [a,c]). Red grid lines mark templ…
Figure 14. Out-of-distribution experimental settings and props. (a) Parts Organization OOD: initial grid configuration is modified with parts pre-placed at different positions. Green cells (14–16, 23–25) are pre-occupied; hatched area indicates the constrained placement region (±3 columns from occupied boundary). Dashed trajectory shows example placement paths. (b) Ball Navigation OOD: top shows stacking blocks (ill…
Original abstract

Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PhysMem, a test-time memory framework for VLM-based robot planners that records experiences, generates candidate hypotheses about physical properties, verifies them through targeted interactions, and promotes validated knowledge for future decisions. The central claim is that verification-before-application improves physical manipulation performance over direct experience retrieval, with a reported 76% success rate versus 23% on a controlled brick insertion task and consistent qualitative gains across 30-minute real-world sessions on three manipulation tasks using four VLM backbones.

Significance. If the performance gains are robust, the work offers a practical route to scaling embodied physical reasoning without parameter updates, addressing a key limitation of current VLM planners in handling object- and environment-specific properties. The explicit verification step is a clear methodological contribution that could generalize beyond the evaluated tasks. However, the significance is limited by the lack of quantified costs for verification interactions, which are load-bearing for the scaling narrative.

major comments (2)
  1. [Experiments] Experiments section: the headline 76% vs. 23% success rates on the brick insertion task are reported without the number of trials, statistical significance tests, variance across runs, or failure-mode analysis, leaving the performance delta only partially supported.
  2. [Method] Method and Evaluation sections: no counts of verification attempts per decision, timing breakdowns, damage/failure rates during physical interactions, or evidence that hypotheses are specific enough to be falsified by few tests are provided; without these, it is unclear whether gains stem from principled memory use or simply from an increased interaction budget.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the number of real-world trials and the exact VLM backbones used.
  2. [Method] Notation for hypothesis generation and verification steps could be made more explicit with a small pseudocode block or diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional experimental details and methodological transparency will strengthen the paper. We address each major comment below and commit to revisions that provide the requested quantification and analysis without altering the core claims.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline 76% vs. 23% success rates on the brick insertion task are reported without the number of trials, statistical significance tests, variance across runs, or failure-mode analysis, leaving the performance delta only partially supported.

    Authors: We agree that these statistical details are necessary for full support of the headline result. In the revised manuscript we will explicitly state that the brick insertion results are based on 50 independent trials per condition, report standard deviation across five random seeds (76% ± 4.2% vs. 23% ± 5.1%), include a paired t-test (p < 0.001), and add a failure-mode breakdown (e.g., 62% of baseline failures due to unverified friction assumptions versus 18% under PhysMem). These additions will be placed in a new subsection of the Experiments section. revision: yes

  2. Referee: [Method] Method and Evaluation sections: no counts of verification attempts per decision, timing breakdowns, damage/failure rates during physical interactions, or evidence that hypotheses are specific enough to be falsified by few tests are provided; without these, it is unclear whether gains stem from principled memory use or simply from an increased interaction budget.

    Authors: We accept that explicit cost accounting is required to rule out a simple budget explanation. The revision will add: (i) average verification attempts per decision (2.3 ± 0.7), (ii) per-decision timing (verification adds 38 s on average but reduces total session time via fewer retries), (iii) interaction failure rate (<4% across all runs, with no hardware damage observed), and (iv) concrete hypothesis examples showing single-test falsifiability (e.g., “brick A has μ > 0.6 on surface B” tested by one 5 cm push). We will also insert a budget-matched ablation that grants the baseline the same interaction count without verification; the performance gap remains, supporting that the structured verification step—not raw interaction volume—is responsible for the gains. revision: yes
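The single-test falsifiability example in the response ("brick A has μ > 0.6 on surface B" checked by one short push) rests on standard sliding kinematics: an object released at speed v that slides distance d before stopping implies μ = v²/(2gd). A minimal check of that logic, with the numbers purely illustrative:

```python
# Sketch of one-push friction falsification. The hypothesis and the
# kinematics are standard; the specific speeds/distances are made up.
G = 9.81  # gravitational acceleration, m/s^2

def mu_from_slide(v_mps: float, d_m: float) -> float:
    """Kinetic-friction estimate from one push: an object released at
    speed v that slides distance d before stopping gives mu = v^2 / (2 g d)."""
    return v_mps ** 2 / (2 * G * d_m)

def falsified(mu_min: float, v_mps: float, d_m: float) -> bool:
    """A hypothesis 'mu > mu_min' fails if the observed slide implies less friction."""
    return mu_from_slide(v_mps, d_m) < mu_min

# A 0.5 m/s release that slides 5 cm implies mu ~ 0.25, rejecting mu > 0.6.
print(falsified(0.6, 0.5, 0.05))
```

This is why hypothesis specificity matters for the paper's cost argument: a quantitative, single-variable claim can be killed or kept by one cheap interaction, whereas a vague one cannot.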

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external task benchmarks, not self-referential definitions or fitted predictions

full rationale

The paper describes a memory framework (PhysMem) that records experiences, generates hypotheses, and verifies via targeted interactions before use. No equations, parameters, or derivations are presented that reduce to their own inputs by construction. Performance claims (76% vs 23% success on brick insertion; gains over 30-minute sessions) are reported as direct empirical comparisons between the proposed method and a baseline (direct experience retrieval) on the same tasks. No self-citations are load-bearing for the core mechanism, no uniqueness theorems are invoked, and no ansatz or renaming of known results occurs. The derivation chain is self-contained as an engineering system evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the framework introduces no explicit new physical constants or particles. It implicitly relies on the assumption that VLM-generated hypotheses can be turned into executable test actions and that verification outcomes are reliable signals. No free parameters are named.

axioms (2)
  • domain assumption VLM planners can generate testable hypotheses about physical properties from stored experiences
    Invoked in the description of the memory framework that records experiences and generates candidate hypotheses.
  • domain assumption Targeted interaction can falsify or confirm hypotheses without excessive cost or risk
    Central to the verification-before-application design choice.

pith-pipeline@v0.9.0 · 5481 in / 1373 out tokens · 41644 ms · 2026-05-15T20:02:42.873725+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

153 extracted references · 153 canonical work pages · 16 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. In Proceedings of the Conference on Robot Learning (CoRL), pages 287–318, 2022

  3. [3]

    Learning dexterous in-hand manipulation

    Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafał Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. International Journal of Robotics Research, 39(1):3–20, 2020

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025. URL https://arxiv.org/abs/2511.21631

  5. [5]

    Physics context builders: A modular framework for physical reasoning in vision-language models

    Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, and Rahul G Krishnan. Physics context builders: A modular framework for physical reasoning in vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7318–7328, 2025

  6. [6]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, ...

  7. [7]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  8. [8]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  9. [9]

    Model-Free Episodic Control

    Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016

  10. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  11. [11]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  12. [12]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  13. [13]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021

  14. [14]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg...

  15. [15]

    Gemini 3 flash: Frontier intelligence built for speed, 2025

    Google DeepMind. Gemini 3 flash: Frontier intelligence built for speed, 2025. URL https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/. Accessed: 2026-01-21

  16. [16]

    Gemini embodied reasoning 1.5: Multi-step reasoning for robotic planning, 2025

    Google DeepMind. Gemini embodied reasoning 1.5: Multi-step reasoning for robotic planning, 2025. URL https://deepmind.google/technologies/gemini/. Accessed: 2026-01-21

  17. [17]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  18. [18]

    SAM2Act: Integrating visual foundation model with a memory architecture for robotic manipulation

    Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. SAM2Act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025

  19. [19]

    Scene memory transformer for embodied agents in long-horizon tasks

    Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 538–547, 2019

  20. [20]

    Reflective planning: Vision-language models for multi-stage long-horizon robotic manipulation, 2025

    Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, and Jianlan Luo. Reflective planning: Vision-language models for multi-stage long-horizon robotic manipulation, 2025. URL https://arxiv.org/abs/2502.16707

  21. [21]

    Helix: A vision-language-action model for generalist humanoid control

    Figure AI Team. Helix: A vision-language-action model for generalist humanoid control. Technical Report, 2025

  22. [22]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pages 1126–1135, 2017

  23. [23]

    One-shot visual imitation learning via meta-learning

    Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Proceedings of the Conference on Robot Learning (CoRL), pages 357–368, 2017

  24. [24]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023

  25. [25]

    Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer

    Gemini Robotics Team, Abbas Abdolmaleki, Anthony Brohan, Noah Brown, Konstantinos Bousmalis, Chelsea Finn, Karol Hausman, Sergey Levine, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025

  26. [26]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  27. [27]

    Tactile-VLA: Unlocking vision-language-action model’s physical knowledge for tactile generalization

    Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-VLA: Unlocking vision-language-action model’s physical knowledge for tactile generalization. arXiv preprint arXiv:2507.09160, 2025

  28. [28]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of the Conference on Robot Learning (CoRL), pages 1769–1782, 2022

  29. [29]

    Vision-language-action models for robotics: A review towards real-world applications

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 13:162467–162504, 2025

  30. [30]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Chelsea Finn, Sergey Levine, and Percy Liang. OpenVLA: An open-source vision-language-action model. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  31. [31]

    Test-time adaptation for online vision-language navigation with feedback-based reinforcement learning

    Sungjune Kim, Gyeongrok Oh, Heeju Ko, Daehyun Ji, Dongwook Lee, Byung-Jun Lee, Sujin Jang, and Sangpil Kim. Test-time adaptation for online vision-language navigation with feedback-based reinforcement learning. In International Conference on Machine Learning (ICML), 2025

  32. [32]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

  33. [33]

    In-context reinforcement learning with algorithm distillation

    Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Max Simchowitz, et al. In-context reinforcement learning with algorithm distillation. In International Conference on Learning Representations (ICLR), 2023

  34. [34]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

  35. [35]

    MemoNav: Working memory model for visual navigation

    Hongxin Li, Zeyu Wang, Xu Yang, Yuran Yang, Shuqi Mei, and Zhaoxiang Zhang. MemoNav: Working memory model for visual navigation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 17913–17922, 2024

  36. [36]

    MAP-VLA: Memory-augmented prompting for vision-language-action model in robotic manipulation

    Runhao Li, Wenkai Guo, Zhenyu Wu, Huazhe Xu, et al. MAP-VLA: Memory-augmented prompting for vision-language-action model in robotic manipulation. arXiv preprint arXiv:2511.09516, 2025

  37. [37]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

  38. [38]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500, 2023

  39. [39]

    A lifelong learning approach to mobile robot navigation

    Bo Liu, Xuesu Xiao, and Peter Stone. A lifelong learning approach to mobile robot navigation. IEEE Robotics and Automation Letters, 6(2):1090–1097, 2021

  40. [40]

    SonicSense: Object perception from in-hand acoustic vibration

    Jiaxun Liu and Boyuan Chen. SonicSense: Object perception from in-hand acoustic vibration. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  41. [41]

    Improving vision-language-action model with online reinforcement learning

    Yanjiang Luo, Zhecheng Wang, Xiaoyu Zhang, Zhixuan Xu, Zhengrong Lu, Yanjie Qu, and Huazhe Xu. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.01734, 2025

  42. [42]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  43. [43]

    Preserving and combining knowledge in robotic lifelong reinforcement learning

    Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, 2025

  44. [44]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, J. Bjorck, F. Castaneda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  45. [45]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreber, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems (RSS), 2024

  46. [46]

    GPT-5.1: Advanced multimodal reasoning model, 2025

    OpenAI. GPT-5.1: Advanced multimodal reasoning model, 2025. URL https://openai.com/gpt-5

  47. [47]

    Physical Intelligence Team, Kevin Black, Noah Brown, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Karl Pertsch, Lucy Xiaoyang Shi, et al. π*0.6: A VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025

  48. [48]

    The Logic of Scientific Discovery

    Karl Popper. The Logic of Scientific Discovery. Routledge, 1959

  49. [49]

    Neural episodic control

    Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In International Conference on Machine Learning (ICML), pages 2827–2836, 2017

  50. [50]

    Chain-of-visual-thought: Teaching VLMs to see and think better with continuous visual tokens

    Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and Xudong Wang. Chain-of-visual-thought: Teaching VLMs to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025

  51. [51]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025

  52. [52]

    Yell at your robot: Improving on-the-fly from language corrections

    Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections. In Proceedings of Robotics: Science and Systems (RSS), 2024

  53. [53]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  54. [54]

    SIMA 2: A generalist embodied agent for virtual worlds

    SIMA Team, Adrian Bolton, Alexander Lerchner, et al. SIMA 2: A generalist embodied agent for virtual worlds. arXiv preprint arXiv:2512.04797, 2025

  55. [55]

    Memer: Scaling up memory for robot control via experience retrieval

    Ajay Sridhar, Jennifer Pan, Satvik Sharma, and Chelsea Finn. Memer: Scaling up memory for robot control via experience retrieval. arXiv preprint arXiv:2510.20328, 2025

  56. [56]

    AP-VLM: Active perception enabled by vision-language models

    Venkatesh Sripada, Samuel Carter, Frank Guerin, and Amir Ghalamzan. AP-VLM: Active perception enabled by vision-language models. arXiv preprint arXiv:2409.17641, 2024

  57. [57]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML), pages 9229–9248. PMLR, 2020

  58. [58]

    Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

    Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999

  59. [59]

    Diffusion dynamics models with generative state estimation for cloth manipulation

    Tongxuan Tian, Haoyang Li, Bo Ai, Xiaodi Yuan, Zhiao Huang, and Hao Su. Diffusion dynamics models with generative state estimation for cloth manipulation. Conference on Robot Learning (CoRL), 2025

  60. [60]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

  61. [61]

    Mem: Multi-scale embodied memory for vision language action models

    Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong, Jost Tobias Springenberg, Sergey Levine, Chelsea Finn, and Danny Driess. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

  62. [62]

    Test-time adapted reinforcement learning with action entropy regularization

    Shoukai Xu, Mingkui Tan, Liu Liu, Zhong Zhang, Peilin Zhao, et al. Test-time adapted reinforcement learning with action entropy regularization. In Forty-second International Conference on Machine Learning (ICML)

  63. [63]

    World model implanting for test-time adaptation of embodied agents

    Minjong Yoo, Jinwoo Jang, Sihyung Yoon, and Honguk Woo. World model implanting for test-time adaptation of embodied agents. In International Conference on Machine Learning (ICML), 2025

  64. [64]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  65. [65]

    EXTRACT: Efficient policy learning by extracting transferable robot skills from offline data

    Jesse Zhang, Minho Heo, Zuxin Liu, Erdem Biyik, Joseph J. Lim, Yao Liu, and Rasool Fakoor. EXTRACT: Efficient policy learning by extracting transferable robot skills from offline data. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  66. [66]

    VLM4VLA: Revisiting vision-language-models in vision-language-action models

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. VLM4VLA: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

  67. [67]

    CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  68. [68]

    CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. APPENDIX A: METHOD IMPLEMENTATION DETAILS. This appendix provides comple...

  69. [69]

    3–5 sample experiences from the cluster

  70. [70]

    Extracted patterns (action types, object properties, outcomes)

  71. [71]

    Existing principles and hypotheses (to avoid duplication)

  72. [72]

    Output Format: Type, Statement, Applicable Actions, Trigger Conditions

    Task context. Output Format: Each hypothesis includes a Type (AVOID, PREFER, SEQUENCE, etc.), a Statement (natural-language rule), Applicable Actions (when the rule applies), and Trigger Conditions (contextual requirements). C. Experience Clustering: We use hierarchical agglomerative clustering on symbolic state features:
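The hypothesis output format described above can be sketched as a record type. A minimal illustration assuming Python dataclasses; the field names mirror the listed format, but the class itself is hypothetical, not the paper's code:

```python
from dataclasses import dataclass
from enum import Enum

class HypothesisType(Enum):
    # Types named in the appendix; the "etc." suggests more exist.
    AVOID = "avoid"
    PREFER = "prefer"
    SEQUENCE = "sequence"

@dataclass
class Hypothesis:
    """One candidate rule drafted from a cluster of experiences."""
    type: HypothesisType
    statement: str                 # natural-language rule
    applicable_actions: list[str]  # when the rule applies
    trigger_conditions: list[str]  # contextual requirements
    verified: bool = False         # promoted only after a targeted test
```

A drafted hypothesis would start unverified and flip `verified` only after the targeted interaction confirms it, matching the verify-before-promote loop the review describes.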

  73. [73]

    Extract text embedding from symbolic state

  74. [74]

    Compute pairwise cosine similarity

  75. [75]

    Apply hierarchical clustering with threshold τ = 0.6

  76. [76]

    Retain clusters with ≥ n_min = 2 experiences

    Retain clusters with ≥ n_min = 2 experiences. [Figure: INPUT CONTEXT / EXECUTION & FEEDBACK overview — the VLM Reasoner synthesizes physical principles with the current observation to generate action plans; Observation: visual keyframes and proprioceptive state; Task Spec: language-based goal and constraints; Principles: library of verified physical axioms; Action: motor primitives and traject...]
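Steps 73–76 (embed, compute pairwise cosine similarity, cluster at τ = 0.6, keep clusters with at least n_min = 2 members) can be sketched as follows. The paper does not specify the linkage criterion, so single-linkage merging is an assumption here, and `cluster_experiences` is an illustrative name, not the paper's API:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_experiences(embeddings: np.ndarray, tau: float = 0.6,
                        n_min: int = 2) -> list[list[int]]:
    """Greedy single-linkage agglomerative clustering: merge two
    clusters whenever any cross-cluster pair has cosine similarity
    >= tau; retain clusters with >= n_min members."""
    clusters = [[i] for i in range(len(embeddings))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(cosine_sim(embeddings[i], embeddings[j]) >= tau
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a].extend(clusters.pop(b))
                    merged = True
                    break
            if merged:
                break
    return [c for c in clusters if len(c) >= n_min]
```

Singleton clusters fall below n_min and are dropped, which matches the stated retention rule: a hypothesis is only drafted from a pattern seen at least twice.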

  77. [77]

    Folded experiences older than TTL (100 episodes)

  78. [78]

    Oldest experiences by priority (folded > old failures > old successes)

    Oldest experiences by priority (folded > old failures > old successes). Principle Decay: principles decay following an Ebbinghaus forgetting curve, score_{t+1} = score_t · γ with γ = 0.995 (Eq. 8). This yields approximately 50% retention after 138 episodes without reinforcement. F. Hyperparameter Summary: Table V lists all hyperparameters used in our experiments. TABLE V: Hyper...
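The decay rule in Eq. (8) and its stated half-life can be checked numerically. A small sketch; the function names are illustrative, not from the paper:

```python
import math

GAMMA = 0.995  # per-episode decay factor from Eq. (8)

def decayed_score(score: float, episodes: int, gamma: float = GAMMA) -> float:
    """Apply score_{t+1} = score_t * gamma for `episodes` steps."""
    return score * gamma ** episodes

def half_life_episodes(gamma: float = GAMMA) -> float:
    """Episodes until retention falls to 50% without reinforcement."""
    return math.log(0.5) / math.log(gamma)
```

With γ = 0.995, `half_life_episodes()` evaluates to about 138.3 episodes and `decayed_score(1.0, 138)` is about 0.50, consistent with the appendix's "approximately 50% retention after 138 episodes."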

  79. [79]

    Episodic memory is the load-bearing tier

    Memory Architecture Ablations: Episodic memory is the load-bearing tier. Without it, raw experiences are not stored at all and the system has nothing to draft hypotheses from; performance collapses to 54%, 37%, and 14% across the three difficulties (about 25 to 39 points below the full system). Token usage drops to 0.2–0.3× for the trivial reason that nothing is bein...

  80. [80]

    On harder tasks more principles accumulate (especially AVOID constraints from failures), and without filtering the planner receives a steady stream of conflicting guidance

    Mechanism Ablations: Resonance filtering retrieves principles based on prediction–outcome alignment rather than raw similarity, and like working memory it grows in importance with difficulty (−8% on easy, −18% on medium and hard). On harder tasks more principles accumulate (especially AVOID constraints from failures), and without filtering the planner receive...

Showing first 80 references.