RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3
The pith
How well a memory representation works for a robotic policy depends on the specific task; no single design is best across all tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. This conclusion rests on a benchmark of sixteen manipulation tasks built under a taxonomy of temporal, spatial, object, and procedural memory, together with experiments on fourteen memory-augmented variants of a single base model.
What carries the argument
A taxonomy that divides memory requirements into temporal, spatial, object, and procedural categories, used to structure both the creation of test tasks and the comparison of integration strategies.
If this is right
- Model builders should match memory mechanisms to the dominant requirement of a task, such as counting steps for temporal needs or recovering from occlusions for object needs.
- Standardized benchmarks make it possible to measure incremental progress in history-dependent robotic manipulation instead of relying on isolated demonstrations.
- Generalist policies may need to incorporate multiple memory types or switch between them when facing varied task demands.
Where Pith is reading between the lines
- Hybrid memory systems that detect task features and activate the most suitable representation could extend performance across a wider range of scenarios.
- Applying the same taxonomy to physical robot experiments would reveal whether simulation results hold when sensor noise and actuation errors are present.
- Similar task-dependent patterns may appear in other sequential control problems such as navigation or assembly planning.
Load-bearing premise
The four memory categories accurately capture the needs of real long-horizon robotic manipulation and the sixteen tasks are representative enough to support general conclusions.
What would settle it
A follow-up test in which one memory design outperforms all others on every task in the benchmark, or on a fresh collection of long-horizon, history-dependent tasks built under the same taxonomy.
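The falsification test above can be made concrete with a small check over a table of per-task success rates. The sketch below is only an illustration: the task names, variant names, and numbers in `RESULTS` are hypothetical placeholders, not results from the paper, and `dominant_variant` and `best_per_task` are helper names introduced here.

```python
from typing import Dict, Optional

# Hypothetical success rates (task -> {variant -> rate}).
# These numbers are placeholders, NOT results reported in the paper.
RESULTS: Dict[str, Dict[str, float]] = {
    "counting_task": {"recurrent": 0.70, "attention": 0.55, "external": 0.60},
    "occlusion_task": {"recurrent": 0.40, "attention": 0.65, "external": 0.62},
    "ordering_task": {"recurrent": 0.50, "attention": 0.45, "external": 0.72},
}


def dominant_variant(
    results: Dict[str, Dict[str, float]], margin: float = 0.0
) -> Optional[str]:
    """Return a variant that beats every other variant on every task
    by more than `margin`, or None if no such variant exists."""
    if not results:
        return None
    # Only compare variants that were evaluated on every task.
    variants = set.intersection(*(set(scores) for scores in results.values()))
    for candidate in sorted(variants):
        if all(
            scores[candidate] > scores[other] + margin
            for scores in results.values()
            for other in variants
            if other != candidate
        ):
            return candidate
    return None


def best_per_task(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the highest-scoring variant for each task."""
    return {task: max(scores, key=scores.get) for task, scores in results.items()}


if __name__ == "__main__":
    print("dominant variant:", dominant_variant(RESULTS))
    print("best per task:", best_per_task(RESULTS))
```

If `dominant_variant` returned a name on the full benchmark (ideally with a margin larger than the seed-to-seed noise), that would count against the task-dependence claim; a `None` result with different winners per task is consistent with it.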
Original abstract
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RoboMME, a standardized benchmark of 16 long-horizon robotic manipulation tasks organized under a taxonomy of temporal, spatial, object, and procedural memory. It constructs 14 memory-augmented variants of the π0.5 VLA backbone via different integration strategies (recurrent, attention-over-history, external memory) and reports that memory effectiveness is highly task-dependent, with each design showing distinct advantages and limitations.
Significance. If the results hold after addressing controls, this benchmark could standardize evaluation of memory mechanisms in VLA models and clarify design trade-offs for long-horizon robotics. The public release of code and videos supports reproducibility and is a clear strength.
major comments (2)
- [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.
- [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.
minor comments (2)
- [§3] The taxonomy is introduced without explicit validation against real-world long-horizon task distributions; a short discussion or reference to how the 16 tasks were selected would strengthen the claim of representativeness.
- [Figures/Tables] Figure legends and tables comparing the 14 variants should include explicit capacity metrics to aid interpretation.
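The capacity metrics the referee asks for could be tabulated along the following lines. This is a minimal sketch under stated assumptions: the parameter counts are made-up placeholders (the source reports none), `VariantCapacity` and `capacity_report` are names introduced here, and the 5% tolerance is an arbitrary illustrative threshold for flagging variants that are not roughly capacity-matched.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class VariantCapacity:
    name: str
    total_params: int  # backbone plus memory module (hypothetical count)


def relative_overhead(variant_params: int, backbone_params: int) -> float:
    """Extra parameters of a variant as a fraction of the backbone size."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return (variant_params - backbone_params) / backbone_params


def capacity_report(
    backbone_params: int,
    variants: Sequence[VariantCapacity],
    tolerance: float = 0.05,
) -> List[Tuple[str, int, float, bool]]:
    """Return (name, params, overhead, within_tolerance) for each variant."""
    rows = []
    for variant in variants:
        overhead = relative_overhead(variant.total_params, backbone_params)
        rows.append((variant.name, variant.total_params, overhead, overhead <= tolerance))
    return rows


if __name__ == "__main__":
    # Placeholder sizes only; real counts would come from the trained models.
    backbone = 1_000_000_000
    variants = [
        VariantCapacity("recurrent", 1_020_000_000),
        VariantCapacity("external_memory", 1_080_000_000),
    ]
    for name, params, overhead, ok in capacity_report(backbone, variants):
        print(f"{name:16s} {params:>13,d} {overhead:6.1%} {'matched' if ok else 'NOT matched'}")
```

Reporting the overhead next to each success rate lets a reader judge whether a performance gap could plausibly be a capacity effect rather than a property of the memory design.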
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below and indicate the planned revisions.
-
Referee: [§4] §4 (Experimental variants): The 14 variants are built by different integration strategies on the π0.5 backbone, yet the manuscript provides no parameter counts, FLOPs, or capacity-matched controls. This leaves open the possibility that task-dependent performance gaps reflect incidental differences in model capacity or training dynamics rather than intrinsic properties of the temporal/spatial/object/procedural taxonomy.
Authors: We agree that parameter counts and FLOPs are necessary for transparent comparison. In the revised version we will add a dedicated table reporting parameter counts and estimated FLOPs for each of the 14 variants relative to the π0.5 backbone. On capacity-matched controls, the variants modify only the memory integration module while freezing the core VLA weights and architecture; this keeps overall capacity differences modest (typically <5% additional parameters). Nevertheless, to directly address the concern we will include a new paragraph discussing capacity implications and, where feasible, report results from a capacity-matched ablation that equalizes total parameters across representative variants. revision: partial
-
Referee: [Results] Results section and abstract: The central claim of task-dependent effectiveness is presented without error bars, statistical significance tests, details on task construction, or data exclusion criteria. This makes it impossible to assess whether the reported patterns are reliable or generalizable beyond the specific 16 tasks.
Authors: We accept that the current presentation lacks sufficient statistical detail. We will augment all result figures with error bars computed over multiple random seeds and add statistical significance tests (paired t-tests with Bonferroni correction) between memory variants on each task. Task construction details appear in Section 3, but we will expand this section with explicit criteria used to isolate temporal, spatial, object, and procedural memory requirements. No data points were excluded from the reported results; we will state this explicitly and describe the full evaluation protocol (including number of trials per task) to improve reproducibility and generalizability assessment. revision: yes
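The paired comparison with Bonferroni correction that the authors promise can be sketched as follows. Note the substitution: to stay within the Python standard library, this sketch uses an exact paired sign-flip permutation test in place of the paired t-test named in the response; the Bonferroni step is the same in either case. The per-seed success rates are hypothetical placeholders, and the function names are introduced here for illustration.

```python
from itertools import product
from statistics import fmean
from typing import Dict, Sequence, Tuple


def paired_permutation_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip test on the paired differences a[i] - b[i].

    Stand-in for a paired t-test: under the null hypothesis each difference
    is equally likely to have either sign, so all 2**n sign patterns are
    enumerated (feasible only for a small number of seeds n).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equally long, non-empty samples")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(fmean(diffs))
    hits = 0
    total = 0
    for signs in product((1.0, -1.0), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance guards against floating-point ties.
        if abs(fmean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(pvalues: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(pvalues)
    return {pair: min(1.0, p * m) for pair, p in pvalues.items()}


if __name__ == "__main__":
    # Hypothetical per-seed success rates on one task (8 seeds per variant).
    scores = {
        "recurrent": [0.70, 0.72, 0.68, 0.74, 0.71, 0.69, 0.73, 0.70],
        "attention": [0.55, 0.60, 0.52, 0.58, 0.57, 0.54, 0.59, 0.56],
        "external": [0.69, 0.73, 0.66, 0.75, 0.70, 0.71, 0.72, 0.68],
    }
    names = sorted(scores)
    raw = {
        (x, y): paired_permutation_pvalue(scores[x], scores[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
    for pair, p_adj in bonferroni(raw).items():
        print(pair, f"adjusted p = {p_adj:.4f}")
```

One design consequence worth noting: with n seeds the smallest attainable two-sided p-value of this exact test is 2 / 2**n, so with five seeds no comparison can reach 0.05 even before correction. The number of seeds therefore bounds what any significance claim can show.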
Circularity Check
Empirical benchmark with no derivation chain or self-referential reductions
This is a purely empirical benchmarking paper that introduces a taxonomy of memory types, constructs 16 tasks, and evaluates 14 variants on a fixed backbone through direct experimentation. No mathematical derivations, first-principles predictions, or fitted parameters are claimed to produce new results; the central claims rest on observed performance differences across tasks. The taxonomy and variants are presented as design choices for systematic comparison rather than outputs derived from prior results within the paper. No self-citation is used to justify uniqueness or forbid alternatives, and no step reduces to an input by construction. The skeptic concern about capacity matching is a validity issue, not a circularity reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The taxonomy of temporal, spatial, object, and procedural memory covers the relevant history-dependent aspects of robotic manipulation tasks.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
RoboMME categorizes memory into four cognitive dimensions: (1) temporal memory for event accumulation and ordering; (2) spatial memory for tracking object locations under occlusion...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
-
vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- Place one cube of a specified color, e.g., “put one blue cube into the bin and press the button to stop”.
- Place cubes of two specified colors, e.g., “put one blue cube and two green cubes into the bin and press the button to stop”.
- Place cubes of three specified colors, e.g., “put one blue cube, one green cube and two red cubes into the bin and press the button to stop”.
Task Characteristics. To introduce dynamic and history-dependent behavior, we consider two settings randomly selected at environment initialization: 1. Static: all cubes are present at the beginning of the episode. 2. Strea...
- Pick and place one time, e.g., “pick up the blue cube and place it on the target, then press the button to stop”.
- Pick and place multiple times, e.g., “pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop”.
Successful Pick-and-Place. A pick-and-place is considered successful when the robot lifts the cube above a predefined height threshold while maintaining a valid grasp, and then lowers it onto the target ...
- The robot picks up a wrong cube.
- The button is pressed before all required repetitions are completed.
- The robot performs more pick-and-place repetitions than specified.
Figure 14: SwingXtimes Task Example. In this instance, the goal is to pick up the green cube, first move it to the right-side target, then put down the cube on the left-side target (i.e., swing between targets one ...
- Perform one swing cycle, e.g., “Pick up the green cube, move it to the right-side target and then put down the cube on the left-side target and press the button to stop”.
- Perform multiple swing cycles, e.g., “Pick up the green cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.”
Successful Reach. A reach is successful when the cube is held nearly upright and positioned within a small tolerance of the...
- The robot picks up the wrong cube.
- The button is pressed before all repetitions are completed.
- The robot reaches either target more than the specified number of times (excessive swings).
Figure 15: StopCube Task Example. In this instance, the goal is to press the button to stop the cube exactly at the target on its second visit.
E.4. StopCube
Table 16: StopCube Task Configura...
“press the button to stop the cube exactly at the target on its third visit”
- Static Behavior: To succeed, the robot must position its end-effector over the button and remain static (hovering) until the correct timing.
- Immediate Stop: Pressing the button stops the cube instantly. The robot must press at the exact moment the cube reaches the target zone in the specified cycle, accounting for the motion delay from hover to button.
Success Criteria
- Precise Synchronization: The button must be pressed strictly within the time window when the cube overlaps with the target.
- C...
- Pick one container hiding a specified cube, e.g., “Watch the video carefully, then pick up the container hiding the green cube.”
- Pick two containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”
Task Characteristics. Each episode consists of a video phase followed by an execution phase.
- Video: Multiple containers are placed on the table. Each cube (red/green/blue) is...
- Pick one specified container by cube color, e.g., “First press the button on the table, then pick up the container hiding the green cube.”
- Pick two specified containers sequentially (order matters), e.g., “First press the button on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”
Task Characteristics
- Button Pressing: The robot needs to press the button at the beginning. During the press, multiple containers are concurrently p...
- The robot picks up an incorrect container (i.e., one hiding a non-specified cube).
Figure 18: VideoUnmaskSwap Task Example. In this instance, the robot first watches the video, then picks up the container hiding the blue cube, followed by the one hiding the green cube. The vide...
- Pick one specified container, e.g., “Watch the video carefully, then pick up the container hiding the blue cube.”
- Pick two specified containers sequentially (order matters), e.g., “Watch the video carefully, then pick up the container hiding the blue cube, and finally pick up the container hiding the red cube.”
Task Characteristics. Each episode consists of a video phase followed by an execution phase.
- Video: Multiple containers are placed on the table. Each cube (red/green...
- Pick one specified container, e.g., “First press both buttons on the table, then pick up the container hiding the red cube.”
- Pick two specified containers sequentially (order matters), e.g., “First press both buttons on the table, then pick up the container hiding the green cube, and finally pick up the container hiding the red cube.”
Task Characteristics
- Button Pressing: The robot needs to press both buttons. During the pressing, multiple containers are placed on the table to en...
- The robot picks up any container before completing the button-press phase.
- The robot picks up an incorrect container (i.e., one hiding a non-specified cube).
Figure 20: PickHighlight Task Example. In this instance, the goal is to first press the button, then pick up all cubes highlighted by white area, and finally press the button again to stop. During the bu...
“first press the button, then pick up all highlighted cubes, finally press the button again to stop”
- The robot fails to press the button before attempting a pick.
- The robot picks up a wrong cube (i.e., a cube that was not highlighted).
Figure 21: VideoRepick Task Example. In this instance, the robot first watches a video, then picks up the same previously picked cube twice, and finally presses the button to stop. Red-bordered frames d...
- the robot picks up the wrong cube,
- the button is pressed before finishing 𝑁 repetitions, or
- the robot completes more than 𝑁 repetitions.
Figure 22: VideoPlaceButton Task Example. In this example, the robot first observes a video depicting an interleaved sequence of cube placements and button presses, and then places the cube onto the correct target corresponding to its ...
- The robot picks up an incorrect peg.