{"total":14,"items":[{"citing_arxiv_id":"2606.30318","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chronos: A Physics-Informed Full-History Framework for Non-Markovian Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-29T14:00:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27677","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DIM-WAM: World-Action Modeling with Diverse Historical Event Memory","primary_cat":"cs.RO","submitted_at":"2026-06-26T03:17:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20092","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-06-18T11:11:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 
simulation and 4 real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12499","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action-Effect Memory Pretraining for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-10T13:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10363","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-09T03:22:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04172","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-02T19:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region 
correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30877","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Wall-OSS-0.5 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-29T06:04:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03269","ref_index":32,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"fast, generalize better. In Advances in Neural Information Processing Systems, 2025. [31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024. [32] Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. 
Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025. [33] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei,"},{"citing_arxiv_id":"2605.01448","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-02T13:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"where p ∈ R^3 is the end-effector position in the robot base frame, q ∈ H is a unit quaternion specifying orientation, and g is a binary gripper command (e.g., g=1 for open, g=0 for close). A.2. Discrete action format for the LLM For LLM interaction, we discretize translation and rotation into integer bins and represent each action as a 7-tuple: a = [i_x, i_y, i_z, i_r, i_p, i_ψ, g], (7) where (i_x, i_y, i_z) are voxel indices for translation, (i_r, i_p, i_ψ) are discrete bins for Euler angles (roll, pitch, yaw), and g is the gripper command. A.3. 
Encoding: continuous control → discrete LLM tokens Translation discretization. Let the workspace bounds be an axis-aligned box b_min ∈ R^3, b_max ∈ R^3, (8) and let V be the number of uniform bins per axis (we use V=100)."},{"citing_arxiv_id":"2604.18933","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Memory Policy","primary_cat":"cs.RO","submitted_at":"2026-04-21T00:14:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• A comprehensive empirical study in simulation and the real world, examining memory-related design choices and their impact on performance, robustness, and computational efficiency. II. RELATED WORK A. Structured Memory for Robot Policy Prior memory-based policies have typically utilized action parameterization or trajectory tracking, such as storing keyframe heatmaps [14], referencing object trajectories [6, 12], or visual trace overlays [53]. Alternatively, semantic and latent representations can be used to store history, such as environment dynamics estimation [25], LLM-generated textual plans [33], or vector database retrieval of observation and prompt embeddings [1]. 
Despite their success, structured memory often requires manual design, such as predefined"},{"citing_arxiv_id":"2604.15483","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"angyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025. [32] Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025. [33] Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025. 
[34] Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, et al."},{"citing_arxiv_id":"2602.20323","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning","primary_cat":"cs.RO","submitted_at":"2026-02-23T20:18:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"it interacts with the world? The literature offers four broad answers, summarized in Table I. The most ambitious option updates the model itself, through online reinforcement learning [41, 62, 52, 47], meta-learning [22], test-time training [57], or imitation finetuning of a memory-augmented backbone, as in MemER [55], MemoryVLA [51, 35], MEM [61], and SAM2Act [18]. A second family keeps the base frozen and feeds it context through retrieval [34, 24, 36] or natural-language reflection [53, 42]. 
The piece they leave out is a check on whether a remembered experience still applies in the current scene; our experiments make this concrete: selective retrieval matches the no-memory baseline (53% on medium), and free-form reflection plateaus at 61%."},{"citing_arxiv_id":"2505.23617","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory","primary_cat":"cs.CV","submitted_at":"2025-05-29T16:25:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.02818","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields","primary_cat":"cs.RO","submitted_at":"2024-12-03T20:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}