Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action
Pith reviewed 2026-05-22 05:54 UTC · model grok-4.3
The pith
SOMA equips VLA models with persistent spatial memory to manipulate objects initially outside the camera view.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a spatial memory constructed from multi-view observations via a movable head camera, refined dynamically for consistency, and retrieved contextually during action, allows VLAs to reason and act effectively on targets that are out of the current visual field, resulting in higher success rates and more efficient behaviors such as reduced search and one-shot grasping.
What carries the argument
The three-part spatial memory framework: construction by aggregating angular observations into a unified representation, dynamic refinement to maintain consistency, and contextual retrieval of instruction-relevant cues.
If this is right
- Task success rates rise on five real-world out-of-vision manipulation tasks including multi-step and dual-arm scenarios.
- Behaviors shift to faster target localization with less viewpoint searching.
- Near one-shot grasping occurs under partial observability.
- The memory approach also improves or maintains performance in fully observable simulation settings like RoboCasa and SimplerEnv.
Where Pith is reading between the lines
- This memory mechanism could be adapted to other robot perception systems facing similar visibility constraints.
- Explicit spatial representations might become a standard addition to VLA architectures to handle real-world uncertainty.
- Reducing dependence on continuous visual search could lower energy use and wear on robot actuators.
Load-bearing premise
The approach assumes that angular-wise observations from a movable head camera can be reliably aggregated into a unified spatial-semantic representation that remains globally consistent over time without significant drift or semantic errors in real-world conditions.
What would settle it
Running the system in an environment where camera scans accumulate positional drift or misidentify objects would show whether the memory actually supports the claimed improvements or collapses to baseline performance.
Figures
read the original abstract
We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SOMA, a spatial memory framework for Vision-Language-Action models to handle out-of-vision manipulation. It builds a persistent spatial-semantic representation by aggregating multi-view observations from a movable head camera via Spatial Memory Construction, maintains consistency through Dynamic Memory Refinement, and retrieves relevant cues with Contextual Memory Retrieval. Experiments on five real-world out-of-vision tasks (including multi-step and dual-arm) plus RoboCasa GR1 and SimplerEnv simulations report improved success rates and qualitatively different behaviors such as faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability.
Significance. If the empirical claims hold with proper validation, this work addresses a key limitation in current VLAs (implicit full observability) and could meaningfully advance robust robotic manipulation in unstructured, partially observable environments. The focus on inducing qualitatively different behaviors, rather than incremental gains, is a notable strength, as is the release of code for reproducibility.
major comments (2)
- [§4.3 and §3.1] §4.3 (Dynamic Memory Refinement) and §3.1 (Spatial Memory Construction): the central claim that multi-view angular observations yield a globally consistent representation enabling 'near one-shot grasping' and 'reduced viewpoint search' requires quantitative evidence of cumulative localization error, semantic label stability, and failure modes under realistic head-camera motion, calibration drift, or occlusions; without metrics (e.g., positional drift after N scans), the attribution to memory rather than base VLA policy remains unverified.
- [§5] §5 (Experimental Results) and associated tables: the reported improvements in task success rates on the five real-world tasks lack explicit numerical values, strong baselines, ablations isolating each component, or statistical tests, which are load-bearing for supporting both the quantitative gains and the qualitative behavior claims.
minor comments (2)
- [Abstract] Abstract: adding concrete success-rate deltas or key quantitative highlights would improve readability without altering the narrative.
- [Figures] Figure 3 or equivalent (memory visualization): ensure scale bars, coordinate frames, and error annotations are present to allow readers to assess spatial consistency directly.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important opportunities to strengthen the empirical validation of SOMA's spatial memory components. We address each major comment below and describe the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [§4.3 and §3.1] §4.3 (Dynamic Memory Refinement) and §3.1 (Spatial Memory Construction): the central claim that multi-view angular observations yield a globally consistent representation enabling 'near one-shot grasping' and 'reduced viewpoint search' requires quantitative evidence of cumulative localization error, semantic label stability, and failure modes under realistic head-camera motion, calibration drift, or occlusions; without metrics (e.g., positional drift after N scans), the attribution to memory rather than base VLA policy remains unverified.
Authors: We agree that quantitative metrics on memory consistency would provide stronger support for attributing performance gains to the proposed components. The current manuscript focuses on end-task success and observed behavioral changes, but does not report explicit measures such as positional drift, label stability, or detailed failure analysis under motion and occlusion. In the revised manuscript we will add a dedicated subsection with these metrics from our real-world setups, including average positional drift after successive scans, semantic consistency scores, and documented failure cases. This addition will help clarify the contribution of the memory framework relative to the base VLA policy. revision: yes
-
Referee: [§5] §5 (Experimental Results) and associated tables: the reported improvements in task success rates on the five real-world tasks lack explicit numerical values, strong baselines, ablations isolating each component, or statistical tests, which are load-bearing for supporting both the quantitative gains and the qualitative behavior claims.
Authors: We acknowledge that the experimental section would benefit from more explicit numerical reporting, additional baselines, component ablations, and statistical analysis. While the manuscript presents success rates across the five tasks and notes qualitative differences, the tables and text do not include all requested details. In the revision we will expand the results section and tables to report precise success percentages, introduce stronger baselines (including memory-ablated variants), provide ablations for each of the three modules, and include statistical significance tests over repeated trials. These changes will better substantiate both the quantitative improvements and the claims of qualitatively different behaviors. revision: yes
Circularity Check
No significant circularity; empirical framework validated by task performance
full rationale
The paper proposes SOMA, a spatial memory framework with three components (Spatial Memory Construction via scanning, Dynamic Memory Refinement, and Contextual Memory Retrieval) to enable out-of-vision manipulation in VLAs. Claims of improved success rates, faster localization, and near one-shot grasping rest entirely on experimental results from five real-world tasks plus RoboCasa and SimplerEnv evaluations. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central contribution is an empirical system whose effectiveness is measured directly against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Multi-view observations acquired via a movable head camera can be aggregated into a unified spatial-semantic representation through scanning.
- domain assumption Dynamic Memory Refinement can maintain global consistency over time without significant drift.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Spatial Memory Construction... aggregates angular-wise observations into a unified spatial–semantic representation through scanning; Dynamic Memory Refinement... similarity-aware fusion
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Contextual Memory Retrieval aligns instruction-grounded vision–language embeddings with the spatial memory
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Excessive Actuation Latency.Both gripper and joints commands are monitored for temporal alignment. Any se- quence exhibiting ≥300 ms delay between user input and recorded robot actuation is marked invalid, as such latency typically arises from communication bottlenecks or onboard processing delays and leads to distorted motion supervision
-
[2]
Unintentional Gripper Oscillation.Gripper trajecto- ries containing rapid, continuous open–close toggling are flagged as operator mis-triggers and discarded, as these patterns do not reflect meaningful manipulation intent
-
[3]
Gripper Value Spikes.The gripper sensor normally out- puts values within the 0–1000 range. We label as cor- rupted any trajectory containing systematic spiking (e.g., 1001–1010) caused by low-level hardware protocol noise or overflow. Such discontinuities compromise the smoothness required for imitation learning
-
[4]
These issues hinder vi- sual–kinematic alignment and degrade learning
Camera View Contamination.Demonstrations are re- jected if the robot cameras capture large human-body in- trusions, missing robot arms, or severe motion blur from aggressive operator movement. These issues hinder vi- sual–kinematic alignment and degrade learning
-
[5]
Invalid Reach Strategy.For humanoid-arm grasping, we enforce a consistent approach strategy. Specifically, trajec- tories employing top-down, table-type grasping are filtered 13 Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action out, as such motions contradict the intended horizontal, in- hand-level grasping typical for humanoid robot...
-
[6]
Raw RGB-D recordings are capped at 25 FPS
Video–Joint Stream Misalignment (Frame Dropping). Raw RGB-D recordings are capped at 25 FPS. Any sequence with ≥2.5 dropped frames is removed to avoid misalignment between visual frames and joint trajectories, which is critical for paired multimodal supervision. Across all criteria, the review ensures that only temporally aligned, semantically valid, and ...
-
[7]
Invisible-to-Invisible PnPrequires the robot to grasp a target object that is outside the current field of view and place it at a target location that is also outside the field of view, testing the ability to operate entirely beyond direct visual observation
-
[8]
Visible-to-Invisible PnPinvolves grasping a target object that is initially visible and placing it at a target location out- side the current view, evaluating whether spatial information about unseen goal locations can be recalled and utilized
-
[9]
Invisible-to-Visible PnPrequires the robot to grasp an object that is initially outside the field of view and place it at a visible target location, testing spatial recall of unseen objects followed by precise execution
-
[10]
Sequential Dual-Object PnPextends the setting to multi- stage manipulation: the robot first performs pick-and-place for a visible object, and subsequently grasps and places a second object located outside the field of view, stressing memory persistence across sequential task stages
-
[11]
Each task involves three distinct objects and was collected with400expert demonstrations
Dual-Arm Coordination PnPrepresents the most chal- lenging scenario, where both the target object and target placement location are outside the field of view, and suc- cessful execution requires coordinated dual-arm handover followed by accurate placement based on a globally consis- tent spatial representation. Each task involves three distinct objects an...
work page 2024
-
[12]
Container Interaction Tasks(e.g., placing items into cab- inets, drawers, or microwaves and subsequently closing them), testing multi-stage pick–place–close behaviors with articulated receptacles
-
[13]
Cooking Preparation Tasks(e.g., transferring objects from a cutting board into pans, pots, or baskets), simulating preparatory cooking steps that require stable grasping and reliable object placement
-
[14]
Tabletop Serving Tasks(e.g., moving items from a place- mat to bowls or plates), capturing serving-style behaviors that depend on precise spatial reasoning
-
[15]
Dish Transfer Tasks(e.g., moving objects between dish- ware such as plate→bowl), evaluating fine-grained coordi- nation during mid-meal or cleanup scenarios
-
[16]
Tray Organization Tasks(e.g., transporting items from a tray to multi-level shelves or containers), assessing multi- target placement and organized spatial planning. 16 Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action Task Category Diffusion Policy(Chi et al., 2025)GR00T N1.5(Bjorck et al., 2025)SOMA (Ours) 30 100 300 Full 30 100 30...
work page 2025
-
[17]
First-Fixation Time.First-Fixation Time measures the temporal latency between the moment the target object first enters the head camera’s field of view and the initiation of the robot’s grasping motion. Specifically, it is defined as the 17 Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action Task Category Update Strategy Retrieval Modu...
work page 2020
-
[18]
Head Search Path Length.Head Search Path Length measures the total amount of head motion required to bring the target object into view. Let t∈ {0,1, . . . , T−1} index environment steps in an episode, and let (ψt, ϕt) denote the head pan (yaw) and tilt (pitch) angles at step t, measured in degrees. Let tvis denote the step at which the target object first...
-
[19]
Each grasp attempt is defined as a completed gripper closing action
Grasp Attempt Count.Grasp Attempt Count measures the number of grasp executions issued by the controller until a successful grasp is achieved. Each grasp attempt is defined as a completed gripper closing action. Lower values indicate more reliable target localization and manipulation, with values close to one corresponding to near one-shot grasping behavior
-
[20]
Time-to-Grasp.Time-to-Grasp measures the total number of environment steps from episode start until a successful grasp is completed. This metric aggregates both perception and manipulation efficiency, reflecting the overall speed of task execution under partial observability. D.3. Quantitative Results As shown in Table 9, SOMA achieves SOTA performance on...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.