3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

Bronislav Sidik; Dror Mizrahi

arxiv: 2604.11302 · v1 · submitted 2026-04-13 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3

keywords: robotic manipulation · Monte Carlo Tree Search · world models · spatial memory · occlusion · lookahead planning · persistent scene representation · continuous action spaces

The pith

3D-Anchored Lookahead Planning keeps a fixed world coordinate anchor so robots can plan reaches to objects hidden by occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3D-Anchored Lookahead Planning as a method that pairs Monte Carlo Tree Search with a 3D-consistent world model to maintain scene memory across camera views. Instead of reacting only to the current image, the approach anchors planning to a persistent camera-to-world transform that survives when objects leave direct view. In a five-step sequential reach task designed to test spatial memory, the method records success rates above 65 percent on steps that require recalling hidden locations, compared with near-zero performance from a greedy baseline that lacks memory. Ablation results attribute most of the gain to the tree-search component that reasons over the anchored model rather than to deeper search alone. The work also resolves specific structural issues that prevent standard UCT-MCTS from working reliably in continuous robot action spaces.

Core claim

3D-ALP maintains a persistent camera-to-world anchor inside an MCTS planner whose rollouts are generated by a 3D-consistent world model, allowing accurate value estimates and replanning for targets that are no longer visible in the current camera frame.

What carries the argument

The persistent camera-to-world (c2w) anchor combined with the 3D-consistent world model as the MCTS rollout oracle.
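To make the anchor concrete, here is a minimal sketch, not the authors' implementation. It assumes the anchor is a 4x4 rigid camera-to-world matrix; the helper names (`invert_rigid`, `apply`, `in_view`) and the square-frustum visibility test are illustrative assumptions. It shows that a target stored in world coordinates stays addressable after the camera moves and the target leaves the frame.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = List[List[float]]  # row-major 4x4 homogeneous transform (assumed layout)


def invert_rigid(c2w: Mat4) -> Mat4:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[c2w[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [c2w[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply(m: Mat4, p: Vec3) -> Vec3:
    """Apply a homogeneous transform to a 3D point."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(3))


def in_view(p_cam: Vec3, half_fov_tan: float = 1.0) -> bool:
    """Toy visibility test: in front of the camera and inside a square frustum."""
    x, y, z = p_cam
    return z > 0.0 and abs(x) <= half_fov_tan * z and abs(y) <= half_fov_tan * z


# Camera pose A: identity, so camera and world frames coincide.
c2w_a = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
# The target is observed 2 units in front of the camera and stored in world coordinates.
target_world = apply(c2w_a, (0.0, 0.0, 2.0))

# Camera pose B: shifted 5 units along world x, same orientation.
c2w_b = [[1.0, 0.0, 0.0, 5.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
target_in_b = apply(invert_rigid(c2w_b), target_world)

print("world position (unchanged):", target_world)
print("visible from pose B:", in_view(target_in_b))
```

In this toy setup the target falls outside the frustum at pose B, yet its world position is unchanged, which is the property a planner would need in order to keep scoring reaches toward it.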

Load-bearing premise

The 3D-consistent world model keeps accurate object positions even after they are occluded and the adapted UCT-MCTS produces stable value estimates in continuous action spaces.
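For reference, the selection rule that UCT-MCTS builds on can be sketched as below. This is the standard UCB1 score from Kocsis and Szepesvári, not the paper's adapted variant; the function names, the `stats` layout, and the default exploration constant are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple


def uct_score(q_mean: float, child_visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """UCB1 score: mean value plus an exploration bonus that shrinks with visits."""
    if child_visits == 0:
        return math.inf  # unvisited children are always tried first
    return q_mean + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_child(stats: Dict[str, Tuple[float, int]], parent_visits: int,
                 c: float = math.sqrt(2)) -> str:
    """Return the action with the highest UCT score; stats maps action -> (q_mean, visits)."""
    return max(stats, key=lambda a: uct_score(stats[a][0], stats[a][1], parent_visits, c))


# Hypothetical node statistics, invented for this example.
stats = {"left": (0.5, 10), "right": (0.4, 1)}
print(select_child(stats, parent_visits=11))          # exploration favours the rarely visited child
print(select_child(stats, parent_visits=11, c=0.0))   # pure exploitation favours the higher mean
```

The rule assumes a finite set of children per node. With continuous actions every newly sampled action is unvisited and receives an infinite score, which is one reason plain UCT needs adaptation in this setting; how 3D-ALP handles this is not described in the text above.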

What would settle it

Run the same five-step task with the world model replaced by a purely reactive policy that sees only the current camera frame. If the claim holds, memory-step success rates should drop to near zero.

Figures

Figures reproduced from arXiv: 2604.11302 by Bronislav Sidik, Dror Mizrahi.

**Figure 1.** 3D-ALP overview. Left: The memory gap between reactive policies and 3D-ALP on occluded-object tasks. Greedy reactive agents collapse to 0.6% SR; 3D-ALP’s persistent SE(3) anchor enables 82.2% SR on the hardest chained-memory step. Right: The 3D-ALP architecture: a persistent SE(3) spatial anchor updated via FK, a world-model-based MCTS tree that imagines future frames, and a hybrid geometric-semantic score…

**Figure 2.** 3D-ALP qualitative results. Top row: MCTS planning tree at steps 1, 3, and 5 (⋆). Each node is a candidate c2w pose; colour encodes Q-value (blue=low, teal=high); edge width = visit count; gold border = selected action. At step 5 (⋆), the planner navigates to a position that is no longer visible in the current frame; the c2w anchor retains its coordinates in the persistent tree. Bottom left: 3D EE trajec…

Original abstract

We present 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D-consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D-ALP maintains a persistent camera-to-world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5-step sequential reach task requiring spatial memory (Experiment E3), 3D-ALP achieves 0.650 0.109 success rate on memory-required steps versus 0.006 0.008 for a greedy reactive baseline (Δ = +0.645), while step 5 success reaches 0.822 against 0.000 for greedy. An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT-MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces 3D-Anchored Lookahead Planning (3D-ALP), which augments Monte Carlo Tree Search (MCTS) with a 3D-consistent world model and a persistent camera-to-world (c2w) anchor to support robotic manipulation tasks that require spatial memory across occlusions. Unlike reactive greedy policies, 3D-ALP performs lookahead planning in a stable 3D frame. On a 5-step sequential reach task (Experiment E3), it reports 0.650 ± 0.109 success on memory-required steps versus 0.006 ± 0.008 for the baseline, with step-5 success of 0.822 versus 0.000. An ablation on 30 episodes across 3 seeds attributes +0.533 (82%) of the gain to tree-search spatial memory and +0.111 (17%) to deeper lookahead. The work also identifies and resolves four structural failure modes when applying UCT-MCTS to continuous robotic action spaces.
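The headline arithmetic in the summary can be checked directly. The short script below uses only the rounded figures quoted above; the variable names are ours, and small discrepancies are expected because the paper's unrounded values are not available here.

```python
# Figures as quoted in the summary (already rounded to three decimals).
memory_3dalp = 0.650
memory_greedy = 0.006
reported_delta = 0.645
tree_search_gain = 0.533
lookahead_gain = 0.111

# Difference of the two rounded means, and the sum of the two ablation components.
delta_from_means = round(memory_3dalp - memory_greedy, 3)
component_sum = round(tree_search_gain + lookahead_gain, 3)

# Share of the reported delta attributed to each component.
share_tree = tree_search_gain / reported_delta
share_lookahead = lookahead_gain / reported_delta

print(f"delta from rounded means: {delta_from_means:.3f}")
print(f"sum of ablation components: {component_sum:.3f}")
print(f"tree-search share: {share_tree:.1%}, lookahead share: {share_lookahead:.1%}")
```

Both the difference of the rounded means and the sum of the components come to 0.644, one unit in the last digit below the reported +0.645, which would be consistent with the reported figures having been rounded independently. The shares work out to roughly 82.6% and 17.2%, in line with the quoted 82% and 17%.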

Significance. If the central assumptions hold, the approach offers a concrete mechanism for System-2-style lookahead in robotics that can maintain accurate beliefs about occluded objects, yielding large empirical gains on memory-dependent tasks. The provision of standard errors, a quantitative ablation isolating the memory component, and explicit treatment of MCTS failure modes are strengths that would make the result reproducible and extensible if the world-model accuracy is confirmed.

major comments (2)

[Experiment E3] Experiment E3 and ablation study: the headline delta (+0.645 success rate on memory-required steps) is attributed to the persistent c2w anchor enabling accurate MCTS rollouts to occluded object positions. No quantitative tracking error (e.g., mean position RMSE versus simulator ground truth for occluded objects) or noise-injection ablation into the rollout oracle is reported. Without this, the performance gain cannot be cleanly credited to 3D-anchored lookahead rather than an idealized simulator world model.
[MCTS failure modes discussion] Section on UCT-MCTS failure modes: the manuscript states that four structural failure modes were identified and resolved to produce reliable value estimates in continuous action spaces, yet provides no explicit verification, ablation, or metric showing that the proposed fixes systematically eliminate those modes under the experimental conditions. This verification is load-bearing for the claim that the resolved MCTS variants are reliable.
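The tracking-error metric requested in the first major comment could be computed along the following lines. This is a sketch under assumptions: positions are 3D points in a common world frame, the function name `position_rmse` is ours, and the example coordinates are invented rather than taken from the paper.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def position_rmse(predicted: Sequence[Vec3], ground_truth: Sequence[Vec3]) -> float:
    """Root-mean-square Euclidean error between predicted and true positions."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [
        sum((p - g) ** 2 for p, g in zip(p_vec, g_vec))
        for p_vec, g_vec in zip(predicted, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical positions of one occluded object over two steps (invented values).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]
print(position_rmse(predicted, ground_truth))
```

Reporting this number separately for visible and occluded steps would show whether the world model's accuracy degrades once the object leaves the frame.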

minor comments (2)

[Abstract] Abstract: success rates are written as '0.650 0.109' and '0.006 0.008' without the ± symbol or consistent formatting; standardize to conventional statistical reporting.
[Ablation study] Ablation study: the protocol states 30 episodes and 3 seeds, but it is unclear whether these numbers apply uniformly to all reported results (including the main E3 comparison) or only to the ablation; explicit statement would improve reproducibility.
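Both minor comments concern how per-seed results are aggregated and printed. A minimal sketch of one conventional format follows; the per-seed values are invented for illustration, the helper name `format_rate` is ours, and whether the paper's spread is a standard deviation or a standard error is not stated in the text above.

```python
import math
import statistics
from typing import Sequence


def format_rate(per_seed: Sequence[float], use_sem: bool = False) -> str:
    """Format per-seed success rates as 'mean ± spread' with three decimals."""
    mean = statistics.mean(per_seed)
    spread = statistics.stdev(per_seed)  # sample standard deviation across seeds
    if use_sem:
        spread /= math.sqrt(len(per_seed))  # standard error of the mean
    return f"{mean:.3f} ± {spread:.3f}"


# Hypothetical success rates from three seeds (not the paper's data).
per_seed = [0.25, 0.5, 0.75]
print(format_rate(per_seed))                # spread as standard deviation
print(format_rate(per_seed, use_sem=True))  # spread as standard error
```

Stating the number of seeds and which spread is used next to each figure would address both comments at once.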

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Experiment E3] Experiment E3 and ablation study: the headline delta (+0.645 success rate on memory-required steps) is attributed to the persistent c2w anchor enabling accurate MCTS rollouts to occluded object positions. No quantitative tracking error (e.g., mean position RMSE versus simulator ground truth for occluded objects) or noise-injection ablation into the rollout oracle is reported. Without this, the performance gain cannot be cleanly credited to 3D-anchored lookahead rather than an idealized simulator world model.

Authors: We agree that the manuscript would be strengthened by direct quantitative evidence of world-model accuracy on occluded objects. The existing ablation already isolates the tree-search spatial memory component (+0.533 success, 82% of total gain) as the primary driver, which depends on the persistent c2w anchor for rollout accuracy. In the revised version we will add (i) mean position RMSE of the world-model predictions versus simulator ground truth specifically for occluded objects in Experiment E3 and (ii) a noise-injection ablation into the rollout oracle. These additions will allow readers to evaluate the fidelity of the 3D world model independently of the planning gains.

revision: yes

Referee: [MCTS failure modes discussion] Section on UCT-MCTS failure modes: the manuscript states that four structural failure modes were identified and resolved to produce reliable value estimates in continuous action spaces, yet provides no explicit verification, ablation, or metric showing that the proposed fixes systematically eliminate those modes under the experimental conditions. This verification is load-bearing for the claim that the resolved MCTS variants are reliable.

Authors: We acknowledge that while the manuscript describes the four identified failure modes and the resolutions applied, it does not include explicit quantitative verification (e.g., before/after metrics on value-estimate reliability or failure-mode incidence) under the reported experimental conditions. In the revision we will add targeted analysis, such as metrics quantifying the reduction in each failure mode and the resulting improvement in value-estimate stability, evaluated on the same 30-episode, 3-seed setup used for the ablation study. This will provide the requested load-bearing evidence for the reliability of the resolved MCTS variants.

revision: yes
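The noise-injection ablation promised in the first response could take a shape like the toy experiment below. Everything here is an assumption made for illustration: the Gaussian noise model, the 0.05 reach tolerance, the target position, and the rule that a reach succeeds when the believed position is within tolerance of the true one. It is not the authors' protocol.

```python
import math
import random
from typing import Tuple

Vec3 = Tuple[float, float, float]


def noisy_belief(true_pos: Vec3, sigma: float, rng: random.Random) -> Vec3:
    """Perturb the remembered position with isotropic Gaussian noise."""
    return tuple(c + rng.gauss(0.0, sigma) for c in true_pos)


def reach_succeeds(belief: Vec3, true_pos: Vec3, tolerance: float) -> bool:
    """A reach toward the believed position succeeds if it lands within tolerance."""
    err = math.sqrt(sum((b - t) ** 2 for b, t in zip(belief, true_pos)))
    return err <= tolerance


def success_rate(sigma: float, trials: int = 200, tolerance: float = 0.05,
                 seed: int = 0) -> float:
    """Fraction of successful reaches at a given noise level (seeded, so repeatable)."""
    rng = random.Random(seed)
    target = (0.4, -0.1, 0.3)  # hypothetical occluded-object position
    hits = sum(
        reach_succeeds(noisy_belief(target, sigma, rng), target, tolerance)
        for _ in range(trials)
    )
    return hits / trials


for sigma in (0.0, 0.02, 0.05, 1.0):
    print(f"sigma={sigma}: success rate {success_rate(sigma):.3f}")
```

Plotting success against the injected noise level would show how much world-model error the planner tolerates before the memory advantage disappears.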

Circularity Check

0 steps flagged

No circularity; empirical results independent of self-defined quantities

The paper reports success rates and ablation deltas from direct experimental comparisons (E3 task, 30 episodes, 3 seeds) against a greedy baseline and internal variants. No equations, fitted parameters, or predictions are presented that reduce the claimed gains (+0.645 overall, +0.533 from tree search) to quantities defined by the method itself. The 3D c2w anchor and MCTS rollout oracle are evaluated via external task metrics rather than by construction or self-citation chains. The derivation chain consists of algorithmic description plus empirical measurement and is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that a learned or provided 3D world model can serve as a faithful rollout oracle and that standard MCTS can be adapted to continuous spaces once four structural issues are fixed; no explicit free parameters or new physical entities are introduced beyond the proposed anchoring technique.

axioms (2)

domain assumption: A 3D-consistent world model can accurately predict scene states from actions even under occlusion.
Invoked when the model is used as the MCTS rollout oracle for replanning to unobserved object positions.
domain assumption: UCT-MCTS can be made effective for continuous robotic manipulation once four structural failure modes are resolved.
Stated directly in the abstract as a contribution of the work.

invented entities (1)

persistent camera-to-world (c2w) anchor (no independent evidence)
purpose: Maintains fixed 3D object positions across camera motion and occlusion for replanning
Core novel component enabling the memory capability; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5529 in / 1592 out tokens · 72142 ms · 2026-05-10T15:00:33.346246+00:00 · methodology

Reference graph

Works this paper leans on

[1] G. Best, O. Cliff, T. Patten, R. Mettu, and R. Fitch. Dec-MCTS: Decentralized planning for multi-robot active perception. International Journal of Robotics Research, 38(2–3):316–337, 2019.

[2] S. Bhatt et al. Aligning robots’ uncertainty with inherent task ambiguity. In NeurIPS, 2024.

[3] GigaBrain Team. GigaBrain-0.5M*: A VLA that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026.

[4] GigaWorld Team. GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861, 2025.

[5] W. Guo, G. Lu, H. Deng, Z. Wu, Y. Tang, and Z. Wang. VLA-Reasoner: Empowering vision-language-action models with reasoning via online Monte Carlo tree search. arXiv preprint arXiv:2509.22643, 2025.

[6] D. Hafner et al. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

[7] N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control. In International Conference on Learning Representations (ICLR), 2024.

[8] M. Jaderberg et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[9] M. Kim et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[10] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning (ECML), pages 282–293, 2006.

[11] M. Lauri, D. Hsu, and J. Pajarinen. Partially observable Markov decision processes in robotics: A survey. IEEE Transactions on Robotics, 2023.

[12] J. Luo, C. Xu, J. Wu, and S. Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 2024.

[13] L. Maes, Q. Le Lidec, D. Scieur, Y. LeCun, and R. Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026.

[14] π0 Team. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024.

[15] M. Schadd, M. Winands, H. van den Herik, G. Chaslot, and J. Uiterwijk. Single-player Monte-Carlo tree search. In Computers and Games, pages 1–12, 2008.

[16] J. Schrittwieser et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020.

[17] H.S. Shahgir, X. Chen, Y. Fu, E. Shayegani, N. Abu-Ghazaleh, Y. Kementchedjhieva, and Y. Dong. VLMs need words: Vision language models ignore visual detail in favor of semantic anchors. arXiv preprint arXiv:2604.02486, 2026.

[18] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[19] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

[20] B. Xiao et al. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[21] A. Ye, C. Ni, et al. GigaWorld-Policy: An efficient action-centered world–action model. arXiv preprint arXiv:2603.17240, 2026.

[22] X. Zhang et al. InSpatio-WorldFM: An open-source real-time generative frame model. arXiv preprint arXiv:2603.11911, 2026.

[23] Z. Zhou et al. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.
