Planning with the Views

Jiajun Wu; Kangrui Wang; Leonidas Guibas; Li Fei-Fei; Lijuan Wang; Linjie Li; Manling Li; Shiqi Chen; Zhengyuan Yang; Zihan Wang

arxiv: 2605.29563 · v3 · pith:VWW4R72Inew · submitted 2026-05-28 · 💻 cs.AI · cs.CV · cs.RO

Pith reviewed 2026-06-29 07:51 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.RO

keywords view planning · vision-language models · 3D scene understanding · self-exploration · view graph · multi-turn planning · interactive planning · ScanNet

The pith

VLMs understand single camera moves but fail to chain them into multi-turn plans until self-exploration trajectories are distilled into a view graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision-language models can predict how one camera action changes a view and then compose many such predictions to reach a target viewpoint in 3D space. Experiments across 13 frontier models show they hold basic view-action knowledge yet cannot compose it for longer sequences, and the failure rate grows with greater distance between start and target views. The proposed remedy is an iterative loop that lets the model explore a scene, collects every trajectory into a single view graph, and distills that graph into varied supervised tasks. This process raises success on interactive view planning from 2.5 percent to 47.8 percent for a 7B model, exceeding the scores of much larger frontier systems.

Core claim

Frontier VLMs possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. An iterative framework that alternates self-exploration with view graph distillation, where all trajectories regardless of outcome form a compact view graph distilled into supervised tasks, overcomes sparse rewards and lifts Qwen2.5-VL-7B to 47.8 percent success, surpassing GPT-5.4 Pro and Gemini 3.1 Pro.

What carries the argument

View graph distillation, in which every self-exploration trajectory is aggregated into a compact graph of viewpoint connections and then turned into diverse supervised tasks that reshape the policy distribution.
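
To make the mechanism concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that views are hashable IDs, that each trajectory is a list of (view, action, next_view) steps, and that one kind of supervised task is a (start, target, shortest action plan) triple. The function names and the toy scene are hypothetical.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge all trajectories, successful or not, into view -> {action: next_view}."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph


def shortest_action_plan(graph, start, target):
    """Breadth-first search; returns the shortest action list, or None if unreachable."""
    if start == target:
        return []
    parent = {start: None}  # view -> (previous view, action taken), None for the start
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for action, next_view in graph.get(view, {}).items():
            if next_view in parent:
                continue
            parent[next_view] = (view, action)
            if next_view == target:
                plan, node = [], target
                while parent[node] is not None:
                    node, act = parent[node]
                    plan.append(act)
                return plan[::-1]
            queue.append(next_view)
    return None


def distill_tasks(graph):
    """Turn every connected (start, target) pair into one supervised planning example."""
    views = set(graph) | {v for edges in graph.values() for v in edges.values()}
    tasks = []
    for start in sorted(views):
        for target in sorted(views):
            if start != target:
                plan = shortest_action_plan(graph, start, target)
                if plan is not None:
                    tasks.append((start, target, plan))
    return tasks


# Hypothetical toy scene: two short trajectories, neither of which reaches "D" from "A".
traj_1 = [("A", "forward", "B"), ("B", "turn_left", "C")]
traj_2 = [("C", "forward", "D")]
graph = build_view_graph([traj_1, traj_2])
print(shortest_action_plan(graph, "A", "D"))  # ['forward', 'turn_left', 'forward']
print(len(distill_tasks(graph)))              # 6
```

In this toy case neither trajectory alone links A to D, but the merged graph yields a three-step plan. That is the sense in which failed rollouts can still contribute supervision.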

If this is right

The performance gap between single-step view-action knowledge and multi-turn composition increases steadily with viewpoint distance.
Distilling the full set of exploration trajectories into supervised tasks reshapes the policy distribution and overcomes the sparse-reward problem without environment-specific tuning.
Self-exploration followed by graph distillation provides a route for VLMs to acquire active 3D reasoning and planning.
The same iterative alternation between exploration and distillation can be repeated to further improve planning skill.
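
The first implication above is a per-distance comparison. A minimal sketch of that bookkeeping follows, assuming each evaluation episode is recorded as a (distance, succeeded) pair. The distance unit, the bin width, and the sample episodes are placeholders, not values from the paper.

```python
from collections import defaultdict


def success_by_distance(episodes, bin_width=1.0):
    """Group episodes into distance bins and return {bin_start: success_rate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for distance, succeeded in episodes:
        bin_start = (distance // bin_width) * bin_width
        totals[bin_start] += 1
        wins[bin_start] += int(succeeded)
    return {b: wins[b] / totals[b] for b in sorted(totals)}


# Placeholder episodes for illustration only.
episodes = [(0.5, True), (0.7, True), (1.2, True), (1.8, False), (2.5, False)]
print(success_by_distance(episodes))  # {0.0: 1.0, 1.0: 0.5, 2.0: 0.0}
```

Running the same function on single-step and multi-turn episodes and subtracting the two rates per bin would give the gap as a function of distance.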

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The view-graph approach could be tested on other sparse-reward embodied tasks such as object manipulation or navigation where only outcome data is cheap to collect.
Because the graph is built from all trajectories, the method may reduce the need for hand-curated demonstration datasets in training planning models.
The same distillation step might allow transfer of planning skill to new scenes that share similar viewpoint connectivity patterns.

Load-bearing premise

Trajectories collected during self-exploration, regardless of success, collectively form a view graph that compactly and unbiasedly captures how viewpoints connect across a scene.

What would settle it

Training the same model on only successful trajectories, or with standard reinforcement learning, and finding that the distilled view-graph version performs no better than either baseline.

Original abstract

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and Data are at https://viewsuite.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a concrete training loop that lifts a 7B VLM's multi-step 3D view planning from 2.5% to 47.8% on a new ScanNet benchmark by distilling self-exploration trajectories into a view graph, though the graph may simply echo the base policy's short-horizon bias.

The main thing here is a new benchmark called ViewSuite plus an iterative loop that turns every self-exploration trajectory into a view graph and distills it into supervised tasks. This reportedly moves Qwen2.5-VL-7B from 2.5% to 47.8% success on interactive view planning and beats GPT-5.4 Pro and Gemini 3.1 Pro. The diagnosis that current VLMs know single view changes but cannot compose them over distance is clear from the numbers they report.

The work does two things well. It defines a clean task that sits between single-step action prediction and full robotics planning, and it ships code and data so others can check the numbers. The idea of using failed trajectories anyway to build connectivity is straightforward and avoids the usual sparse-reward dead end in pure RL.

The soft spot is exactly the one in the stress-test note. Because the starting policy already fails at long-horizon composition, the trajectories it collects are likely short and myopic. Nothing in the abstract shows that the resulting graph covers distant viewpoints or is unbiased relative to the true scene geometry. If the graph is mostly local connections the base model already understood, then the big jump could be an artifact of how the supervised tasks were constructed rather than evidence of better planning. I would want to see coverage statistics on the graph, comparisons to random or oracle trajectories, and whether the gains survive when the base model is already stronger.

This is aimed at people working on VLM agents for 3D spatial reasoning. It is concrete enough and ships artifacts, so it deserves a serious referee. I would send it out, but with a request for the graph-quality checks above.

Referee Report

3 major / 3 minor

Summary. The paper introduces ViewSuite, a 3D point-cloud benchmark on ScanNet scenes to evaluate VLMs on view planning: single-step view-action understanding versus multi-turn composition to reach target views. It reports that 13 frontier VLMs exhibit basic view-action knowledge but fail at composition, with the gap increasing with viewpoint distance. The authors propose an iterative self-exploration plus view-graph-distillation framework whose key claim is that all trajectories (success or failure) form a compact, unbiased view graph that can be distilled into supervised tasks to overcome sparse rewards; this raises Qwen2.5-VL-7B from 2.5% to 47.8% success, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).

Significance. If the central empirical result holds after addressing coverage and bias concerns, the work would demonstrate a scalable, self-supervised route to improve long-horizon 3D planning in VLMs without external labels or hand-crafted rewards, with potential relevance to embodied reasoning and navigation.

major comments (3)

[Abstract / §3] Abstract and §3 (framework description): the load-bearing claim that "all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene" is not accompanied by any quantitative verification that the graph covers long-range connections. Because trajectories are generated by the same initial policy shown to fail at multi-turn composition, the resulting graph may inherit the myopic distribution; without coverage metrics (e.g., fraction of distant viewpoint pairs connected) the 2.5%→47.8% gain cannot be confidently attributed to improved planning rather than dataset construction artifacts.
[§4] §4 (experiments) and associated tables: the headline numbers (Qwen2.5-VL-7B 47.8%, GPT-5.4 Pro 18.5%, Gemini 3.1 Pro 21.4%) are reported without error bars, number of scenes, statistical tests, or controls for prompt variation and scene selection. Without these controls it is impossible to assess whether the claimed outperformance is robust.
[§3 / §4] §3 and §4: no ablation isolates the contribution of view-graph distillation from other factors in the iterative loop (e.g., simply collecting more trajectories or changing the supervised-task distribution). Without such controls the causal link between the proposed mechanism and the performance jump remains unverified.
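
The coverage metric suggested in the first major comment could be computed roughly as follows. This is a sketch under stated assumptions: every view has a known 3D position (the `positions` dict is hypothetical), "distant" means Euclidean distance at or above a chosen threshold, and "connected" means reachable by following directed edges in the view graph.

```python
import math
from collections import deque
from itertools import permutations


def reachable_from(adjacency, start):
    """Return the set of views reachable from start via directed edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for nxt in adjacency.get(view, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def long_range_coverage(adjacency, positions, min_distance):
    """Fraction of ordered view pairs at least min_distance apart that the graph connects."""
    reach = {view: reachable_from(adjacency, view) for view in positions}
    distant = connected = 0
    for a, b in permutations(positions, 2):
        if math.dist(positions[a], positions[b]) >= min_distance:
            distant += 1
            connected += b in reach[a]
    return connected / distant if distant else float("nan")


# Hypothetical scene: four views on a line, with "D" never visited by any trajectory.
positions = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (2, 0, 0), "D": (3, 0, 0)}
adjacency = {"A": ["B"], "B": ["C"], "C": []}
print(long_range_coverage(adjacency, positions, min_distance=2.0))
```

A low value at large thresholds would support the concern that the graph inherits the base policy's short-horizon bias. A high value would count against it.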

minor comments (3)

[Abstract] Abstract: missing spaces after parenthetical numbers: "(1)understanding" and "(2)composing".
[Abstract / §4] Model nomenclature: "GPT-5.4 Pro" and "Gemini 3.1 Pro" are non-standard; clarify exact versions or checkpoints used.
The paper provides a code/data link, which is a positive reproducibility practice.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The three major comments identify important gaps in verification, statistical rigor, and causal attribution. We respond to each point below and indicate where revisions will be made.

Referee: [Abstract / §3] Abstract and §3 (framework description): the load-bearing claim that "all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene" is not accompanied by any quantitative verification that the graph covers long-range connections. Because trajectories are generated by the same initial policy shown to fail at multi-turn composition, the resulting graph may inherit the myopic distribution; without coverage metrics (e.g., fraction of distant viewpoint pairs connected) the 2.5%→47.8% gain cannot be confidently attributed to improved planning rather than dataset construction artifacts.

Authors: We agree that explicit quantitative coverage metrics are needed to substantiate the claim that the view graph captures long-range connections beyond the initial policy's distribution. In the revised manuscript we will add (i) the fraction of viewpoint pairs at varying distances that become connected after each iteration, (ii) a comparison of path-length distributions in the distilled graph versus the raw trajectories, and (iii) a visualization of the largest connected component across ScanNet scenes. These additions will allow readers to assess whether the performance gain stems from improved coverage rather than artifacts.

revision: yes

Referee: [§4] §4 (experiments) and associated tables: the headline numbers (Qwen2.5-VL-7B 47.8%, GPT-5.4 Pro 18.5%, Gemini 3.1 Pro 21.4%) are reported without error bars, number of scenes, statistical tests, or controls for prompt variation and scene selection. Without these controls it is impossible to assess whether the claimed outperformance is robust.

Authors: We acknowledge the absence of these statistical controls. The revised version will report: the exact number of ScanNet scenes and evaluation episodes, standard deviations or confidence intervals over multiple random seeds, results of paired statistical tests (e.g., McNemar or Wilcoxon), and an additional control experiment that varies prompt phrasing while keeping the model fixed. These changes will make the robustness of the 47.8% result verifiable.

revision: yes

Referee: [§3 / §4] §3 and §4: no ablation isolates the contribution of view-graph distillation from other factors in the iterative loop (e.g., simply collecting more trajectories or changing the supervised-task distribution). Without such controls the causal link between the proposed mechanism and the performance jump remains unverified.

Authors: We agree that an ablation isolating view-graph distillation is required. In the revision we will add a controlled experiment that (a) continues self-exploration for the same total number of trajectories without distillation and (b) replaces the distilled tasks with a uniform random sampling of the same trajectory data. Performance differences between these baselines and the full view-graph-distillation pipeline will be reported to establish the specific contribution of the distillation step.

revision: yes
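
For the paired test mentioned in the second response, an exact two-sided McNemar test needs only the counts of episodes on which exactly one of the two models succeeds. The sketch below assumes both models are evaluated on the same episodes in the same order. The outcome lists are placeholders, not results from the paper.

```python
from math import comb


def discordant_counts(results_a, results_b):
    """Count episodes solved only by model A (b) and only by model B (c)."""
    b = sum(1 for x, y in zip(results_a, results_b) if x and not y)
    c = sum(1 for x, y in zip(results_a, results_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from a Binomial(b + c, 0.5) null."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder per-episode outcomes (True = target view reached).
model_a = [True, True, True, False, True, True, False, True]
model_b = [False, True, False, False, False, True, False, False]
b, c = discordant_counts(model_a, model_b)
print(b, c, mcnemar_exact(b, c))  # 4 0 0.125
```

The same paired design also applies to the ablation in the third response, comparing the full pipeline against each baseline on identical episodes.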

Circularity Check

0 steps flagged

No circularity; empirical framework with no derivations

The paper presents an empirical study of VLMs on view planning in a 3D environment, identifying a performance gap and proposing an iterative self-exploration plus view-graph-distillation framework. No equations, mathematical derivations, fitted parameters, or first-principles results appear in the abstract or description. The reported gains (e.g., 2.5% to 47.8%) are described as experimental outcomes of training on tasks distilled from collected trajectories, not as quantities defined in terms of those same trajectories by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the view graph is described as emerging from exploration trajectories rather than postulated as a new entity.

pith-pipeline@v0.9.1-grok · 5811 in / 1375 out tokens · 25708 ms · 2026-06-29T07:51:37.271836+00:00 · methodology
