SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Aman Chadha; Amitava Das; Niyati Rawal; Saksham Jain; Shah Alam Abir; Suranjana Trivedy; Sushant Ravva; Vinija Jain

arxiv: 2605.10376 · v1 · submitted 2026-05-11 · 💻 cs.CV

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Niyati Rawal , Sushant Ravva , Shah Alam Abir , Saksham Jain , Aman Chadha , Vinija Jain , Suranjana Trivedy , Amitava Das This is my paper

Pith reviewed 2026-05-12 04:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language modelsvision-language navigationspatial reasoning3D scene understandinginstruction followingtrajectory predictionembodied reasoningbenchmarks

0 comments

The pith

SleepWalk benchmark shows current vision-language models systematically fail at grounded spatial reasoning for instructions in 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SleepWalk as a benchmark that tests whether vision-language models can turn natural-language instructions into trajectories that respect 3D scene geometry, avoid collisions, and end at suitable action spots. Tasks are split into three tiers of rising spatial and temporal difficulty across indoor and outdoor scenes generated from text and filtered for navigability. Evaluation of frontier models on thousands of environments reveals that outputs often lose spatial coherence or alignment under occlusion, interaction limits, or multi-step commands, with results worsening at higher tiers. This setup matters because it isolates failures in the kind of localized embodied reasoning needed for agents that act on language in physical spaces.

Core claim

SleepWalk organizes instruction-grounded trajectory prediction into three tiers of spatial and temporal difficulty in single-scene 3D worlds. Given rendered views and a language instruction, a model must output a trajectory that stays geometrically valid, collision-free, and action-compatible. Tests across 2,472 scenes and nine instructions per scene show systematic shortfalls in grounded reasoning that grow with difficulty, while models retain partial ability to produce coherent and executable paths.

What carries the argument

The three-tier SleepWalk benchmark that generates navigable 3D scenes from text descriptions and scores predicted trajectories for geometric validity, executability, and instruction alignment via a pointwise judge protocol.

If this is right

Accuracy declines as tasks move from simple spatial steps to occluded or multi-step instructions.
Models frequently violate scene geometry or interaction constraints even when trajectories remain broadly plausible.
The benchmark isolates localized interaction-centric failures that differ from long-range exploration benchmarks.
Partial success on basic tiers indicates models can handle some spatial coherence but not under compositional load.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results point to a need for training regimes that explicitly reward 3D geometry awareness rather than language-visual matching alone.
Extending the benchmark to dynamic scenes with moving objects could test whether current failures stem from static layout understanding.
Success on SleepWalk might serve as a proxy signal for improving real-robot instruction following in cluttered indoor settings.
The tiered structure offers a template for diagnosing similar grounding gaps in other multimodal planning tasks.

Load-bearing premise

Scenes created from textual descriptions and filtered for navigability form an unbiased stand-in for real-world spatial challenges, and the pointwise judge accurately measures how well trajectories match instructions.

What would settle it

Running the same models on physically captured real-world 3D scans or comparing pointwise judge scores against direct human ratings of trajectory quality would show whether the synthetic scenes and evaluation protocol produce the claimed failure patterns.

Figures

Figures reproduced from arXiv: 2605.10376 by Aman Chadha, Amitava Das, Niyati Rawal, Saksham Jain, Shah Alam Abir, Suranjana Trivedy, Sushant Ravva, Vinija Jain.

**Figure 1.** Figure 1: Overview of SleepWalk. A language instruction is converted into a single-scene 3D environment using Hunyuan3D-3.0. For each scene, given language instructions, different VLMs predict trajectories, which are visualized using top-down views. A fixed judge model scores trajectories in a pointwise manner, and rankings are aggregated across environments. clusion, respect environmental constraints, and generate … view at source ↗

**Figure 2.** Figure 2: SleepWalk pipeline. Starting from a natural-language scene description, we reconstruct a single-scene 3D environment with Hunyuan3D-3.0, render top-down and oblique observations, and use Qwen3-8B-VL to generate tiered navigation instructions (easy, medium, hard). Given the scene views and an instruction, a VLM predicts a continuous action trajectory, which is then evaluated by a judge model (GPT-5-mini) us… view at source ↗

**Figure 3.** Figure 3: Qualitative trajectory comparison across difficulty levels. Column (a) shows the reconstructed scene view, while columns (b), (c), and (d) show trajectory predictions from Qwen3-VL, Gemini Robotics ER-1.5, and GPT-5-mini, respectively. Across levels, the figure highlights differences in startlocation grounding, goal accuracy, obstacle avoidance, and robustness under compositional instructions. fails on b… view at source ↗

**Figure 4.** Figure 4: Factor-wise performance across difficulty tiers. Comparison of Qwen3-VL, Gemini Robotics ER-1.5, and GPT-5-mini on four evaluation factors—start location, goal location, obstacle avoidance, and trajectory efficiency—across easy, medium, and hard tiers. 3.2 Overall Model Ranking [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: From predicted path to humanoid execution. We animate the trajectory generated by GPT-5-mini for the instruction “Walk from the bookshelf to the wall-mounted lamp” and “Approach the yellow spherical object and then move to the northern tree” using TLControl (Wan et al., 2024) and MotionGPT (Jiang et al., 2024). This provides a qualitative check of whether a path that appears correct in top-down space rema… view at source ↗

read the original abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increase. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SleepWalk adds a tiered single-scene benchmark that shows VLMs dropping on harder spatial instructions, but the text-generated scenes and judge need more checks to back the claims.

read the letter

The paper's core contribution is a new benchmark called SleepWalk that tests VLMs on predicting executable trajectories from instructions inside single 3D scenes. It splits tasks into three tiers of increasing difficulty around occlusion, interaction constraints, and multi-step reasoning, then reports that performance falls as the tiers get harder across 2472 scenes and three models. That trend is the main takeaway for anyone working on grounded navigation.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces SleepWalk, a benchmark for instruction-guided vision-language navigation consisting of 2,472 single-scene 3D environments generated from textual descriptions and filtered for navigability. Tasks are organized into three tiers of increasing spatial and temporal difficulty, with models required to predict trajectories from rendered observations and natural-language instructions that respect geometry, avoid collisions, and align with actions. Three frontier VLMs are evaluated using a standardized pointwise judge protocol, revealing performance drops with difficulty and systematic failures in grounded spatial reasoning under occlusion, interaction constraints, and multi-step instructions.

Significance. If the synthetic scenes and judge protocol prove representative, the benchmark would offer a scalable, controlled testbed for localized embodied reasoning in VLMs, complementing long-range exploration benchmarks. The tiered design and large scale (2,472 scenes, nine instructions each) enable fine-grained diagnosis of failures, which could usefully guide development of action-capable agents. The work explicitly provides a reproducible evaluation protocol and highlights gaps in current models' spatial grounding.

major comments (3)

[§3.1] §3.1 (Scene Generation): The description of how 3D scenes are constructed from textual descriptions and the exact navigability filtering criteria are insufficiently detailed. This is load-bearing for the central claim, as the reported systematic failures in occlusion handling and interaction constraints could arise from simplified geometry or language-visual correlations in the generated scenes rather than genuine model limitations.
[§4.2] §4.2 (Evaluation Protocol): The pointwise judge implementation is presented without human correlation studies, inter-annotator agreement, or ablation against alternative judges. This directly affects interpretation of the performance drops across tiers (e.g., in the results tables), since the judge may share VLM-like biases in scoring trajectory alignment.
[Results] Results (e.g., Table 2 and tier-wise analysis): While drops in success metrics with increasing difficulty are shown, no control baselines (random trajectories, language-only predictors, or non-embodied VLMs) are reported. This weakens the claim that current VLMs 'can somewhat produce' coherent trajectories, as the absolute levels of failure cannot be contextualized.

minor comments (3)

[Abstract] The abstract states 'nine instructions per scene' without specifying generation or selection criteria, which would aid reproducibility.
[Figures] Figure captions for tier examples could more explicitly label the specific spatial challenges (occlusion, constraints) being tested in each panel.
[§3] Notation for trajectory representation (e.g., sequence of actions or waypoints) is introduced but could be formalized earlier with a clear equation or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to improve reproducibility, evaluation rigor, and contextualization of results.

read point-by-point responses

Referee: [§3.1] §3.1 (Scene Generation): The description of how 3D scenes are constructed from textual descriptions and the exact navigability filtering criteria are insufficiently detailed. This is load-bearing for the central claim, as the reported systematic failures in occlusion handling and interaction constraints could arise from simplified geometry or language-visual correlations in the generated scenes rather than genuine model limitations.

Authors: We agree that greater detail on scene generation is needed for full reproducibility and to rule out artifacts. In the revised manuscript, we will expand §3.1 with the complete textual prompt templates, the precise navigability filtering criteria (including minimum path length, collision checks via the 3D engine, and interaction constraints), and quantitative scene statistics such as average object density, occlusion rates, and geometric complexity measures. These additions will clarify that the environments incorporate realistic 3D structure and support our interpretation of model failures as stemming from limitations in spatial grounding. revision: yes
Referee: [§4.2] §4.2 (Evaluation Protocol): The pointwise judge implementation is presented without human correlation studies, inter-annotator agreement, or ablation against alternative judges. This directly affects interpretation of the performance drops across tiers (e.g., in the results tables), since the judge may share VLM-like biases in scoring trajectory alignment.

Authors: We acknowledge the value of human validation for the judge protocol. The original design prioritized scalability for the 22,000+ evaluations. In revision, we will augment §4.2 with fuller implementation details, any available pilot inter-annotator agreement figures, and an ablation against alternative metrics (e.g., geometric overlap or action-sequence similarity). We will also add an explicit discussion of potential judge biases and limitations. A comprehensive new human correlation study is resource-intensive but we will pursue a targeted pilot if feasible for the camera-ready version. revision: partial
Referee: [Results] Results (e.g., Table 2 and tier-wise analysis): While drops in success metrics with increasing difficulty are shown, no control baselines (random trajectories, language-only predictors, or non-embodied VLMs) are reported. This weakens the claim that current VLMs 'can somewhat produce' coherent trajectories, as the absolute levels of failure cannot be contextualized.

Authors: We agree that control baselines would strengthen interpretation of the absolute performance levels and the tier-wise trends. In the revised manuscript, we will add results for random trajectory sampling, language-only predictors (visual input ablated), and non-embodied VLM baselines. These will be integrated into Table 2 and the tier-wise analysis to better contextualize the claim that current VLMs can produce somewhat coherent trajectories while exposing their specific shortcomings in grounded spatial reasoning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external models and new scenes

full rationale

The paper introduces SleepWalk as a new benchmark consisting of 2,472 text-generated 3D scenes filtered for navigability, with tasks organized into three difficulty tiers. It evaluates three external frontier VLMs using a pointwise judge protocol and reports observed performance drops under occlusion, interaction constraints, and multi-step instructions. No equations, fitted parameters, self-defined quantities, or derivation chains are present; the central claims are direct empirical measurements on independent VLMs and scenes rather than any reduction of results to prior self-generated inputs by construction. Self-citations, if present, are not load-bearing for the reported failures.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The paper introduces a new benchmark rather than deriving a result from equations, so it rests on domain assumptions about VLM prompting and scene realism rather than free parameters or new physical entities.

axioms (2)

domain assumption Textual scene descriptions can be converted into 3D environments that are sufficiently realistic for testing spatial navigation after navigability filtering.
Invoked in the benchmark construction described in the abstract.
domain assumption A pointwise judge-based protocol can reliably assess whether predicted trajectories are spatially coherent, executable, and instruction-aligned.
Central to the standardized evaluation protocol stated in the abstract.

invented entities (1)

SleepWalk three-tier benchmark no independent evidence
purpose: To enable fine-grained stress-testing of instruction-grounded trajectory prediction in 3D scenes
Newly defined in this work with no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5597 in / 1602 out tokens · 66821 ms · 2026-05-12T04:23:01.320191+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability... three tiers of spatial and temporal difficulty... pointwise judge-based evaluation protocol
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increase.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

a single-scene, interaction-centric benchmark setting for grounded trajectory reasoning,

work page
[2]

a scalable construction pipeline that reconstructs 3D scenes, generates tiered instructions, and evaluates predicted paths, and

work page
[3]

Why this matters

an empirical diagnosis showing that current frontier VLMs degrade substantially on composi- tional, interaction-heavy, and multi-step instructions. Why this matters. The paper should not be read as claiming that any one component—for exam- ple, the judge or the scene generator—is the sole novelty. Rather, the novelty lies in defining and instantiating a b...

work page
[4]

Start Location Consistency,

work page
[5]

Obstacle Avoidance, and

work page
[6]

Trajectory Efficiency. This means the evaluation is designed to reflect not only where a model ends up, but whether it gets there through a path that is spatially plausible and aligned with the intended action. What the paper does and does not claim. The manuscript does not claim that judge-based eval- uation is perfect. It makes the narrower methodologic...

work page
[7]

the benchmark differentiates strong models,

work page
[8]

the revealed failures are systematic rather than trivial, and

work page
[9]

current VLMs are bad at navi- gation

performance degrades under higher compositional and interaction demands. What the results already show. Among the evaluated models, GPT-5-mini performs best across all four reported factors, but even the strongest system degrades as the tasks become harder. This supports the benchmark’s central diagnostic claim. Takeaway. Broader model coverage would cert...

work page
[10]

Image 1: Oblique View (Perspective)

work page
[11]

Image 2: Top-Down View (Map) ### Critical Pre-Condition Analyze the Top-Down View first. If the Top-Down view is incomplete, obstructed, or not clearly visible, you must ignore all other instructions and output ONLY this exact phrase:,→ ”The view is not clear to generate instructions” --- ### Phase 1: Environment Analysis If the views are clear, analyze b...

work page
[12]

Object Grounding: - Every START and END location must refer to a specific, visible object found in the images.,→ - Correct: ”Start: Near the red chair” - Incorrect: ”Start: Near the wall” or ”Start: At the start point”

work page
[13]

- BANNED WORDS: left, right, front, back, center, centre, middle, top, bottom, upper, lower.,→ - Tasks must be valid regardless of the agent 's facing direction

Forbidden Terminology (Spatial Hallucination): - NEVER use viewpoint-dependent or relative directional terms. - BANNED WORDS: left, right, front, back, center, centre, middle, top, bottom, upper, lower.,→ - Tasks must be valid regardless of the agent 's facing direction

work page
[14]

Only reference items clearly visible in the provided images

Object Consistency: - Do not hallucinate objects. Only reference items clearly visible in the provided images. --- ### Output Format Step 1: Provide an *ENVIRONMENT SUMMARY* (2-3 sentences describing the room type and listing 10-15 key visible objects).,→ Step 2: Output the task list following this exact schema: LEVEL_1 | TASK: <instruction> | START: Near...

work page

[1] [1]

a single-scene, interaction-centric benchmark setting for grounded trajectory reasoning,

work page

[2] [2]

a scalable construction pipeline that reconstructs 3D scenes, generates tiered instructions, and evaluates predicted paths, and

work page

[3] [3]

Why this matters

an empirical diagnosis showing that current frontier VLMs degrade substantially on composi- tional, interaction-heavy, and multi-step instructions. Why this matters. The paper should not be read as claiming that any one component—for exam- ple, the judge or the scene generator—is the sole novelty. Rather, the novelty lies in defining and instantiating a b...

work page

[4] [4]

Start Location Consistency,

work page

[5] [5]

Obstacle Avoidance, and

work page

[6] [6]

Trajectory Efficiency. This means the evaluation is designed to reflect not only where a model ends up, but whether it gets there through a path that is spatially plausible and aligned with the intended action. What the paper does and does not claim. The manuscript does not claim that judge-based eval- uation is perfect. It makes the narrower methodologic...

work page

[7] [7]

the benchmark differentiates strong models,

work page

[8] [8]

the revealed failures are systematic rather than trivial, and

work page

[9] [9]

current VLMs are bad at navi- gation

performance degrades under higher compositional and interaction demands. What the results already show. Among the evaluated models, GPT-5-mini performs best across all four reported factors, but even the strongest system degrades as the tasks become harder. This supports the benchmark’s central diagnostic claim. Takeaway. Broader model coverage would cert...

work page

[10] [10]

Image 1: Oblique View (Perspective)

work page

[11] [11]

Image 2: Top-Down View (Map) ### Critical Pre-Condition Analyze the Top-Down View first. If the Top-Down view is incomplete, obstructed, or not clearly visible, you must ignore all other instructions and output ONLY this exact phrase:,→ ”The view is not clear to generate instructions” --- ### Phase 1: Environment Analysis If the views are clear, analyze b...

work page

[12] [12]

Object Grounding: - Every START and END location must refer to a specific, visible object found in the images.,→ - Correct: ”Start: Near the red chair” - Incorrect: ”Start: Near the wall” or ”Start: At the start point”

work page

[13] [13]

- BANNED WORDS: left, right, front, back, center, centre, middle, top, bottom, upper, lower.,→ - Tasks must be valid regardless of the agent 's facing direction

Forbidden Terminology (Spatial Hallucination): - NEVER use viewpoint-dependent or relative directional terms. - BANNED WORDS: left, right, front, back, center, centre, middle, top, bottom, upper, lower.,→ - Tasks must be valid regardless of the agent 's facing direction

work page

[14] [14]

Only reference items clearly visible in the provided images

Object Consistency: - Do not hallucinate objects. Only reference items clearly visible in the provided images. --- ### Output Format Step 1: Provide an *ENVIRONMENT SUMMARY* (2-3 sentences describing the room type and listing 10-15 key visible objects).,→ Step 2: Output the task list following this exact schema: LEVEL_1 | TASK: <instruction> | START: Near...

work page