
arxiv: 2603.02854 · v2 · submitted 2026-03-03 · 💻 cs.RO · cs.AI

CoFL: Continuous Flow Fields for Language-Conditioned Navigation

Pith reviewed 2026-05-15 17:14 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords continuous flow fields · language-conditioned navigation · bird's-eye view · robot navigation · end-to-end policy · semantic maps · unseen scenes · real-time control

The pith

CoFL learns continuous flow fields from bird's-eye views and language instructions to navigate unseen scenes more precisely than trajectory predictors or modular planners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CoFL is an end-to-end policy that takes a bird's-eye view observation and a language instruction to produce a continuous flow field across the workspace. Rather than predicting one trajectory from the current start position, it learns local motion vectors at every location, turning each scene-instruction pair into dense spatial supervision. Paths are generated by numerical integration of the field from any starting point, which supports real-time rollout and closed-loop correction. On strictly unseen scenes from Matterport3D and ScanNet, the method outperforms both vision-language model planners and direct trajectory policies in success rate, precision, and safety while running in real time. The same trained model transfers zero-shot to physical robot experiments across multiple layouts with high success.

Core claim

CoFL reformulates language-conditioned navigation as workspace-conditioned field learning: it maps any bird's-eye view location to a motion vector conditioned on the language instruction, so that each scene-instruction annotation supplies dense supervision instead of a single start-conditioned rollout. Trajectories are recovered from arbitrary starts by integrating the predicted field, enabling simple real-time closed-loop control and recovery from deviations.

What carries the argument

The continuous flow field, which assigns a local motion vector to every point in the bird's-eye view workspace conditioned on the language instruction and is used to generate paths by numerical integration.
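The integration step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_field` is a hypothetical stand-in for the learned network (it simply points at a fixed goal), and the forward-Euler scheme, step size, and stopping tolerance are assumptions chosen for readability.

```python
import math

def toy_field(x, y, goal=(0.8, 0.5)):
    """Hypothetical stand-in for the learned field: a vector pointing at the
    goal whose magnitude equals the remaining distance."""
    return goal[0] - x, goal[1] - y

def rollout(field, start, step=0.1, max_steps=200, tol=1e-3):
    """Forward-Euler integration of dx/dt = field(x) from an arbitrary start."""
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        vx, vy = field(x, y)
        if math.hypot(vx, vy) < tol:  # field vanishes near the goal
            break
        x, y = x + step * vx, y + step * vy
        path.append((x, y))
    return path

# The same field yields a path from any start, with no re-prediction needed.
path_a = rollout(toy_field, (0.1, 0.1))
path_b = rollout(toy_field, (0.9, 0.9))
```

Because the field is defined everywhere, a robot pushed off its path can keep integrating from its new position, which is the closed-loop recovery property the summary refers to.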

If this is right

  • Each scene-instruction annotation now supplies dense spatial supervision rather than a single trajectory, increasing training signal per example.
  • Numerical integration of the field from any start enables closed-loop recovery without retraining.
  • The policy achieves higher navigation precision and safety than modular VLM planners or start-conditioned trajectory generators on unseen scenes.
  • Real-time inference is preserved while supporting zero-shot transfer to physical robot deployments.
  • The same field representation works across multiple room layouts without scene-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower annotation costs by replacing manual trajectory labels with procedural flow fields from existing semantic maps.
  • Similar field-based representations might apply to other robotics tasks such as manipulation or multi-agent coordination where continuous guidance is useful.
  • Online fusion with live semantic mapping could allow the flow field to adapt to dynamic changes without full retraining.
  • The integration-based rollout naturally provides a mechanism for uncertainty-aware planning by sampling multiple paths through the field.

Load-bearing premise

Flow fields procedurally derived from semantic maps in simulation supply accurate and generalizable supervision for arbitrary language instructions when the system is deployed in real physical environments.

What would settle it

A controlled real-world trial in which the robot, following the integrated flow field, deviates from the language-specified goal or collides in a novel layout where the simulation-derived semantic map no longer matches physical geometry.

Figures

Figures reproduced from arXiv: 2603.02854 by Haokun Liu, Jinjie Li, Masaki Kitagawa, Moju Zhao, Wentao Zhang, Yicheng Chen, Zhaoqi Ma, Zicen Xiong.

Figure 1: Overview of the main contributions. (a) An automated annotation pipeline that constructs a large-scale BEV image–
Figure 2: Overview of the CoFL’s network architecture. Given an RGB BEV observation
Figure 4: Overview of the proposed visual observation generation
Figure 5: Overview of the procedural annotation pipeline. We
Figure 6: Overview of the training/validation split. We employ a
Figure 7: Examples of trajectories on validation set. (a) Matterport3D:
Figure 8: Latency–performance trade-offs of real-time models (DP-family and CoFL). DP-family:
Figure 9: Example of query distributions (in yellow) for the
Figure 10: Example trajectories from real-world experiments. (a)–(d)
Figure 11: Four collision trials drawn from E2 as examples. Trajectories are colored by elapsed time from start to end. The static
Figure 12: More examples of trajectories on validation set. From left to right: ground truth, pure VLM/VLM+Planner, DP-family and CoFL. (a-b) ScanNet: “please walk to the door” and “can you approach the bathtub”; (c-d) ScanNet: “go to ahead of the table” and “can you approach right side of the table”; (e-f) Matterport3D: “can you proceed to back of the chair in the lower right” and “please travel to the chair in the…
Figure 13: Effect of pseudo receding-horizon segmentation n on DP (DP-T-DDPM, 5 denoising steps, fixed seed 42). The red horizontal line indicates CoFL with a fixed 100 × 100 query grid.
a) DP-DDPM (stochastic reverse diffusion): Let ∆X̃ ∈ R^(T×2) denote the full displacement sequence of length T (each step a d-DoF displacement), and diffusion is performed on this entire sequence. We initialize ∆X̃ Ndiff ∼ N(0, I) an…
Abstract

Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene--instruction annotation mainly to supervise one start-conditioned rollout. To address these limitations, we present CoFL, an end-to-end policy that maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. CoFL reformulates navigation as workspace-conditioned field learning rather than start-conditioned trajectory prediction: it learns local motion vectors at arbitrary BEV locations, turning each scene--instruction annotation into dense spatial control supervision. Trajectories are generated from any start by numerical integration of the predicted field, enabling simple real-time rollout and closed-loop recovery. To enable large-scale training and evaluation, we build a dataset of over 500k BEV image--instruction pairs, each procedurally annotated with a flow field and a trajectory derived from semantic maps built on Matterport3D and ScanNet. Evaluating on strictly unseen scenes, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and trajectory generation policies in both navigation precision and safety, while maintaining real-time inference. Finally, we deploy CoFL zero-shot in real-world experiments with BEV observations across multiple layouts, maintaining feasible closed-loop control and a high success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces CoFL, an end-to-end policy that maps bird's-eye-view (BEV) observations and language instructions to continuous flow fields for navigation. It reformulates the problem as learning local motion vectors at arbitrary BEV locations rather than start-conditioned trajectories, using numerical integration to generate paths. The approach is trained on over 500k procedurally generated BEV-instruction pairs derived from semantic maps on Matterport3D and ScanNet. The central claim is that CoFL significantly outperforms modular VLM-based planners and trajectory-generation policies on strictly unseen scenes in navigation precision and safety, supports real-time inference, and succeeds in zero-shot real-world closed-loop deployment.

Significance. If the performance claims hold under detailed scrutiny, the reformulation to dense flow-field supervision could improve generalization and closed-loop recovery compared to sparse trajectory supervision. The scale of the procedurally generated dataset is a notable strength for training. However, the absence of quantitative metrics, error bars, or ablation details in the abstract limits assessment of practical impact on embodied navigation.

major comments (3)
  1. [Abstract] Abstract: the claim of significant outperformance on unseen scenes in precision and safety is stated without any quantitative metrics, success rates, error bars, baseline implementations, or ablation studies, preventing verification of the central empirical result.
  2. [Real-world experiments] Real-world experiments section: the zero-shot deployment reports feasible closed-loop control and high success rate across layouts, but provides no quantitative metrics, description of the real BEV pipeline, or handling of sensor noise and semantic labeling errors, leaving the sim-to-real transfer assumption untested.
  3. [Methods] Methods: the procedural flow fields are derived from clean simulation semantic maps to provide dense supervision, yet no ablation on perception noise, labeling errors, or BEV quality is reported; this directly bears on whether the learned field remains accurate for arbitrary instructions under real-world conditions.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'workspace-conditioned field learning' is introduced without a brief definition or pointer to related field-based navigation literature, which could aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the empirical presentation and robustness analysis.

  1. Referee: [Abstract] Abstract: the claim of significant outperformance on unseen scenes in precision and safety is stated without any quantitative metrics, success rates, error bars, baseline implementations, or ablation studies, preventing verification of the central empirical result.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics such as success rates on unseen scenes, navigation error reductions relative to baselines, and error bars to support the claims of outperformance in precision and safety. revision: yes

  2. Referee: [Real-world experiments] Real-world experiments section: the zero-shot deployment reports feasible closed-loop control and high success rate across layouts, but provides no quantitative metrics, description of the real BEV pipeline, or handling of sensor noise and semantic labeling errors, leaving the sim-to-real transfer assumption untested.

    Authors: We acknowledge the need for more detail. We will expand the real-world section to report quantitative success rates and failure statistics across trials, provide a description of the real BEV observation pipeline, and discuss mitigation strategies for sensor noise and semantic labeling errors to better substantiate the sim-to-real transfer. revision: yes

  3. Referee: [Methods] Methods: the procedural flow fields are derived from clean simulation semantic maps to provide dense supervision, yet no ablation on perception noise, labeling errors, or BEV quality is reported; this directly bears on whether the learned field remains accurate for arbitrary instructions under real-world conditions.

    Authors: This is a valid concern. We will add an ablation study in the revised manuscript that introduces controlled perception noise and labeling errors into the BEV inputs and reports the resulting degradation in flow-field accuracy and downstream navigation performance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out scenes


The paper's core claim is an empirical performance comparison: a policy trained on procedurally generated flow-field supervision from semantic maps is evaluated on strictly unseen scenes and real-world zero-shot deployment. The flow fields serve as training targets derived from external semantic maps (Matterport3D/ScanNet), not as a self-referential definition or fitted parameter renamed as prediction. No equations reduce the output trajectory to the input by construction, no load-bearing self-citations justify uniqueness, and no ansatz is smuggled via prior work. The derivation chain is a standard supervised learning pipeline whose success is measured by external benchmarks rather than internal equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that semantic maps yield valid dense flow supervision and that the learned field generalizes from simulation to real scenes; the flow field itself is introduced as the core representation.

free parameters (1)
  • Neural network weights
    The end-to-end policy is a neural network whose parameters are fitted to the 500k dataset.
axioms (1)
  • domain assumption: BEV observations combined with language instructions contain sufficient information to define valid navigation flow fields
    Implicit in the end-to-end mapping design and dataset construction from semantic maps.
invented entities (1)
  • Continuous flow field (no independent evidence)
    purpose: Dense spatial control supervision across the entire workspace
    Core output of the policy that enables integration-based trajectory generation from arbitrary starts.

pith-pipeline@v0.9.0 · 5547 in / 1289 out tokens · 76725 ms · 2026-05-15T17:14:10.182066+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 2 internal anchors

  [1] A. Brohan et al., “RT-1: Robotics Transformer for Real-World Control at Scale,” in Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023.
  [2] B. Zitkovich et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in 7th Annual Conference on Robot Learning, 2023.
  [3] M. J. Kim et al., “OpenVLA: An open-source vision-language-action model,” in 8th Annual Conference on Robot Learning, 2024.
  [4] C. Chi et al., “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.
  [5] K. Black et al., “π0: A Vision-Language-Action Flow Model for General Robot Control,” in Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025.
  [6] K. Black et al., “π0.5: a vision-language-action model with open-world generalization,” in 9th Annual Conference on Robot Learning, 2025.
  [7] A. Chang et al., “Matterport3d: Learning from rgb-d data in indoor environments,” in 2017 International Conference on 3D Vision (3DV). IEEE Computer Society, 2017, pp. 667–676.
  [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
  [9] M. A. Hsieh et al., “Adaptive teams of autonomous aerial and ground robots for situational awareness,” Journal of Field Robotics, vol. 24, no. 11-12, pp. 991–1014, 2007.
  [10] S. Hood, K. Benson, P. Hamod, D. Madison, J. M. O’Kane, and I. Rekleitis, “Bird’s eye view: Cooperative exploration by ugv and uav,” in 2017 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2017, pp. 247–255.
  [11] T. Dang, M. Tranzatto, S. Khattak, F. Mascarich, K. Alexis, and M. Hutter, “Graph-based subterranean exploration path planning using aerial and legged robots,” Journal of Field Robotics, vol. 37, no. 8, pp. 1363–1388, 2020.
  [12] J. P. Queralta et al., “Collaborative multi-robot search and rescue: Planning, coordination, perception, and active vision,” IEEE Access, vol. 8, pp. 191617–191643, 2020.
  [13] Z. Ravichandran et al., “Deploying foundation model-enabled air and ground robots in the field: Challenges and opportunities,” arXiv preprint arXiv:2505.09477, 2025.
  [14] H. Liu et al., “Hierarchical language models for semantic navigation and manipulation in an aerial-ground robotic system,” Advanced Intelligent Systems, p. e202500640, 2025.
  [15] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in European Conference on Computer Vision. Springer, 2020, pp. 194–210.
  [16] Z. Li et al., “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  [17] P. Anderson et al., “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683.
  [18] B. Ichter et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in 6th Annual Conference on Robot Learning, 2022.
  [19] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-nav: Robotic navigation with large pre-trained models of language, vision, and action,” in 6th Annual Conference on Robot Learning, 2022.
  [20] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23171–23181.
  [21] J. Liang et al., “Code as policies: Language model programs for embodied control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 9493–9500.
  [22] J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y. Wong, “Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9796–9810.
  [23] H. Liu et al., “Enhancing the llm-based robot manipulation through human-robot collaboration,” IEEE Robotics and Automation Letters, vol. 9, no. 8, pp. 6904–6911, 2024.
  [24] D. Driess et al., “Palm-e: An embodied multimodal language model,” in International Conference on Machine Learning, 2023, pp. 8469–8488.
  [25] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  [26] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” in The Twelfth International Conference on Learning Representations, 2024.
  [27] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” in International Conference on Machine Learning. PMLR, 2023, pp. 32211–32252.
  [28] M. Clémente et al., “Two-steps diffusion policy for robotic manipulation via genetic denoising,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  [29] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in 11th International Conference on Learning Representations, ICLR 2023, 2023.
  [30] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in The Eleventh International Conference on Learning Representations, 2023.
  [31] P. Intelligence et al., “$\pi^{*}_{0.6}$: a VLA that learns from experience,” arXiv preprint arXiv:2511.14759, 2025.
  [32] O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” The International Journal of Robotics Research, vol. 5, no. 1, pp. 90–98, 1986.
  [33] M. Alhaddad, K. Mironov, A. Staroverov, and A. Panov, “Neural potential field for obstacle-aware local motion planning,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 9313–9320.
  [34] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986.
  [35] M. Tschannen et al., “SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025.
  [36] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon et al., Eds., vol. 30. Curran Associates, Inc., 2017.
  [37] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
  [38] M. Tancik et al., “Fourier features let networks learn high frequency functions in low dimensional domains,” Advances in Neural Information Processing Systems, vol. 33, pp. 7537–7547, 2020.
  [39] E. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269–271, 1959.
  [40] D. Goetting, H. G. Singh, and A. Loquercio, “End-to-end navigation with vision-language models: Transforming spatial reasoning into question-answering,” in Proceedings of the International Conference on Neuro-symbolic Systems, ser. Proceedings of Machine Learning Research, G. Pappas, P. Ravikumar, and S. A. Seshia, Eds., vol. 288. PMLR, 28–30 May 2025, pp. 22–35.
  [41] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, Apr. 2018.

SUPPLEMENTARY MATERIAL

For completeness, we provide additional appendix materials in the supplementary document. The supplementary material incl...

Free/Obstacle Masks: Given the semantic map $S$, EXTRACTFREE constructs a binary free-space mask $M_{\text{free}} \in \{0,1\}^{H \times W}$ using the predefined dataset-specific label mapping $R$. Pixels whose semantic labels belong to the free set are marked as $M_{\text{free}} = 1$, and we define $M_{\text{obs}} = \neg M_{\text{free}}$.

Goal Sources: COMPUTEGOAL returns goal source pixels $p_g$ from the target instance specified by $\ell_{\text{target}}$, typically as a thin boundary band adjacent to the target in free space. When $\ell_{\text{target}}$ contains directional modifiers, such as left/right or front/back, the goal source is restricted to the corresponding side of the target boundary in the BEV coordinate fram...

Distance-to-obstacle Transform: DTO computes the Euclidean distance transform $D_{\text{free}}$ over free space ($M_{\text{free}} = 1$), where $D_{\text{free}}(p)$ is the distance (in pixels) from a free pixel $p$ to the nearest obstacle pixel.

Safety-aware Cost Map: COSTMAP converts $D_{\text{free}}$ into a traversal cost map $C_{\text{cost}} \ge 1$ by applying a truncated linear penalty within a safety band of radius $\rho_{\text{safe}}$:

$$C_{\text{cost}}(p) = 1 + \lambda_{\text{safe}}\,[\rho_{\text{safe}} - D_{\text{free}}(p)]_+, \tag{24}$$

where $[z]_+ = \max(0, z)$. Intuitively, this increases costs near obstacles and encourages paths with larger clearance.
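As a small worked illustration of Eq. (24), the sketch below applies the truncated linear penalty to a one-row distance map. The function name `cost_map` and the values `rho_safe = 3.0` and `lam_safe = 2.0` are hypothetical choices for the example, not settings taken from the paper.

```python
def cost_map(d_free, rho_safe, lam_safe):
    """Traversal cost 1 + lam_safe * max(0, rho_safe - D_free(p)) per pixel.

    d_free: 2D list of distances (in pixels) to the nearest obstacle.
    """
    return [[1.0 + lam_safe * max(0.0, rho_safe - d) for d in row]
            for row in d_free]

# Distances of 0.5, 1, 2, and 4 pixels from the nearest obstacle.
d_free = [[0.5, 1.0, 2.0, 4.0]]
costs = cost_map(d_free, rho_safe=3.0, lam_safe=2.0)
# costs == [[6.0, 5.0, 3.0, 1.0]]: cells outside the safety band keep cost 1.
```

Pixels farther than the safety radius are unaffected, so the penalty only reshapes paths that would otherwise hug obstacles.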

Cost-weighted Geodesic and Predecessor Map: GEODESIC runs Dijkstra's algorithm [39] on an 8-connected pixel grid, restricted to free cells ($M_{\text{free}} = 1$). We treat each free pixel as a graph node. For any two neighboring pixels $p$ and $q$ (axial/diagonal neighbors), we assign an edge cost

$$w(p,q) = \tfrac{1}{2}\big(C_{\text{cost}}(p) + C_{\text{cost}}(q)\big)\,\|p-q\|_2, \tag{25}$$

where $\|p-q\|_2 \in \{1, \sqrt{2}\}$ for axial/diagona...
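The geodesic step can be sketched with a multi-source Dijkstra search over the pixel grid. This is an illustrative sketch under stated assumptions: the edge weight is taken to be the mean of the two cell costs times the step length (my reading of Eq. (25)), diagonal moves past obstacle corners are allowed, and the names `geodesic`, `free`, `cost`, and `goals` are hypothetical.

```python
import heapq
import math

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def geodesic(free, cost, goals):
    """Multi-source Dijkstra restricted to free cells.

    Returns (dist, pred): cost-weighted distance to the nearest goal and,
    for each reachable cell, the next cell on its cheapest path to a goal.
    """
    h, w = len(free), len(free[0])
    dist = [[math.inf] * w for _ in range(h)]
    pred = [[None] * w for _ in range(h)]
    heap = []
    for r, c in goals:
        dist[r][c] = 0.0
        heap.append((0.0, r, c))
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < h and 0 <= nc < w) or not free[nr][nc]:
                continue
            # Assumed edge weight: mean cell cost times Euclidean step length.
            nd = d + 0.5 * (cost[r][c] + cost[nr][nc]) * math.hypot(dr, dc)
            if nd < dist[nr][nc]:
                dist[nr][nc] = nd
                pred[nr][nc] = (r, c)
                heapq.heappush(heap, (nd, nr, nc))
    return dist, pred

# 3x3 map with a wall in the middle column; the goal is the top-right cell.
free = [[1, 0, 1],
        [1, 0, 1],
        [1, 1, 1]]
cost = [[1.0] * 3 for _ in range(3)]
dist, pred = geodesic(free, cost, goals=[(0, 2)])
```

Following `pred` from any reachable cell walks back to a goal source, and cells that stay at infinite distance are exactly the unreachable ones.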

Pixel Distance-to-go Along the Predecessor Tree: PIXELLENGTHFROMPRED computes $D^{\text{pix}}_g(p)$, defined as the geometric remaining path length (in pixels) when repeatedly following the predecessor pointers from $p$ to a goal. Equivalently, it is the accumulated step length along the predecessor chain: $D^{\text{pix}}_g(p) = \sum_{k=0}^{K(p)-1} \|p_k - p_{k+1}\|_2$, with $p_0 = p$ and $p_{k+1} = \mathrm{pred}(p_k)$, (...

Potential Construction: We form a piecewise potential $\Phi$ following Algorithm 1:

$$\Phi(p) = \begin{cases} w_g\, D^{w}_g(p), & M_{\text{free}}(p) = 1, \\ w_{\text{obs}}\, D_{\text{obs}}(p) + b_{\text{obs}}, & M_{\text{obs}}(p) = 1. \end{cases} \tag{27}$$

We set $w_{\text{obs}} \gg w_g$ and choose $b_{\text{obs}} = \max_{p:\,M_{\text{free}}(p)=1} w_g D^{w}_g(p)$ such that the obstacle-side potentials dominate around the interface, avoiding discrete gradients that point into obstacles.

Direction and Magnitude: We smooth $\Phi$ with a Gaussian filter and compute spatial derivatives using Sobel operators. The unit direction field is

$$u(p) = \frac{-\nabla\Phi(p)}{\|\nabla\Phi(p)\|_2 + \epsilon}. \tag{28}$$

In free space, we scale the magnitude by the pixel distance-to-go and convert it to normalized coordinates: $V^*(p) = \big(u_x(p)\, D^{\text{pix}}_g(p)/W,\; u_y(p)\, D^{\text{pix}}_g(p)/H\big)$, for $M_{\text{free}}(p) = 1$. Inside ...

Reachable Free Space and Start Sampling: Although $M_{\text{free}}$ marks all non-obstacle pixels, some free regions may be disconnected from the goal sources (e.g., being fully enclosed by obstacles). Therefore, SAMPLESTART samples the start pixel $p_0$ only from the reachable subset, defined by a finite distance-to-go: $D^{\text{pix}}_g(p_0) < \infty$. The sampled start is further required...

Backtracking and Resampling: Given $p_0$, BACKTRACKPRED backtracks predecessors $p_{k+1} = \mathrm{pred}(p_k)$ until reaching a goal source to obtain a polyline $\tau_{\text{raw}}$. RESAMPLE then resamples $\tau_{\text{raw}}$ by arc length to a fixed number of waypoints to obtain $\tau^*$, which is stored in normalized coordinates $(p_x/W, p_y/H)$.

APPENDIX B
DETAILS OF EVALUATION METRICS AND PROTOCOL

We describe the evaluation protocol and metric implementations used in §V.

Final Goal Error (FGE): Let $x_{\text{end}} \in [0,1]^2$ be the endpoint of the annotated trajectory $\tau^*$. FGE is the Euclidean distance between the final resampled point and the endpoint:

$$\mathrm{FGE}(\bar{\tau}) = \|\bar{x}_{K-1} - x_{\text{end}}\|_2. \tag{30}$$

Collision Rate (CR): CR is a binary indicator of whether the resampled predicted trajectory ever enters an obstacle cell:

$$\mathrm{CR}(\bar{\tau}) = \mathbb{I}\big[\exists j : M_{\text{obs}}[p_y(\bar{x}_j),\, p_x(\bar{x}_j)] = 1\big]. \tag{31}$$

The benchmark reports the mean of $\mathrm{CR}(\bar{\tau})$ across episodes.
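The two trajectory metrics just defined, FGE and CR, can be sketched as follows. The mapping from normalized coordinates to pixel indices (`to_pixel`, using floor and clamping) is an assumption, since the exact rounding is not stated here; all function names are hypothetical.

```python
import math

def final_goal_error(traj, x_end):
    """Euclidean distance between the last trajectory point and the annotated endpoint."""
    return math.dist(traj[-1], x_end)

def to_pixel(pt, width, height):
    """Assumed mapping from normalized (x, y) to (col, row): floor, then clamp."""
    col = min(width - 1, max(0, int(pt[0] * width)))
    row = min(height - 1, max(0, int(pt[1] * height)))
    return col, row

def collision(traj, obs_mask):
    """1 if any trajectory point falls in an obstacle cell, else 0."""
    h, w = len(obs_mask), len(obs_mask[0])
    for pt in traj:
        col, row = to_pixel(pt, w, h)
        if obs_mask[row][col]:  # indexed [row][col], i.e. [y][x]
            return 1
    return 0

# 4x4 obstacle mask with a single obstacle cell at row 1, column 2.
obs_mask = [[0, 0, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
hit = [(0.1, 0.1), (0.6, 0.3), (0.9, 0.9)]   # passes through the obstacle cell
clear = [(0.1, 0.1), (0.1, 0.9)]             # stays in free space

fge = final_goal_error(hit, (0.9, 0.5))
collision_rate = (collision(hit, obs_mask) + collision(clear, obs_mask)) / 2
```

Note that this check only tests the sampled waypoints, so a coarse resampling could step over a thin obstacle without registering a collision.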

Curvature-based Smoothness (Curv): Curvature is the mean absolute change in heading angle between consecutive segments of the resampled predicted trajectory. Let segment vectors be $\Delta_j = \bar{x}_{j+1} - \bar{x}_j$. We discard degenerate segments with $\|\Delta_j\|_2 \le \epsilon$ and compute headings

$$\psi_j = \operatorname{atan2}\big(\Delta^{(v)}_j,\, \Delta^{(u)}_j\big). \tag{32}$$

Curv is then $\mathrm{Curv}(\bar{\tau}) = \frac{1}{M-1} \sum_{j=0}^{M-2} |\mathrm{WrapToPi}(\psi_{j+1} - \psi_j)|$...
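A minimal sketch of the curvature metric follows, assuming that the mean is taken over consecutive pairs of retained headings and that a path with fewer than two valid segments scores 0.0 (the latter is my assumption; the source text is cut off before any such edge case). The names `wrap_to_pi` and `curvature` are hypothetical.

```python
import math

def wrap_to_pi(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def curvature(traj, eps=1e-9):
    """Mean absolute heading change between consecutive non-degenerate segments."""
    headings = []
    for (u0, v0), (u1, v1) in zip(traj, traj[1:]):
        du, dv = u1 - u0, v1 - v0
        if math.hypot(du, dv) <= eps:
            continue  # discard degenerate (near zero-length) segments
        headings.append(math.atan2(dv, du))
    if len(headings) < 2:
        return 0.0  # assumed convention when no heading change is defined
    turns = [abs(wrap_to_pi(b - a)) for a, b in zip(headings, headings[1:])]
    return sum(turns) / len(turns)

straight = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # one duplicate point
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]                # single 90-degree turn

curv_straight = curvature(straight)
curv_corner = curvature(corner)
```

Wrapping the heading difference matters for paths that cross the ±π boundary, where a raw subtraction would report a near-full-circle turn for a small change in direction.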

Path Length Ratio (PLR): Let the path length of a resampled trajectory be

$$L(\bar{\tau}) = \sum_{j=0}^{K-2} \|\bar{x}_{j+1} - \bar{x}_j\|_2. \tag{34}$$

PLR is defined as the ratio between predicted and annotated trajectory lengths:

$$\mathrm{PLR}(\bar{\tau}) = \frac{L(\bar{\tau})}{L(\tau^*)}. \tag{35}$$

C. Flow Field Metrics

Flow field metrics are evaluated on the exact grid of the annotated flow. Let the annotated flow have a spatial re...

  56. [56]

    AE is the mean of $\{\Delta\phi_n\}$ over all evaluated points

    Angular Error (AE): We compute the clipped cosine similarity $c_n = \mathrm{clip}\left(\frac{\hat{V}_n}{\|\hat{V}_n\|_2 + \epsilon} \cdot \frac{V_n^*}{\|V_n^*\|_2 + \epsilon},\, -1,\, 1\right)$, (36) and define the per-point angular error as $\Delta\phi_n = \arccos(c_n) \cdot \frac{180}{\pi}$ (degrees). AE is the mean of $\{\Delta\phi_n\}$ over all evaluated points
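A minimal sketch of the angular error in Eq. (36), assuming the predicted and annotated flows are given as equal-length lists of 2D vectors at the evaluated points. The value of `eps` and the function name are assumptions; which points count as "evaluated" is not specified in this excerpt, so the sketch averages over every pair it is given.

```python
import math

def angular_error_deg(pred_flow, ann_flow, eps=1e-8):
    """Mean angle in degrees between predicted and annotated flow vectors."""
    errors = []
    for (pu, pv), (au, av) in zip(pred_flow, ann_flow):
        # Each vector is divided by (norm + eps), as in Eq. (36).
        p_norm = math.hypot(pu, pv) + eps
        a_norm = math.hypot(au, av) + eps
        c = (pu * au + pv * av) / (p_norm * a_norm)
        c = max(-1.0, min(1.0, c))  # clip to the valid arccos domain
        errors.append(math.degrees(math.acos(c)))
    return sum(errors) / len(errors)
```

Note that the `eps` in the denominators keeps the cosine slightly below 1 even for identical vectors, so a perfectly aligned pair yields a very small but nonzero angle.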

  57. [57]

    All baselines operate on the same BEV observation $I$, language instruction $\ell$, and start position $x_0$

    Magnitude Error (ME): Magnitude error is $\mathrm{ME} = \frac{1}{N} \sum_{n=1}^{N} \|\hat{V}_n\|_2 - \|V_n^*\|_2$. (37) APPENDIX C: DETAILS OF BASELINE IMPLEMENTATIONS. This appendix describes baseline formulations and implementation details as a supplement for §V-A. All baselines operate on the same BEV observation $I$, language instruction $\ell$, and start position $x_0$. Learned baselines use the same fro...

  58. [58]

    notes" using short ASCII words (e.g.,

    Prompt: We use the following system prompt and request a strict JSON response: System prompt for Pure VLM baseline You are a robot navigation policy operating on a TOP-DOWN VIEW (bird’s eye view) and a natural-language instruction. INPUTS - Image: a TOP-DOWN VIEW (bird’s eye view) (RGB-only; no explicit obstacle mask). - Text: (1) Instruction and (2) ...

  59. [59]

    The VLM is instructed to trust the text start if the visualization is unclear

    Query Format: To reduce ambiguity in the start location, we additionally draw a green dot on the input map, while still providing the authoritative start coordinate in text. The VLM is instructed to trust the text start if the visualization is unclear

  60. [60]

    We set the maximum output budget to 8192 tokens to reduce truncation

    Decoding and API Settings: We use Gemini-2.5-Flash with deterministic decoding (temperature = 0, top-p = 1.0) and enforce a JSON-only response format. We set the maximum output budget to 8192 tokens to reduce truncation. If the output is malformed, we retry up to two times

  61. [61]

    All coordinates are clamped to [0,1]

    Output Parsing and Waypoint Normalization: We parse the returned JSON object and extract target and trajectory. All coordinates are clamped to [0,1]. If the returned trajectory contains fewer than the required number of waypoints, we interpolate along its arc length to obtain exactly N waypoints; if it contains more, we subsample uniformly by index. This post-pr...
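The clamping and length-normalization rule described above might look like the following sketch. The function names are hypothetical, the input is assumed to be a non-empty list of `(x, y)` pairs with `n >= 2`, and the exact index-rounding used for subsampling is an assumption.

```python
import math

def _interp_arclength(pts, n):
    """Interpolate n points equally spaced along the arc length of pts."""
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        s.append(s[-1] + math.hypot(x1 - x0, y1 - y0))
    out, k = [], 0
    for i in range(n):
        t = s[-1] * i / (n - 1)
        while k < len(pts) - 2 and s[k + 1] < t:
            k += 1
        j = min(k + 1, len(pts) - 1)
        span = s[j] - s[k]
        a = (t - s[k]) / span if span > 0 else 0.0
        out.append((pts[k][0] + a * (pts[j][0] - pts[k][0]),
                    pts[k][1] + a * (pts[j][1] - pts[k][1])))
    return out

def normalize_waypoints(traj, n):
    """Clamp to [0, 1], then return exactly n waypoints."""
    pts = [(min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)) for x, y in traj]
    if len(pts) > n:
        # Too many points: subsample uniformly by index (assumed rounding rule).
        idx = [round(i * (len(pts) - 1) / (n - 1)) for i in range(n)]
        return [pts[i] for i in idx]
    if len(pts) < n:
        # Too few points: interpolate along arc length.
        return _interp_arclength(pts, n)
    return pts
```

Clamping before interpolation means out-of-range VLM outputs are pulled onto the image border first, so the interpolated waypoints also stay inside [0,1].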

  62. [62]

    VLM Prediction (Target and Obstacles): The VLM outputs a JSON object containing the target object (center and bounding box), a list of obstacle objects (each with a center and bounding box), and an optional start estimate. We use the following system instruction and request JSON-only output: System prompt for VLM in VLM+Planner baseline You are analyzing a...

  63. [63]

    The TARGET OBJECT (the object the instruction refers to) with its center, bounding box, and direction descriptor (left, right, top, bottom, none)

  64. [64]

    Other obstacle objects that might block a path, with their centers and bounding boxes

  65. [65]

    target":{{

    If multiple candidates exist, choose the one you judge to be most consistent with the instruction and overall scene layout. Output ONLY valid JSON. Required keys are exactly: {{"target":{{"name": "object_name", "center": [0.50, 0.70], "bbox": [0.40, 0.60, 0.60, 0.85], "direction": "left/right/top/bottom/none", "confidence": "high/medium/low"}}, "obstacles...

  66. [66]

    Query Format and Side-of-object Handling: The user query provides the instruction and requests (i) a target and (ii) obstacles. If the instruction specifies approaching a side of an object (e.g., left of/right of/above/below), we treat the target object’s bounding box as an additional forbidden region and set the navigation goal to a point offset from th...

  67. [67]

    If the returned JSON is invalid or missing a target bbox/center, we retry up to two times with an explicit JSON-only reminder

    Decoding and API Settings: We use Gemini-2.5-Flash with deterministic decoding (temperature = 0, top-p = 1.0) and a maximum output budget of 8192 tokens. If the returned JSON is invalid or missing a target bbox/center, we retry up to two times with an explicit JSON-only reminder

  68. [68]

    Geometric Planning (A*): We rasterize the predicted obstacle bounding boxes into a binary occupancy grid of size $G \times G$ (default $G = 128$) and run A* (8-neighborhood) from the provided start $x_0$ to the derived goal $(x_g, y_g)$. To enforce a safety margin, we inflate each obstacle by a fixed pixel radius $r$ (10 px in the bbox image space), converted to a normalized marg...
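The rasterize-then-plan step can be sketched as below. This is an illustration under several assumptions rather than the baseline's actual code: bounding boxes are taken to be normalized `(x_min, y_min, x_max, y_max)` tuples, inflation is applied as a box-shaped `margin` already expressed in normalized units (the pixel-to-normalized conversion is not reproduced because the excerpt is truncated), diagonal moves may cut obstacle corners, and all function names are hypothetical.

```python
import heapq
import math

def to_cell(x, y, G):
    """Map a normalized point to a (row, col) grid cell."""
    return (min(int(y * G), G - 1), min(int(x * G), G - 1))

def rasterize(bboxes, G=128, margin=0.0):
    """Mark every grid cell covered by an inflated bbox as occupied."""
    grid = [[0] * G for _ in range(G)]
    for x0, y0, x1, y1 in bboxes:
        c0 = max(int((x0 - margin) * G), 0)
        c1 = min(int((x1 + margin) * G), G - 1)
        r0 = max(int((y0 - margin) * G), 0)
        r1 = min(int((y1 + margin) * G), G - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid

def astar(grid, start, goal):
    """8-neighborhood A* over free cells; returns a cell path or None."""
    G = len(grid)

    def h(p):  # Euclidean distance never overestimates these step costs
        return math.hypot(p[0] - goal[0], p[1] - goal[1])

    moves = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    heap = [(h(start), 0.0, start)]
    g_cost = {start: 0.0}
    parent = {start: None}
    while heap:
        _, g, cur = heapq.heappop(heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        if g > g_cost[cur]:
            continue  # stale heap entry
        for dr, dc in moves:
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < G and 0 <= nxt[1] < G) or grid[nxt[0]][nxt[1]]:
                continue
            ng = g + math.hypot(dr, dc)
            if ng < g_cost.get(nxt, math.inf):
                g_cost[nxt] = ng
                parent[nxt] = cur
                heapq.heappush(heap, (ng + h(nxt), ng, nxt))
    return None
```

A typical call would be `astar(rasterize(obstacle_bboxes, G=128, margin=m), to_cell(*x0, 128), to_cell(xg, yg, 128))`, with the returned cells mapped back to normalized coordinates before evaluation.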

  69. [69]

    Problem Formulation and Trajectory Parameterization: All DP-family baselines model a trajectory as a length-$T$ sequence of 2D displacements $\Delta X \in \mathbb{R}^{T \times 2}$ in normalized image coordinates, with $T = 100$. Waypoints are recovered by cumulative summation from the initial state: $X = x_0 + \mathrm{cumsum}(\Delta X)$. (39) To improve optimization conditioning and keep the diffusion scale rou...
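Eq. (39) and its inverse can be illustrated with a short sketch. The function names are hypothetical, and the displacement rescaling mentioned in the truncated sentence is deliberately omitted because its details are not available in this excerpt.

```python
def recover_waypoints(x0, deltas):
    """X = x0 + cumsum(dX): accumulate displacements from the start (Eq. 39)."""
    x, y = x0
    waypoints = []
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints

def to_displacements(x0, waypoints):
    """Inverse map: per-step differences, with the first taken relative to x0."""
    prev, deltas = x0, []
    for p in waypoints:
        deltas.append((p[0] - prev[0], p[1] - prev[1]))
        prev = p
    return deltas
```

Under this parameterization the first recovered waypoint is `x0` plus the first displacement, so a length-T displacement sequence yields T waypoints that do not include the start itself.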

  70. [70]

    The encoder outputs a token sequence $C \in \mathbb{R}^{N_v \times d}$ that conditions the denoiser

    Network Architecture: a) Vision–language encoder (shared): All DP-family baselines reuse the same visual–language encoder as CoFL: SigLIP2-B/16 at 224×224, followed by the same cross-modal fusion stack (model dimension $d = 768$, 8 heads, 4 fusion layers). The encoder outputs a token sequence $C \in \mathbb{R}^{N_v \times d}$ that conditions the denoiser. b) Conditioning inputs (shared): Both...

  71. [71]

    please walk to the door

    Training Objective and Defaults: All DP variants share the same data preprocessing and displacement normalization, but use different training objectives depending on the sampler family. a) DDPM objective (DP-*-DDPM): For stochastic reverse diffusion, we adopt the standard DDPM noise-prediction parameterization. At a discrete noise level n ∈ {1, . . . , N di...

  72. [72]

    [Figure panel (a): FGE↓ versus number of segments n]

    Inference-Time Sampling: At test time, we sample Gaussian noise in displacement space and map it to a displacement sequence $\Delta\tilde{X}$ using the sampler corresponding to each training objective, then recover waypoints $X$ by rescaling and cumulative summation. [Figure: panel (a) FGE↓ versus number of segments n ...]