From Language to Logic: A Theoretical Architecture for VLM-Grounded Safe Navigation

Kalonji Harrington; Kristy Sakano; Mumu Xu

arxiv: 2605.04327 · v1 · submitted 2026-05-05 · 💻 cs.RO

Pith reviewed 2026-05-08 16:58 UTC · model grok-4.3

keywords safe navigation · signal temporal logic · vision-language models · autonomous robots · natural language instructions · unstructured environments · robot planning · runtime monitoring

The pith

Natural-language safety rules are translated into Signal Temporal Logic specifications that guide autonomous robot navigation, with vision-language models grounding the rules in the observed scene.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an architecture for letting robots follow high-level safety rules and preferences given in natural language during outdoor navigation. Human instructions are converted into formal Signal Temporal Logic statements that direct path planning and enable runtime checks for compliance. Persistent rules and terrain preferences become part of a two-dimensional cost map, while time-varying conditions are expressed as logic formulas monitored as the robot moves. The approach assumes vision-language models can interpret scenes from images to connect words directly to real-world features and constraints without task-specific training. This setup produces a navigation model that aims to satisfy both strict logic requirements and softer operator preferences through embedded formal metrics.

Core claim

The architecture translates natural-language rules into Signal Temporal Logic specifications that guide planning and navigation during runtime. Persistent, environment-centric rules and terrain preferences are grounded into a 2D cost map, while temporally dynamic requirements are expressed as STL specifications to be monitored during runtime. Vision-Language Models enable zero-shot scene understanding that maps human instructions to semantic features and environmental constraints, supporting construction of an illustrative navigation model that satisfies the STL-encoded specifications and soft preferences through formal satisfaction metrics.
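To make the cost-map half of this claim concrete, here is a minimal sketch of how per-cell semantic labels might be turned into a 2D cost map. It assumes a perception step has already produced a label for each grid cell. The class names, the cost values, and the `UNKNOWN_COST` default are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical per-class traversal costs derived from operator rules.
# float("inf") marks a hard "never enter" rule; finite values are soft preferences.
CLASS_COST: Dict[str, float] = {
    "pavement": 1.0,
    "grass": 3.0,
    "mud": 8.0,
    "water": float("inf"),
}
UNKNOWN_COST = 10.0  # assumed conservative default for classes with no rule


def build_cost_map(labels: List[List[str]]) -> List[List[float]]:
    """Map a 2D grid of semantic labels to a 2D grid of traversal costs."""
    return [[CLASS_COST.get(label, UNKNOWN_COST) for label in row] for row in labels]


labels = [
    ["pavement", "grass", "water"],
    ["pavement", "mud", "shrub"],
]
cost_map = build_cost_map(labels)
```

Using an infinite cost for forbidden classes keeps hard rules and soft preferences in one structure, which is one possible reading of "grounded into a 2D cost map".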

What carries the argument

Translation of natural language into Signal Temporal Logic (STL) specifications grounded by Vision-Language Models (VLMs) for zero-shot mapping of instructions to cost maps and runtime monitors.

If this is right

Persistent safety rules and operator preferences become encoded as costs in a 2D map used for path planning.
Temporally dynamic requirements can be checked continuously at runtime through STL monitoring.
The navigation planner can optimize paths to meet formal satisfaction metrics for both hard rules and soft preferences.
Zero-shot VLM grounding allows new rules to be added without retraining the system on specific environments.
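The planning implication above can be illustrated with a small grid search. This is a sketch only: Dijkstra's algorithm is a stand-in chosen for brevity and is not claimed to be the paper's planner, and the grid, the costs, and the `plan_path` helper are hypothetical. Infinite-cost cells act as hard rules, while finite costs act as soft preferences.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
INF = float("inf")


def plan_path(cost_map: List[List[float]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Dijkstra over a 4-connected grid; entering a cell costs cost_map[r][c].

    Cells with infinite cost are never entered. Returns the cheapest path as a
    list of cells, or None if the goal is unreachable.
    """
    rows, cols = len(cost_map), len(cost_map[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if dist > best.get(cell, INF):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            step = cost_map[nr][nc]
            if step == INF:
                continue  # hard rule: forbidden cell
            new_dist = dist + step
            if new_dist < best.get((nr, nc), INF):
                best[(nr, nc)] = new_dist
                parent[(nr, nc)] = cell
                heapq.heappush(frontier, (new_dist, (nr, nc)))
    return None


grid = [
    [1.0, 8.0, 1.0],
    [1.0, INF, 1.0],
    [1.0, 1.0, 1.0],
]
path = plan_path(grid, (0, 0), (0, 2))
```

In this toy grid the direct route crosses a high-cost cell (total cost 9), so the planner takes the longer detour along the bottom row (total cost 6) and never enters the forbidden center cell.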

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Operators without programming skills could define complex safety behaviors for robots in the field by describing them in words.
The architecture might reduce reliance on hand-tuned navigation parameters when robots enter unfamiliar outdoor areas.
Uncertainty or errors from the VLM could be handled by treating its outputs as probabilistic constraints rather than fixed ones.
The same language-to-logic pipeline might apply to other robot tasks such as manipulation or multi-agent coordination.
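The suggestion about probabilistic constraints could be sketched as follows. This is an editorial illustration, not something the paper describes: it assumes the VLM (or a downstream segmenter) returns a class probability distribution per cell, and the threshold, class names, and costs are all hypothetical.

```python
from typing import Dict

# Hypothetical per-class traversal costs; "water" is a forbidden class.
CLASS_COST: Dict[str, float] = {"pavement": 1.0, "grass": 3.0, "mud": 8.0}
FORBIDDEN = {"water"}
UNKNOWN_COST = 10.0    # assumed cost for classes with no rule
BLOCK_THRESHOLD = 0.2  # assumed: block a cell if P(forbidden) exceeds this


def cell_cost(class_probs: Dict[str, float]) -> float:
    """Turn a per-cell class distribution into a traversal cost.

    Blocks the cell when too much probability mass sits on forbidden classes;
    otherwise returns the expected cost over the remaining classes.
    """
    p_forbidden = sum(p for label, p in class_probs.items() if label in FORBIDDEN)
    if p_forbidden > BLOCK_THRESHOLD:
        return float("inf")
    allowed = {label: p for label, p in class_probs.items() if label not in FORBIDDEN}
    total = sum(allowed.values())
    if total == 0.0:
        return float("inf")  # no usable evidence: be conservative
    expected = sum(CLASS_COST.get(label, UNKNOWN_COST) * p for label, p in allowed.items())
    return expected / total


confident = cell_cost({"pavement": 0.5, "grass": 0.5})
uncertain = cell_cost({"pavement": 0.7, "water": 0.3})
```

Here a cell that is probably pavement but carries a 30% chance of water is blocked outright, which trades path efficiency for caution when the grounding is uncertain.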

Load-bearing premise

Vision-language models can reliably perform zero-shot scene understanding to map human instructions to environmental constraints and semantic features in unstructured outdoor environments.

What would settle it

A demonstration in which a vision-language model misidentifies terrain or obstacles described in a safety rule, causing the robot to violate the corresponding STL specification or cost-map preference during a real navigation run.

Figures

Figures reproduced from arXiv: 2605.04327 by Kalonji Harrington, Kristy Sakano, Mumu Xu.

**Figure 1.** Autonomous robot navigation under the proposed theoretical …

**Figure 2.** Overall theoretical architecture of our VLM-grounded safe navigation stack. We obtain …

**Figure 3.** State-dependent navigation illustrates normal versus low-battery …

Original abstract

We propose an architecture for integrating high-level, human-provided safety rules and operator-aligned semantic preferences into autonomous robot navigation in unstructured outdoor environments. In our approach, natural-language rules are translated into Signal Temporal Logic (STL) specifications that guide planning and navigation during runtime. Persistent, environment-centric rules and terrain preferences are grounded into a 2D cost map, while temporally dynamic requirements are expressed as STL specifications to be monitored during runtime. We hypothesize the use of Vision-Language Models (VLMs) for zero-shot scene understanding, enabling mapping between human instructions, semantic features, and environmental constraints. Within this framework, we construct an illustrative navigation model that is designed to satisfy a set of STL-encoded specifications and soft operator preferences through formal satisfaction metrics embedded into environmental properties and runtime monitoring.
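The "formal satisfaction metrics" in the abstract are described later on this page as derived from STL robustness semantics. A minimal sketch of that idea over a discrete-time trace follows. The two rules, the state fields, and the numbers are hypothetical, and sampling at integer steps is a simplifying assumption.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, float]
Predicate = Callable[[State], float]  # signed margin; >= 0 means satisfied


def always(pred: Predicate, trace: Sequence[State], a: int, b: int) -> float:
    """Robustness of G_[a,b] pred evaluated at step 0: the worst margin in the window."""
    return min(pred(state) for state in trace[a : b + 1])


def eventually(pred: Predicate, trace: Sequence[State], a: int, b: int) -> float:
    """Robustness of F_[a,b] pred evaluated at step 0: the best margin in the window."""
    return max(pred(state) for state in trace[a : b + 1])


def clear_of_people(state: State) -> float:
    """Hypothetical rule: stay at least 2 m from any person."""
    return state["dist_to_person"] - 2.0


def near_goal(state: State) -> float:
    """Hypothetical rule: come within 1 m of the goal."""
    return 1.0 - state["dist_to_goal"]


trace = [
    {"dist_to_person": 5.0, "dist_to_goal": 6.0},
    {"dist_to_person": 3.5, "dist_to_goal": 4.0},
    {"dist_to_person": 2.5, "dist_to_goal": 2.0},
    {"dist_to_person": 4.0, "dist_to_goal": 0.5},
]
safety = always(clear_of_people, trace, 0, 3)
reach = eventually(near_goal, trace, 0, 3)
spec = min(safety, reach)  # conjunction of the two requirements
```

A positive `spec` means the trace satisfies both requirements, and its magnitude says how much margin remains before a violation.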

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a conceptual architecture proposal linking language rules to STL specs and cost maps via VLMs for outdoor robot navigation, but it offers no tests or analysis to support the claims.

The paper proposes an architecture that converts human safety rules and preferences into Signal Temporal Logic specs and 2D cost maps for robot navigation, using VLMs for zero-shot scene understanding in unstructured outdoor environments. It stays at the level of a framework and hypothesis.

What stands out is the clear division between static terrain preferences encoded in cost maps and time-varying requirements handled by STL monitors at runtime. The illustrative navigation model shows how to incorporate formal satisfaction metrics into the planning process, which gives a structured way to handle both hard rules and soft preferences.

The soft spots are in the lack of any supporting evidence. The central claim relies on VLMs accurately mapping instructions to semantic features and constraints without fine-tuning, but there's no analysis of when this might fail, no bounds on errors, and no experimental results at all. It's presented as a proposal rather than a validated method.

This is for people working on safe autonomy and formal methods in robotics who might want to build on the high-level integration idea. A reader seeking new theorems, datasets, or working systems will come away empty. It deserves a serious referee because the architecture is logically consistent and addresses a real problem in human-robot interaction, even though it needs substantial development to be publishable. Send it to review with requests for validation experiments on the VLM component.

Referee Report

2 major / 2 minor

Summary. The paper proposes a theoretical architecture for safe autonomous robot navigation in unstructured outdoor environments. High-level natural-language safety rules and semantic preferences are translated into Signal Temporal Logic (STL) specifications for runtime monitoring of dynamic requirements and into 2D cost maps for persistent terrain preferences. The approach hypothesizes the use of Vision-Language Models (VLMs) for zero-shot scene understanding to ground instructions to environmental constraints, and constructs an illustrative navigation model intended to satisfy the resulting STL-encoded specifications and soft preferences via formal satisfaction metrics.

Significance. If the VLM zero-shot grounding hypothesis holds with sufficient reliability, the architecture could enable more interpretable and operator-aligned navigation with formal safety properties in complex settings where traditional methods struggle. The conceptual integration of language-to-STL translation, cost-map grounding, and runtime monitoring is a coherent framework that builds on existing STL planning techniques, though its significance remains prospective given the absence of supporting analysis or results.

major comments (2)

[Abstract] The safety and runtime satisfaction claims of the architecture rest on the unverified hypothesis that VLMs can perform reliable zero-shot mapping from natural-language rules to accurate semantic features and constraints; no error models, formal bounds on grounding accuracy, or fallback mechanisms for mis-grounding are described, leaving the formal guarantees unsubstantiated.
[Illustrative navigation model] The model is stated to satisfy STL specifications through embedded formal metrics, yet the manuscript supplies no derivations, simulation results, satisfaction analysis, or sensitivity study to demonstrate this property under the hypothesized VLM grounding.

minor comments (2)

The distinction between persistent rules (cost maps) and temporally dynamic requirements (STL) is conceptually clear but could be reinforced with a diagram or pseudocode example of the full pipeline.
Consider adding a dedicated limitations or assumptions subsection to explicitly discuss the scope of the VLM hypothesis.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the theoretical architecture for VLM-grounded safe navigation. We address the major comments point by point below, clarifying the scope of the work as a conceptual framework and outlining planned revisions where appropriate.

Referee: [Abstract] The safety and runtime satisfaction claims of the architecture rest on the unverified hypothesis that VLMs can perform reliable zero-shot mapping from natural-language rules to accurate semantic features and constraints; no error models, formal bounds on grounding accuracy, or fallback mechanisms for mis-grounding are described, leaving the formal guarantees unsubstantiated.

Authors: We agree that the safety and runtime properties in the proposed architecture are conditional on reliable VLM grounding, which is presented as a hypothesis rather than an empirically verified component. The manuscript is explicitly framed as a theoretical architecture (see abstract and Section 1), with the VLM role stated as a zero-shot hypothesis to enable the language-to-logic mapping. To address this, we will revise the abstract to explicitly qualify the claims as holding under the assumption of accurate VLM-based scene understanding. We will also add a dedicated paragraph in the Discussion section outlining potential sources of grounding error, high-level considerations for error models, and fallback strategies (such as conservative default constraints or operator override), while noting these as important directions for future empirical work.

revision: yes

Referee: [Illustrative navigation model] The model is stated to satisfy STL specifications through embedded formal metrics, yet the manuscript supplies no derivations, simulation results, satisfaction analysis, or sensitivity study to demonstrate this property under the hypothesized VLM grounding.

Authors: The illustrative navigation model is introduced as a conceptual design that embeds formal satisfaction metrics (derived from STL robustness semantics) directly into the cost-map and planning pipeline, such that satisfaction holds by construction when the input constraints are correctly grounded. We acknowledge that the current manuscript provides only a high-level description without explicit derivations or quantitative analysis. We will expand the relevant section with a step-by-step outline of how the embedded metrics map to STL satisfaction (including a sketch of the robustness function application) and clarify the by-construction guarantee under accurate grounding. However, full simulation results, satisfaction analysis under VLM noise, or sensitivity studies are outside the scope of this theoretical paper.

revision: partial
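For readers unfamiliar with the "STL robustness semantics" invoked in this response, the standard quantitative semantics can be summarized as below. The notation is ours rather than the manuscript's: $s$ is the signal, $f$ a real-valued function defining an atomic predicate, and $[a,b]$ a time interval.

```latex
\[
\begin{aligned}
\rho(\mu, s, t) &= f(s(t)) && \text{for a predicate } \mu \equiv f(s(t)) \ge 0,\\
\rho(\lnot\varphi, s, t) &= -\rho(\varphi, s, t),\\
\rho(\varphi_1 \land \varphi_2, s, t) &= \min\bigl(\rho(\varphi_1, s, t),\ \rho(\varphi_2, s, t)\bigr),\\
\rho(\mathbf{G}_{[a,b]}\varphi, s, t) &= \inf_{t' \in [t+a,\ t+b]} \rho(\varphi, s, t'),\\
\rho(\mathbf{F}_{[a,b]}\varphi, s, t) &= \sup_{t' \in [t+a,\ t+b]} \rho(\varphi, s, t').
\end{aligned}
\]
```

A strictly positive $\rho$ implies the signal satisfies the formula and a strictly negative $\rho$ implies it violates it, so any "by construction" guarantee of the kind described would amount to keeping $\rho$ positive under correctly grounded inputs.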

standing simulated objections not resolved

Simulation results, satisfaction analysis, and sensitivity studies for the illustrative navigation model remain unprovided, because the work is a theoretical architecture proposal with no empirical evaluation or implementation.

Circularity Check

0 steps flagged

No circularity: proposal is self-contained conceptual architecture

The manuscript presents a high-level architecture for mapping natural-language safety rules into STL specifications and 2D cost maps, then monitoring them at runtime. It explicitly labels the VLM zero-shot grounding step as a hypothesis rather than a derived result, and the illustrative navigation model is described as 'designed to satisfy' the specifications without any equations, fitted parameters, or self-citations that reduce the claims to their own inputs. No self-definitional loops, renamed empirical patterns, or load-bearing prior-author uniqueness theorems appear; the derivation chain therefore remains non-circular and externally falsifiable via the stated VLM assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about VLM capabilities and STL applicability without introducing free parameters or new entities.

axioms (2)

domain assumption: Vision-Language Models can perform zero-shot scene understanding to map human instructions to environmental constraints and semantic features.
Hypothesized in the abstract as the grounding mechanism but not demonstrated or proven.
domain assumption: Signal Temporal Logic specifications can be monitored in real time to guide planning and navigation while satisfying formal satisfaction metrics.
Invoked as the core runtime mechanism for dynamic requirements.
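The second assumption, real-time monitoring, can be made concrete with a sliding-window monitor. The sketch below is an illustration under stated assumptions, not the paper's monitor: it tracks a past-time bounded "always" (the margin held over the last `window` samples), and the class name, window length, and margins are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingAlwaysMonitor:
    """Online robustness of "margin >= 0 held over the last `window` samples".

    Keeps a deque of (step, margin) pairs with increasing margins, so the
    window minimum is available in amortized O(1) time per sample.
    """

    def __init__(self, window: int) -> None:
        self.window = window
        self.step = 0
        self.candidates: Deque[Tuple[int, float]] = deque()

    def update(self, margin: float) -> Optional[float]:
        """Feed one margin sample; return the window minimum once the window is full."""
        # Older samples with a larger margin can never be the minimum again.
        while self.candidates and self.candidates[-1][1] >= margin:
            self.candidates.pop()
        self.candidates.append((self.step, margin))
        # Discard samples that have fallen out of the window.
        while self.candidates[0][0] <= self.step - self.window:
            self.candidates.popleft()
        self.step += 1
        return self.candidates[0][1] if self.step >= self.window else None


monitor = SlidingAlwaysMonitor(window=3)
margins = [1.5, 0.8, 1.2, -0.1, 0.9, 1.0, 2.0]
verdicts = [monitor.update(m) for m in margins]
```

A negative verdict flags a violation somewhere in the most recent window, which is the point at which a runtime layer could trigger replanning or a fallback behavior.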

pith-pipeline@v0.9.0 · 5430 in / 1308 out tokens · 86239 ms · 2026-05-08T16:58:26.565221+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

[1]

UA V Trajectory Planning for Static and Dynamic Environments,

J. Ruz, O. Arevalo, G. Pajares, and J. M. D. La Cruz, “UA V Trajectory Planning for Static and Dynamic Environments,” inAerial Vehicles, T. Mung, Ed. InTech, Jan. 2009

work page 2009
[2]

Autonomous Systems in Unstructured Environments: AI Approaches for Robust Operation,

M. Saleem Sultan and M. Shahid Sultan, “Autonomous Systems in Unstructured Environments: AI Approaches for Robust Operation,” International Journal of Science and Research (IJSR), vol. 13, no. 8, pp. 1348–1355, Aug. 2024

work page 2024
[3]

A Survey on Path Planning for Autonomous Ground Vehicles in Unstructured Environments,

N. Wang, X. Li, K. Zhang, J. Wang, and D. Xie, “A Survey on Path Planning for Autonomous Ground Vehicles in Unstructured Environments,”Machines, vol. 12, no. 1, p. 31, Jan. 2024

work page 2024
[4]

Safety-critical advanced robots: A survey,

J. Guiochet, M. Machin, and H. Waeselynck, “Safety-critical advanced robots: A survey,”Robotics and Autonomous Systems, vol. 94, pp. 43– 52, Aug. 2017

work page 2017
[5]

Real-Time Metric- Semantic Mapping for Autonomous Navigation in Outdoor Environments,

J. Jiao, R. Geng, Y . Li, R. Xin, B. Yang, J. Wu, L. Wang, M. Liu, R. Fan, and D. Kanoulas, “Real-Time Metric- Semantic Mapping for Autonomous Navigation in Outdoor Environments,” vol. 22, pp. 5729–5740. [Online]. Available: https://ieeexplore.ieee.org/document/10620438/

work page arXiv
[6]

ROS-Based Navigation and Obstacle Avoidance: A Study of Architectures, Methods, and Trends,

Z. Wei, S. Wang, K. Chen, and F. Wang, “ROS-Based Navigation and Obstacle Avoidance: A Study of Architectures, Methods, and Trends,” Sensors, vol. 25, no. 14, p. 4306, Jan. 2025

work page 2025
[7]

Using RGB Image as Visual Input for Mapless Robot Navigation,

L. Ma, Y . Liu*, and J. Chen, “Using RGB Image as Visual Input for Mapless Robot Navigation,” Apr. 2019

work page 2019
[8]

An Open-Source Low-Cost Mobile Robot System With an RGB-D Camera and Efficient Real-Time Navigation Algorithm,

T. Kim, S. Lim, G. Shin, G. Sim, and D. Yun, “An Open-Source Low-Cost Mobile Robot System With an RGB-D Camera and Efficient Real-Time Navigation Algorithm,”IEEE Access, vol. 10, pp. 127 871– 127 881, 2022

work page 2022
[9]

Learning Transferable Visual Models From Natural Language Super- vision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Super- vision,” 2021

work page 2021
[10]

FLA V A: A Foundational Language And Vision Align- ment Model,

A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “FLA V A: A Foundational Language And Vision Align- ment Model,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, Jun. 2022, pp. 15 617–15 629

work page 2022
[11]

Formal methods for autonomous systems,

T. Wongpiromsarn, M. Ghasemi, M. Cubuktepe, G. Bakirtzis, S. Carr, M. O. Karabag, C. Neary, P. Gohari, and U. Topcu, “Formal Methods for Autonomous Systems,” vol. 10, no. 3–4, pp. 180–407. [Online]. Available: http://arxiv.org/abs/2311.01258

work page arXiv
[12]

Motion planning with temporal-logic specifications: Progress and challenges,

E. Plaku and S. Karaman, “Motion planning with temporal-logic specifications: Progress and challenges,”AI Communications, vol. 29, no. 1, pp. 151–162, Nov. 2014

work page 2014
[13]

A formal methods approach to interpretable reinforcement learning for robotic planning,

X. Li, Z. Serlin, G. Yang, and C. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,”Science Robotics, vol. 4, no. 37, p. eaay6276, Dec. 2019

work page 2019
[14]

Formal methods in robot policy learning and verification: A survey on current techniques and future directions,

A. Manganaris, V . Giammarino, A. H. Qureshi, and S. Jagannathan, “Formal methods in robot policy learning and verification: A survey on current techniques and future directions,” 2026. [Online]. Available: https://arxiv.org/abs/2602.06971

work page arXiv 2026
[15]

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,

D. Shah, B. Osinski, B. Ichter, and S. Levine, “LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,” 2022

work page 2022
[16]

VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning,

Y . Du, T. Fu, Z. Chen, B. Li, S. Su, Z. Zhao, and C. Wang, “VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning,” Mar. 2025

work page 2025
[17]

Behav: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes,

K. Weerakoon, M. Elnoor, G. Seneviratne, V . Rajagopal, S. H. Arul, J. Liang, M. K. M. Jaffar, and D. Manocha, “Behav: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes,” in2025 IEEE International Conference on Robotics and Automation (ICRA). Atlanta, GA, USA: IEEE, May 2025, pp. 7044– 7051

work page 2025
[18]

DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation,

Z. Li, S. Li, Z. Zhang, B. Li, and S. Zhou, “DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation,” 2026

work page 2026
[19]

Runtime Assurance from Signal Temporal Logic Safety Specifications,

L. Baird and S. Coogan, “Runtime Assurance from Signal Temporal Logic Safety Specifications,” in2023 American Control Conference (ACC). San Diego, CA, USA: IEEE, May 2023, pp. 3535–3540

work page 2023
[20]

R. Liu, A. Hou, X. Yu, and X. Yin. Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks. [Online]. Available: http://arxiv.org/abs/2501.13457

work page arXiv
[21]

Trajec- tory Planning with Signal Temporal Logic Costs using Deterministic Path Integral Optimization

P. Halder, H. Homburger, L. Kiltz, J. Reuter, and M. Althoff, “Trajec- tory Planning with Signal Temporal Logic Costs using Deterministic Path Integral Optimization.”

work page
[22]

Kapoor, S

P. Kapoor, S. Vemprala, and A. Kapoor. Logically Constrained Robotics Transformers for Enhanced Perception-Action Planning. [Online]. Available: http://arxiv.org/abs/2408.05336

work page arXiv
[23]

B. Ye, J. Huang, Y . Liu, X. Qiao, and X. Yin. Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks. [Online]. Available: http://arxiv.org/abs/2509.12813

work page arXiv
[24]

Vision-Language Models for Vision Tasks: A Survey,

J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-Language Models for Vision Tasks: A Survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5625–5644, Aug. 2024

work page 2024
[25]

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges,

Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, “A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges,” 2025

work page 2025
[26]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAd- vances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 34 892–34 916

work page 2023
[27]

Qwen2.5-VL Technical Report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-VL Technical Report,” 2025

work page 2025
[28]

Open vocabulary scene parsing,

H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, “Open vocabulary scene parsing,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2021–2029

work page 2017
[29]

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation,

Y . Feng, Y . Liu, S. Yang, W. Cai, J. Zhang, Q. Zhan, Z. Huang, H. Yan, Q. Wan, C. Liu, J. Wang, J. Lv, Z. Liu, T. Shi, Q. Liu, and Y . Wang, “Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation,” 2025

work page 2025
[30]

GroupViT: Semantic Segmentation Emerges from Text Supervision,

J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “GroupViT: Semantic Segmentation Emerges from Text Supervision,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, Jun. 2022, pp. 18 113–18 123

work page 2022
[31]

CAT- Seg: Cost Aggregation for Open-V ocabulary Semantic Segmentation,

S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “CAT- Seg: Cost Aggregation for Open-V ocabulary Semantic Segmentation,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, W A, USA: IEEE, Jun. 2024, pp. 4113– 4123

work page 2024
[32]

LLMFormer: Large Language Model for Open-V ocabulary Semantic Segmentation,

H. Shi, S. D. Dao, and J. Cai, “LLMFormer: Large Language Model for Open-V ocabulary Semantic Segmentation,”International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, Feb. 2025

work page 2025
[33]

Visual Language Maps for Robot Navigation,

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in2023 IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom: IEEE, May 2023, pp. 10 608–10 615

work page 2023
[34]

Any- Traverse: An off-road traversability framework with VLM and human operator in the loop,

S. Sahu, A. Singh, K. Nambiar, S. Saripalli, and P. B. Sujit, “Any- Traverse: An off-road traversability framework with VLM and human operator in the loop,” 2025

work page 2025
[35]

VLM-Social-Nav: Socially Aware Robot Navigation Through Scoring Using Vision-Language Models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially Aware Robot Navigation Through Scoring Using Vision-Language Models,”IEEE Robotics and Automation Letters, vol. 10, no. 1, pp. 508–515, Jan. 2025

work page 2025
[36]

VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UA V Navigation,

J. Ye, S. Papaioannou, and P. Kolios, “VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UA V Navigation,” in2025 International Conference on Unmanned Aircraft Systems (ICUAS), May 2025, pp. 633–640

work page 2025
[37]

Rapidly-Exploring Random Trees: A New Tool for Path Planning,

S. LaValle, “Rapidly-Exploring Random Trees: A New Tool for Path Planning,” Oct. 1998

work page 1998
[38]

Monitoring Temporal Properties of Continuous Signals,

O. Maler and D. Nickovic, “Monitoring Temporal Properties of Continuous Signals,” inFormal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y . Vardi, G. Weikum...

work page 2004
[39]

On Signal Temporal Logic,

A. Donz ´e, “On Signal Temporal Logic,” inRuntime Verification, A. Legay and S. Bensalem, Eds. Berlin, Heidelberg: Springer, 2013, pp. 382–383

work page 2013
[40]

Robust Satisfaction of Temporal Logic over Real-Valued Signals,

A. Donz ´e and O. Maler, “Robust Satisfaction of Temporal Logic over Real-Valued Signals,” inFormal Modeling and Analysis of Timed Sys- tems, D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y . Vardi, G. Weikum, K. Chatterjee, and ...

work page 2010
[41]

Planning with Preferences,

J. A. Baier and S. A. McIlraith, “Planning with Preferences,” vol. 29, no. 4, pp. 25–36. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1609/aimag.v29i4.2204

work page doi:10.1609/aimag.v29i4.2204
[42]

DeepSTL: From english requirements to signal temporal logic,

J. He, E. Bartocci, D. Ni ˇckovi´c, H. Isakovic, and R. Grosu, “DeepSTL: From english requirements to signal temporal logic,” inProceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. New York, NY , USA: Association for Computing Machinery, Jul. 2022, pp. 610–622

work page 2022
[43]

NL2STL: Transformation from Logic Natural Language to Sig- nal Temporal Logics using Llama2,

Y . Mao, T. Zhang, X. Cao, Z. Chen, X. Liang, B. Xu, and H. Fang, “NL2STL: Transformation from Logic Natural Language to Sig- nal Temporal Logics using Llama2,” in2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM), Aug. 2024, pp. 469–474

work page 2024
[44]

Learning from Failures: Translation of Natural Language Requirements into Linear Temporal Logic with Large Language Models,

Y . Xu, J. Feng, and W. Miao, “Learning from Failures: Translation of Natural Language Requirements into Linear Temporal Logic with Large Language Models,” in2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), Jul. 2024, pp. 204–215

work page 2024
[45]

NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Mod- els,

Y . Chen, R. Gandhi, Y . Zhang, and C. Fan, “NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Mod- els,” Mar. 2024

work page 2024
[46]

Formal Synthesis of Embedded Control Software: Application to Vehicle Management Sys- tems,

T. Wongpiromsarn, U. Topcu, and R. Murray, “Formal Synthesis of Embedded Control Software: Application to Vehicle Management Sys- tems,” inInfotech@Aerospace 2011. St. Louis, Missouri: American Institute of Aeronautics and Astronautics, Mar. 2011

work page 2011
[47]

Image Segmentation Using Text and Image Prompts,

T. Luddecke and A. Ecker, “Image Segmentation Using Text and Image Prompts,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 7076–7086. [Online]. Available: https://ieeexplore.ieee.org/document/9879551/

work page arXiv
[48]

RRT X: Asymptotically optimal single-query sampling-based motion planning with quick replanning,

M. Otte and E. Frazzoli, “RRT X : Asymptotically optimal single-query sampling-based motion planning with quick replanning,” vol. 35, no. 7, pp. 797–822. [Online]. Available: https://journals.sagepub.com/doi/10.1177/0278364915594679

work page doi:10.1177/0278364915594679
[49]

RRTX: Real-Time Motion Planning/Replanning for Environ- ments with Unpredictable Obstacles

——, “RRTX: Real-Time Motion Planning/Replanning for Environ- ments with Unpredictable Obstacles.”

work page
[50]

Synthesis of Reac- tive Switching Protocols From Temporal Logic Specifications,

J. Liu, N. Ozay, U. Topcu, and R. M. Murray, “Synthesis of Reac- tive Switching Protocols From Temporal Logic Specifications,”IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1771–1785, Jul. 2013

work page 2013

[1] [1]

UA V Trajectory Planning for Static and Dynamic Environments,

J. Ruz, O. Arevalo, G. Pajares, and J. M. D. La Cruz, “UA V Trajectory Planning for Static and Dynamic Environments,” inAerial Vehicles, T. Mung, Ed. InTech, Jan. 2009

work page 2009

[2] [2]

Autonomous Systems in Unstructured Environments: AI Approaches for Robust Operation,

M. Saleem Sultan and M. Shahid Sultan, “Autonomous Systems in Unstructured Environments: AI Approaches for Robust Operation,” International Journal of Science and Research (IJSR), vol. 13, no. 8, pp. 1348–1355, Aug. 2024

work page 2024

[3] [3]

A Survey on Path Planning for Autonomous Ground Vehicles in Unstructured Environments,

N. Wang, X. Li, K. Zhang, J. Wang, and D. Xie, “A Survey on Path Planning for Autonomous Ground Vehicles in Unstructured Environments,”Machines, vol. 12, no. 1, p. 31, Jan. 2024

work page 2024

[4] [4]

Safety-critical advanced robots: A survey,

J. Guiochet, M. Machin, and H. Waeselynck, “Safety-critical advanced robots: A survey,”Robotics and Autonomous Systems, vol. 94, pp. 43– 52, Aug. 2017

work page 2017

[5] [5]

Real-Time Metric- Semantic Mapping for Autonomous Navigation in Outdoor Environments,

J. Jiao, R. Geng, Y . Li, R. Xin, B. Yang, J. Wu, L. Wang, M. Liu, R. Fan, and D. Kanoulas, “Real-Time Metric- Semantic Mapping for Autonomous Navigation in Outdoor Environments,” vol. 22, pp. 5729–5740. [Online]. Available: https://ieeexplore.ieee.org/document/10620438/

work page arXiv

[6] [6]

ROS-Based Navigation and Obstacle Avoidance: A Study of Architectures, Methods, and Trends,

Z. Wei, S. Wang, K. Chen, and F. Wang, “ROS-Based Navigation and Obstacle Avoidance: A Study of Architectures, Methods, and Trends,” Sensors, vol. 25, no. 14, p. 4306, Jan. 2025

work page 2025

[7] [7]

Using RGB Image as Visual Input for Mapless Robot Navigation,

L. Ma, Y . Liu*, and J. Chen, “Using RGB Image as Visual Input for Mapless Robot Navigation,” Apr. 2019

work page 2019

[8] [8]

An Open-Source Low-Cost Mobile Robot System With an RGB-D Camera and Efficient Real-Time Navigation Algorithm,

T. Kim, S. Lim, G. Shin, G. Sim, and D. Yun, “An Open-Source Low-Cost Mobile Robot System With an RGB-D Camera and Efficient Real-Time Navigation Algorithm,” IEEE Access, vol. 10, pp. 127 871–127 881, 2022

work page 2022

[9] [9]

Learning Transferable Visual Models From Natural Language Supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” 2021

work page 2021

[10] [10]

FLAVA: A Foundational Language And Vision Alignment Model,

A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “FLAVA: A Foundational Language And Vision Alignment Model,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, Jun. 2022, pp. 15 617–15 629

work page 2022

[11] [11]

Formal methods for autonomous systems,

T. Wongpiromsarn, M. Ghasemi, M. Cubuktepe, G. Bakirtzis, S. Carr, M. O. Karabag, C. Neary, P. Gohari, and U. Topcu, “Formal Methods for Autonomous Systems,” vol. 10, no. 3–4, pp. 180–407. [Online]. Available: http://arxiv.org/abs/2311.01258

work page arXiv

[12] [12]

Motion planning with temporal-logic specifications: Progress and challenges,

E. Plaku and S. Karaman, “Motion planning with temporal-logic specifications: Progress and challenges,” AI Communications, vol. 29, no. 1, pp. 151–162, Nov. 2014

work page 2014

[13] [13]

A formal methods approach to interpretable reinforcement learning for robotic planning,

X. Li, Z. Serlin, G. Yang, and C. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,” Science Robotics, vol. 4, no. 37, p. eaay6276, Dec. 2019

work page 2019

[14] [14]

Formal methods in robot policy learning and verification: A survey on current techniques and future directions,

A. Manganaris, V. Giammarino, A. H. Qureshi, and S. Jagannathan, “Formal methods in robot policy learning and verification: A survey on current techniques and future directions,” 2026. [Online]. Available: https://arxiv.org/abs/2602.06971

work page arXiv 2026

[15] [15]

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,

D. Shah, B. Osinski, B. Ichter, and S. Levine, “LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action,” 2022

work page 2022

[16] [16]

VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning,

Y. Du, T. Fu, Z. Chen, B. Li, S. Su, Z. Zhao, and C. Wang, “VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning,” Mar. 2025

work page 2025

[17] [17]

Behav: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes,

K. Weerakoon, M. Elnoor, G. Seneviratne, V. Rajagopal, S. H. Arul, J. Liang, M. K. M. Jaffar, and D. Manocha, “Behav: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). Atlanta, GA, USA: IEEE, May 2025, pp. 7044–7051

work page 2025

[18] [18]

DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation,

Z. Li, S. Li, Z. Zhang, B. Li, and S. Zhou, “DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation,” 2026

work page 2026

[19] [19]

Runtime Assurance from Signal Temporal Logic Safety Specifications,

L. Baird and S. Coogan, “Runtime Assurance from Signal Temporal Logic Safety Specifications,” in 2023 American Control Conference (ACC). San Diego, CA, USA: IEEE, May 2023, pp. 3535–3540

work page 2023

[20] [20]

R. Liu, A. Hou, X. Yu, and X. Yin. Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks. [Online]. Available: http://arxiv.org/abs/2501.13457

work page arXiv

[21] [21]

Trajectory Planning with Signal Temporal Logic Costs using Deterministic Path Integral Optimization

P. Halder, H. Homburger, L. Kiltz, J. Reuter, and M. Althoff, “Trajectory Planning with Signal Temporal Logic Costs using Deterministic Path Integral Optimization.”

work page

[22] [22]

Logically Constrained Robotics Transformers for Enhanced Perception-Action Planning

P. Kapoor, S. Vemprala, and A. Kapoor. Logically Constrained Robotics Transformers for Enhanced Perception-Action Planning. [Online]. Available: http://arxiv.org/abs/2408.05336

work page arXiv

[23] [23]

B. Ye, J. Huang, Y. Liu, X. Qiao, and X. Yin. Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks. [Online]. Available: http://arxiv.org/abs/2509.12813

work page arXiv

[24] [24]

Vision-Language Models for Vision Tasks: A Survey,

J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-Language Models for Vision Tasks: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5625–5644, Aug. 2024

work page 2024

[25] [25]

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges,

Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, “A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges,” 2025

work page 2025

[26] [26]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 34 892–34 916

work page 2023

[27] [27]

Qwen2.5-VL Technical Report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-VL Technical Report,” 2025

work page 2025

[28] [28]

Open vocabulary scene parsing,

H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, “Open vocabulary scene parsing,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2021–2029

work page 2017

[29] [29]

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation,

Y. Feng, Y. Liu, S. Yang, W. Cai, J. Zhang, Q. Zhan, Z. Huang, H. Yan, Q. Wan, C. Liu, J. Wang, J. Lv, Z. Liu, T. Shi, Q. Liu, and Y. Wang, “Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation,” 2025

work page 2025

[30] [30]

GroupViT: Semantic Segmentation Emerges from Text Supervision,

J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “GroupViT: Semantic Segmentation Emerges from Text Supervision,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, Jun. 2022, pp. 18 113–18 123

work page 2022

[31] [31]

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation,

S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, Jun. 2024, pp. 4113–4123

work page 2024

[32] [32]

LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation,

H. Shi, S. D. Dao, and J. Cai, “LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation,” International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, Feb. 2025

work page 2025

[33] [33]

Visual Language Maps for Robot Navigation,

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom: IEEE, May 2023, pp. 10 608–10 615

work page 2023

[34] [34]

AnyTraverse: An off-road traversability framework with VLM and human operator in the loop,

S. Sahu, A. Singh, K. Nambiar, S. Saripalli, and P. B. Sujit, “AnyTraverse: An off-road traversability framework with VLM and human operator in the loop,” 2025

work page 2025

[35] [35]

VLM-Social-Nav: Socially Aware Robot Navigation Through Scoring Using Vision-Language Models,

D. Song, J. Liang, A. Payandeh, A. H. Raj, X. Xiao, and D. Manocha, “VLM-Social-Nav: Socially Aware Robot Navigation Through Scoring Using Vision-Language Models,” IEEE Robotics and Automation Letters, vol. 10, no. 1, pp. 508–515, Jan. 2025

work page 2025

[36] [36]

VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation,

J. Ye, S. Papaioannou, and P. Kolios, “VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation,” in 2025 International Conference on Unmanned Aircraft Systems (ICUAS), May 2025, pp. 633–640

work page 2025

[37] [37]

Rapidly-Exploring Random Trees: A New Tool for Path Planning,

S. LaValle, “Rapidly-Exploring Random Trees: A New Tool for Path Planning,” Oct. 1998

work page 1998

[38] [38]

Monitoring Temporal Properties of Continuous Signals,

O. Maler and D. Nickovic, “Monitoring Temporal Properties of Continuous Signals,” in Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y. Vardi, G. Weikum...

work page 2004

[39] [39]

On Signal Temporal Logic,

A. Donzé, “On Signal Temporal Logic,” in Runtime Verification, A. Legay and S. Bensalem, Eds. Berlin, Heidelberg: Springer, 2013, pp. 382–383

work page 2013

[40] [40]

Robust Satisfaction of Temporal Logic over Real-Valued Signals,

A. Donzé and O. Maler, “Robust Satisfaction of Temporal Logic over Real-Valued Signals,” in Formal Modeling and Analysis of Timed Systems, D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y. Vardi, G. Weikum, K. Chatterjee, and ...

work page 2010

[41] [41]

Planning with Preferences,

J. A. Baier and S. A. McIlraith, “Planning with Preferences,” vol. 29, no. 4, pp. 25–36. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1609/aimag.v29i4.2204

work page doi:10.1609/aimag.v29i4.2204

[42] [42]

DeepSTL: From english requirements to signal temporal logic,

J. He, E. Bartocci, D. Ničković, H. Isakovic, and R. Grosu, “DeepSTL: From english requirements to signal temporal logic,” in Proceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. New York, NY, USA: Association for Computing Machinery, Jul. 2022, pp. 610–622

work page 2022

[43] [43]

NL2STL: Transformation from Logic Natural Language to Signal Temporal Logics using Llama2,

Y. Mao, T. Zhang, X. Cao, Z. Chen, X. Liang, B. Xu, and H. Fang, “NL2STL: Transformation from Logic Natural Language to Signal Temporal Logics using Llama2,” in 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM), Aug. 2024, pp. 469–474

work page 2024

[44] [44]

Learning from Failures: Translation of Natural Language Requirements into Linear Temporal Logic with Large Language Models,

Y. Xu, J. Feng, and W. Miao, “Learning from Failures: Translation of Natural Language Requirements into Linear Temporal Logic with Large Language Models,” in 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), Jul. 2024, pp. 204–215

work page 2024

[45] [45]

NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models,

Y. Chen, R. Gandhi, Y. Zhang, and C. Fan, “NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models,” Mar. 2024

work page 2024

[46] [46]

Formal Synthesis of Embedded Control Software: Application to Vehicle Management Systems,

T. Wongpiromsarn, U. Topcu, and R. Murray, “Formal Synthesis of Embedded Control Software: Application to Vehicle Management Systems,” in Infotech@Aerospace 2011. St. Louis, Missouri: American Institute of Aeronautics and Astronautics, Mar. 2011

work page 2011

[47] [47]

Image Segmentation Using Text and Image Prompts,

T. Luddecke and A. Ecker, “Image Segmentation Using Text and Image Prompts,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 7076–7086. [Online]. Available: https://ieeexplore.ieee.org/document/9879551/

work page arXiv

[48] [48]

RRTX: Asymptotically optimal single-query sampling-based motion planning with quick replanning,

M. Otte and E. Frazzoli, “RRTX: Asymptotically optimal single-query sampling-based motion planning with quick replanning,” vol. 35, no. 7, pp. 797–822. [Online]. Available: https://journals.sagepub.com/doi/10.1177/0278364915594679

work page doi:10.1177/0278364915594679

[49] [49]

RRTX: Real-Time Motion Planning/Replanning for Environments with Unpredictable Obstacles

——, “RRTX: Real-Time Motion Planning/Replanning for Environments with Unpredictable Obstacles.”

work page

[50] [50]

Synthesis of Reactive Switching Protocols From Temporal Logic Specifications,

J. Liu, N. Ozay, U. Topcu, and R. M. Murray, “Synthesis of Reactive Switching Protocols From Temporal Logic Specifications,” IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1771–1785, Jul. 2013

work page 2013