pith. machine review for the scientific record.

arXiv:2604.23970 · v1 · submitted 2026-04-27 · 💻 cs.AI · cs.CV · cs.HC · cs.MA


LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People


Pith reviewed 2026-05-08 03:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.HC · cs.MA
keywords LLM · agentic framework · floor plan parsing · indoor navigation · accessibility · blind and low-vision · spatial knowledge graph · path planning

The pith

A multi-agent LLM framework turns floor plan images into structured maps for generating safe navigation instructions for blind and low-vision people.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop an agentic system with two main phases to address indoor navigation challenges for blind and low-vision individuals. First, a multi-agent module parses a single floor plan image into a spatial knowledge graph using self-correcting iterative loops and feedback. Second, a path planner creates routes while a safety evaluator checks for hazards. Tested on real university building floors, the system achieves higher success rates for short, medium, and long routes than single-call LLM baselines, suggesting it can provide accessible guidance with minimal infrastructure.
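As a reading aid, the two-phase design maps onto a small amount of scaffolding. Below is a minimal sketch in Python; every name (SpatialKG, parse_floor_plan, plan_route, evaluate_safety) is hypothetical rather than taken from the paper, and the bodies are stubs indicating the intended call order.

    # Minimal sketch of the two-phase workflow; all names are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class SpatialKG:
        rooms: dict = field(default_factory=dict)    # room_id -> attributes (type, label)
        edges: list = field(default_factory=list)    # (room_a, room_b, {"via": "door", ...})
        hazards: list = field(default_factory=list)  # e.g. ("stair-A", "open stairwell")

    def parse_floor_plan(image_path: str) -> SpatialKG:
        """Phase 1: multi-agent parsing with self-correction (sketched further below)."""
        raise NotImplementedError

    def plan_route(kg: SpatialKG, start: str, goal: str) -> list:
        """Phase 2a: Path Planner proposes an accessible step sequence."""
        raise NotImplementedError

    def evaluate_safety(kg: SpatialKG, route: list) -> list:
        """Phase 2b: Safety Evaluator returns hazard warnings along the route."""
        raise NotImplementedError

    # Intended call order:
    #   kg = parse_floor_plan("mp1.png")
    #   route = plan_route(kg, "Room 101", "Elevator A")
    #   warnings = evaluate_safety(kg, route)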

Core claim

The paper shows that the agentic parsing pipeline builds a retrievable spatial knowledge base from one floor plan image, enabling a path planner and safety evaluator to produce accessible navigation instructions that succeed at rates up to 92.31% for short routes on the MP-1 floor, outperforming the strongest baseline.

What carries the argument

The multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback.
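A minimal sketch of what such a self-correcting loop could look like, assuming a caller-supplied call_llm function and an invented JSON schema (rooms with id fields, edges with from/to keys); the paper's actual prompts and schema are not reproduced here.

    # Self-correcting extraction: propose a graph, validate it, and feed
    # validation errors back into the next attempt as corrective hints.
    import json

    def validate(graph):
        """Return a list of structural errors (empty list means the graph passes)."""
        errors = []
        rooms = {r["id"] for r in graph.get("rooms", [])}
        for edge in graph.get("edges", []):
            for endpoint in (edge["from"], edge["to"]):
                if endpoint not in rooms:
                    errors.append(f"edge references undefined room {endpoint!r}")
        return errors

    def extract_with_feedback(image, call_llm, max_retries=3):
        feedback = ""
        for _ in range(max_retries):
            raw = call_llm(image=image,
                           prompt="Extract the room-connectivity graph as JSON." + feedback)
            try:
                graph = json.loads(raw)
            except json.JSONDecodeError as err:
                feedback = f"\nPrevious output was not valid JSON ({err}). Return ONLY valid JSON."
                continue
            errors = validate(graph)
            if not errors:
                return graph
            feedback = "\nFix these issues and re-extract: " + "; ".join(errors)
        raise RuntimeError("floor plan parsing failed after retries")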

If this is right

  • Provides a scalable solution for accessible indoor navigation using only existing floor plans.
  • Achieves consistent performance gains over single LLM calls across route lengths and building floors.
  • Reduces reliance on costly per-building infrastructure for BLV navigation.
  • Supports generation of safe routes by evaluating hazards along proposed paths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend to dynamic environments by updating the knowledge graph with real-time sensor data.
  • Similar agentic workflows might improve other tasks like generating tactile maps or voice descriptions of spaces.
  • Testing on diverse building types would reveal how well the system generalizes beyond university structures.

Load-bearing premise

The LLM agents accurately extract spatial relations and hazard information from floor plan images without systematic errors or the need for building-specific fine-tuning.
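If that premise holds, the extracted relations reduce to an attributed graph. One plausible encoding, sketched with NetworkX (which appears in the paper's reference list); the node and edge attributes here are illustrative, not the paper's schema.

    # Illustrative spatial knowledge graph: rooms, corridors, and hazards as
    # attributed nodes; doors and openings as attributed edges.
    import networkx as nx

    g = nx.Graph()
    g.add_node("MP-101", kind="office", label="Room 101")
    g.add_node("corridor-1", kind="corridor")
    g.add_node("stair-A", kind="stairs", hazard="open stairwell")
    g.add_edge("MP-101", "corridor-1", via="door", door_swing="inward")
    g.add_edge("corridor-1", "stair-A", via="opening")

    # A route is a node path; a safety pass reads hazards off the attributes.
    route = nx.shortest_path(g, "MP-101", "stair-A")
    hazards = [g.nodes[n]["hazard"] for n in route if "hazard" in g.nodes[n]]
    print(route, hazards)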

What would settle it

Running the system on a new building and finding that many generated routes contain unaccounted hazards or fail to reach the destination more often than the baseline method.

Figures

Figures reproduced from arXiv:2604.23970 by Aydin Ayanzadeh and Tim Oates.

Figure 1: Overview of our agentic pipeline showing the five-agent …
Figure 2: Sample floor plan images used in this study. (a) MP-3 and (b) MP-1 are architectural floor plans with detailed room and structural …
Figure 3: End-to-end architecture of the workflow. Left: The Parser extracts rooms and adjacencies, the Graph Builder constructs a …
Figure 4: Components of the multimodal knowledge base: graph …
Figure 5: Bar plot comparing zero-shot navigation accuracy across …
Original abstract

Indoor navigation remains a critical accessibility challenge for blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an agentic LLM framework that parses a single floor-plan image into a structured spatial knowledge graph via a multi-agent self-correcting pipeline with retry loops, then employs a Path Planner and Safety Evaluator to generate accessible navigation instructions for blind and low-vision users. It reports concrete success rates on two floors of the UMBC Math & Psychology building (MP-1: 92.31%/76.92%/61.54% for short/medium/long routes; MP-3: 76.92%/61.54%/38.46%) that outperform single-call baselines such as Claude 3.7 Sonnet, and states that the system was also evaluated on the CVC-FP benchmark.

Significance. If the performance claims hold under transparent protocols, the work offers a potentially scalable, low-infrastructure alternative to hardware-dependent indoor navigation aids by turning existing floor plans into retrievable knowledge bases. Credit is due for the real-building evaluation on MP-1/MP-3 and the inclusion of an external benchmark, as well as the explicit multi-agent error-correction design. The approach addresses a genuine accessibility gap, but its impact hinges on demonstrating that the reported gains stem from reliable spatial parsing rather than extra inference steps.

major comments (3)
  1. [Abstract, §4 (Evaluation)] The headline success rates on MP-1 and MP-3 are given without any description of the evaluation protocol, including total number of routes per length bin, route sampling procedure, number of independent trials, or the exact criteria used to judge a route 'successful' (e.g., hazard-free arrival, time limits, or human verification). With only ~13 routes per category, the 7–8 percentage-point margins over baselines cannot be assessed for statistical reliability or robustness.
  2. [Abstract, §4] No quantitative results, success rates, or error analysis are reported for the CVC-FP benchmark despite its explicit mention as an evaluation target. This omission prevents any assessment of generalization beyond the visual conventions of the two MP floors.
  3. [§3.1–3.2 (Multi-agent Parsing Pipeline)] The central claim that the self-correcting multi-agent module produces an accurate spatial knowledge graph and hazard map rests on the assumption that LLMs can reliably extract precise spatial relations (door swings, connectivity, stair/elevator symbols) without systematic errors. No failure-case breakdown, per-relation accuracy metrics, or ablation on the retry/feedback loop is provided to substantiate this assumption.
minor comments (2)
  1. [§3.1] The notation used for the spatial knowledge graph entities and relations should be formalized (e.g., via a small table or BNF) to improve reproducibility.
  2. [Figures 2–4] Figure captions and legends for parsed graph visualizations should explicitly label all node/edge types and hazard symbols.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our evaluation and analysis. We address each major comment point-by-point below and commit to substantial revisions that directly incorporate the requested details, metrics, and ablations.

Point-by-point responses
  1. Referee: [Abstract, §4 (Evaluation)] The headline success rates on MP-1 and MP-3 are given without any description of the evaluation protocol, including total number of routes per length bin, route sampling procedure, number of independent trials, or the exact criteria used to judge a route 'successful' (e.g., hazard-free arrival, time limits, or human verification). With only ~13 routes per category, the 7–8 percentage-point margins over baselines cannot be assessed for statistical reliability or robustness.

    Authors: We agree that the evaluation protocol was under-specified in the submitted version. In the revised manuscript we will add a dedicated subsection in §4 that explicitly states: (i) 13 routes per length bin (short: <20 m, medium: 20–50 m, long: >50 m), sampled by enumerating all feasible start–end pairs on the ground-truth graph and binning by shortest-path distance; (ii) the exact success criterion (the generated instruction sequence must produce a hazard-free path that reaches the destination when executed on the verified building graph, confirmed by two independent human annotators); (iii) three independent runs of the full pipeline per route to account for LLM stochasticity, with mean success rates and standard deviations reported; and (iv) a statistical comparison (McNemar's test or bootstrap confidence intervals) for the observed margins (see the McNemar sketch after this list). These additions will allow readers to assess reliability directly. revision: yes

  2. Referee: [Abstract, §4] No quantitative results, success rates, or error analysis are reported for the CVC-FP benchmark despite its explicit mention as an evaluation target. This omission prevents any assessment of generalization beyond the visual conventions of the two MP floors.

    Authors: We acknowledge the omission. Although the manuscript states that the system was evaluated on CVC-FP, we did not include the numerical results. In the revision we will insert a new paragraph in §4 reporting: overall parsing accuracy (node/edge F1), navigation success rates broken down by route length, and a brief error analysis contrasting failure modes on CVC-FP versus the MP floors. This will directly demonstrate generalization across different floor-plan visual styles (see the node/edge F1 sketch after this list). revision: yes

  3. Referee: [§3.1–3.2 (Multi-agent Parsing Pipeline)] The central claim that the self-correcting multi-agent module produces an accurate spatial knowledge graph and hazard map rests on the assumption that LLMs can reliably extract precise spatial relations (door swings, connectivity, stair/elevator symbols) without systematic errors. No failure-case breakdown, per-relation accuracy metrics, or ablation on the retry/feedback loop is provided to substantiate this assumption.

    Authors: We agree that stronger empirical support for the parsing module is required. The revised §3 will include: (1) a qualitative failure-case table with representative examples of initial extraction errors (e.g., incorrect door-swing direction, missed stair symbols) and how the retry/feedback loop corrects them; (2) per-relation precision/recall/F1 scores computed on a manually annotated subset of 20 floor-plan images (covering connectivity, door attributes, and symbol detection); and (3) an ablation that disables the self-correction loop and reports the resulting drop in graph accuracy and downstream navigation success. These additions will quantify the contribution of the multi-agent design (see the ablation sketch after this list). revision: yes
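Following up response 1: a minimal sketch of the exact McNemar comparison on paired per-route outcomes, assuming SciPy is available; the outcome vectors are illustrative placeholders, not the paper's data.

    # Exact McNemar's test on paired per-route outcomes (illustrative data only).
    # With ~13 routes per bin, the discordant pairs b + c are few, so the exact
    # binomial form is preferable to the chi-square approximation.
    from scipy.stats import binomtest

    ours     = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # hypothetical per-route successes
    baseline = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

    b = sum(1 for o, s in zip(ours, baseline) if o and not s)  # ours succeeds, baseline fails
    c = sum(1 for o, s in zip(ours, baseline) if s and not o)  # baseline succeeds, ours fails
    # Under H0 (no difference), b ~ Binomial(b + c, 0.5).
    p_value = binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0
    print(f"discordant pairs: b={b}, c={c}, two-sided p={p_value:.3f}")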
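For response 2: the promised node/edge F1 reduces to set overlap between predicted and gold graph elements. A minimal sketch; the edges shown are invented examples, and undirected edges are canonicalized as frozensets so direction is ignored.

    # Node/edge F1 against a manually annotated ground-truth graph (toy example).
    def set_f1(pred, gold):
        """F1 between two sets of graph elements (nodes or edges)."""
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)
        if tp == 0:
            return 0.0
        precision, recall = tp / len(pred), tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    pred_edges = {frozenset(e) for e in [("MP-101", "corridor-1"), ("corridor-1", "stair-A")]}
    gold_edges = {frozenset(e) for e in [("MP-101", "corridor-1"), ("corridor-1", "elevator-A")]}
    print(set_f1(pred_edges, gold_edges))  # 0.5: one of two edges matches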
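For response 3: the retry-loop ablation can reuse the extraction sketch given earlier, with max_retries=1 standing in for "self-correction disabled". run_navigation_eval is a hypothetical scoring helper, not a real API.

    # Hypothetical ablation harness: compare the full pipeline against a variant
    # with the self-correction loop disabled (a single extraction attempt).
    def ablate_self_correction(images, routes, call_llm, run_navigation_eval):
        results = {}
        for name, retries in [("full pipeline", 3), ("no self-correction", 1)]:
            graphs = []
            for img in images:
                try:
                    graphs.append(extract_with_feedback(img, call_llm, max_retries=retries))
                except RuntimeError:
                    graphs.append(None)  # a failed parse counts against this variant
            results[name] = run_navigation_eval(graphs, routes)
        return results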

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no derivations or fitted predictions

Full rationale

The paper presents a multi-agent LLM framework for parsing floor-plan images into a spatial knowledge graph and generating navigation instructions, evaluated empirically on real building floors (MP-1, MP-3) and the CVC-FP benchmark. Success rates are reported as direct measurements against single-call LLM baselines, with no equations, first-principles derivations, parameter fitting, or self-referential definitions. The claimed gains rest on external comparisons and real-world testing rather than any reduction of outputs to inputs by construction. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing elements. The work is therefore self-contained as a systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes reliable LLM spatial reasoning and hazard detection.

pith-pipeline@v0.9.0 · 5581 in / 1041 out tokens · 49730 ms · 2026-05-08T03:47:20.602538+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 12 canonical work pages · 6 internal anchors

  1. World Health Organization. World report on vision. WHO, 2019.
  2. B. Kuriakose, R. Shrestha, and F. Sandnes. Smartphone-based indoor navigation for the blind and visually impaired: A comprehensive review. Electronics, 11(8):1283, 2022.
  3. Y. Nakajima et al. Indoor navigation using ArUco markers and graph-based pathfinding. In Proc. IEEE ICCE, 2023.
  4. F. Zafari, A. Gkelias, and K. Leung. A survey of indoor localization systems and technologies. IEEE Commun. Surveys Tut., 21(3):2568–2599, 2019.
  5. J. Chen et al. MapGPT: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proc. ACL, 2024.
  6. B. Chen et al. SpatialVLM: Endowing vision-language models with spatial reasoning. In Proc. CVPR, 2024.
  7. Y. Hu, Z. Liu, and J. Han. Scalable and accurate graph reasoning with LLM-based multi-agents. arXiv:2410.05130, 2024.
  8. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. arXiv:2212.04356, 2022.
  9. L.-P. de las Heras, S. Terrades, S. Robles, and G. Sánchez. CVC-FP and SGT: A new database for structural floor plan analysis and its groundtruthing tool. International Journal on Document Analysis and Recognition (IJDAR), 18(1):15–30, 2015.
  10. A. Dubey et al. The Llama 3 herd of models. arXiv:2407.21783, 2024.
  11. K. Ganon, M. Alper, R. Mikulinsky, and H. Averbuch-Elor. WAFFLE: Multimodal floorplan understanding in the wild. In Proc. IEEE/CVF WACV, pp. 1488–1497, 2025.
  12. Y. Li, S. Ruwanpathirana, N. Mehta, and S. Karunasekera. Vision language models can parse floor plan maps. arXiv:2409.12842, 2024.
  13. A. Coffrini, P. Barsocchi, F. Furfari, A. Crivello, and A. Ferrari. LLM-guided indoor navigation with multimodal map understanding. In Proc. IPIN, pp. 1–6, 2025.
  14. A. Ayanzadeh and T. Oates. Floorplan2Guide: LLM-guided floorplan parsing for BLV indoor navigation. arXiv:2512.12177, 2025.
  15. A. Singh, R. Patel, and M. Kumar. ViLLA: Vision-language layout analyzer for automatic floor plan analysis. In Proc. DAS, 2024.
  16. P. Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS, 2020.
  17. D. Edge et al. From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130, 2024.
  18. Y. Ning, Y. Liu, and Y. Zheng. UrbanKGent: A unified LLM agent framework for urban knowledge graph construction. In Proc. NeurIPS, 2024.
  19. A. Madaan et al. Self-Refine: Iterative refinement with self-feedback. In Proc. NeurIPS, 2023.
  20. X. Yang et al. WAFFLE: Multimodal floor plan understanding with LLMs and VLMs. arXiv:2407.10320, 2024.
  21. R. Jiang et al. ChatHouseDiffusion: LLM-driven floor plan generation with diffusion and graph representation. arXiv:2405.13632, 2024.
  22. Y. Tian, Q. Ye, and D. Doermann. YOLOv12: Attention-centric real-time object detectors. arXiv:2502.12524, 2025.
  23. Anthropic. Claude 3.7 Sonnet model card, 2025.
  24. Chroma. Chroma: The AI-native open-source embedding database, 2024.
  25. A. Hagberg, P. Swart, and D. Chult. Exploring network structure, dynamics, and function using NetworkX. In Proc. SciPy, 2008.
  26. T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. In Proc. CVPR, 2023.
  27. R. K. Keser, I. Nallbani, N. Çalik, A. Ayanzadeh, and B. U. Töreyin. Graph embedding for link prediction using residual variational graph autoencoders. In Proc. SIU, pp. 1–4, 2020.
  28. I. Nallbani, R. K. Keser, A. Ayanzadeh, N. Çalık, and B. U. Töreyin. ResVGAE: Going deeper with residual modules for link prediction. arXiv:2105.00695, 2021.
  29. S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014.
  30. Z. Yang et al. MM-ReAct: Prompting ChatGPT for multimodal reasoning and action. arXiv:2303.11381, 2023.
  31. S. Yao et al. ReAct: Synergizing reasoning and acting in language models. In Proc. ICLR, 2023.
  32. G. Wang et al. Voyager: An open-ended embodied agent with large language models. In Proc. NeurIPS, 2023.
  33. W. Zhou et al. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In Proc. AAAI, 2024.
  34. Y. Liang, G. Li, and X. Chen. DiscussNav: Discussion improves zero-shot vision-and-language navigation via multi-expert consultation. In Proc. IEEE ICRA, 2024.
  35. M. Shabani, S. Lim, and Y. Furukawa. HouseDiffusion: Vector floorplan generation via a diffusion model with discrete and continuous denoising. In Proc. CVPR, 2023.
  36. Z. Fan, L. Zhu, H. Li, X. Chen, S. Zhu, and P. Tan. FloorPlanCAD: A large-scale CAD drawing dataset for panoptic symbol spotting. In Proc. ICCV, 2021.
  37. Y. Du et al. PP-OCR: A practical ultra lightweight OCR system. arXiv:2009.09941, 2020.

Appendix A (excerpt). Compact version of prompts for the Parser, Planner, self-critic, and Safety Evaluator agents. Parser Agent: "You are an AI analysing architectural floor plan images. TASK: Extract room-connectivity graph + metadata. Return ONLY valid JSON. {detection_c..."