pith. machine review for the scientific record.

arxiv: 2605.13741 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links


LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:08 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: 3D scene graph · monocular mapping · open-vocabulary segmentation · feed-forward reconstruction · room partitioning · factor graph · dense SLAM · indoor mapping

The pith

Monocular RGB alone suffices to build accurate, dense, open-vocabulary 3D scene graphs when rooms guide reconstruction order and global alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LEXI-SG, a mapping system that builds hierarchical 3D scene graphs from a single RGB camera by first using open-vocabulary foundation models to divide the environment into rooms. Reconstruction of each room is deferred until the room is fully observed, which removes the scale drift common in continuous sliding-window methods. A room-based factor graph then aligns the separate room maps into one consistent global structure while preserving local consistency and the semantic hierarchy. The system also performs open-vocabulary object segmentation and tracking inside each room. On indoor datasets it produces better trajectories and denser maps than prior feed-forward SLAM and scene-graph baselines.
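
To make the pipeline concrete, here is a minimal sketch of the room-deferred loop, assuming the three stages the summary describes. The function names are hypothetical placeholders, not the paper's API (the paper uses DINO features for transition detection and MapAnything for reconstruction).

# Minimal sketch of the room-deferred mapping loop described above.
# `detect_room_transition` and `reconstruct_room` are hypothetical
# placeholders for the paper's DINO-based segmentation and
# MapAnything feed-forward reconstruction.

def map_sequence(frames, detect_room_transition, reconstruct_room):
    """Accumulate frames per room; reconstruct only on room transition."""
    rooms, batch = [], []
    for frame in frames:
        if batch and detect_room_transition(batch[-1], frame):
            # Room fully observed: one feed-forward pass yields per-frame
            # depths and poses in a local, scale-consistent room frame.
            rooms.append(reconstruct_room(batch))
            batch = []
        batch.append(frame)
    if batch:                      # flush the final room
        rooms.append(reconstruct_room(batch))
    return rooms                   # later aligned by the room factor graph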

Core claim

LEXI-SG shows that dense monocular 3D scene graphs become feasible when semantic priors partition the scene into rooms, each room is reconstructed feed-forward only after complete observation, and the resulting local maps are globally aligned by a room-based factor graph that naturally enforces the scene-graph hierarchy.

What carries the argument

Room-partitioned feed-forward reconstruction combined with a room-based factor graph that globally aligns local maps while enforcing semantic hierarchy.
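
What that factor graph does can be illustrated with a toy example. The sketch below aligns three room frames with a Sim(2) pose graph, a planar stand-in for the Sim(3) alignment a monocular system needs (scale is a free variable per room). The edge values are invented for illustration, and the solver is ordinary nonlinear least squares rather than the paper's formulation.

# Toy Sim(2) room-level pose graph. Room i has state (x, y, theta, log_s);
# edges carry measured relative transforms between adjacent rooms.
import numpy as np
from scipy.optimize import least_squares

def compose_inv(a, b):
    """Relative Sim(2) transform of room b expressed in room a's frame."""
    xa, ya, ta, sa = a
    xb, yb, tb, sb = b
    c, s = np.cos(-ta), np.sin(-ta)
    dx, dy = xb - xa, yb - ya
    return np.array([np.exp(-sa) * (c * dx - s * dy),
                     np.exp(-sa) * (s * dx + c * dy),
                     tb - ta, sb - sa])

edges = [(0, 1, np.array([2.0, 0.0, 0.1, 0.05])),   # room 0 -> room 1
         (1, 2, np.array([1.5, 0.5, -0.2, 0.00])),  # room 1 -> room 2
         (0, 2, np.array([3.4, 0.6, -0.1, 0.05]))]  # loop-closure edge

def residuals(x):
    poses = x.reshape(-1, 4)
    res = [compose_inv(poses[i], poses[j]) - z for i, j, z in edges]
    res.append(poses[0])            # gauge prior: pin room 0 at identity
    return np.concatenate(res)

x0 = np.zeros(3 * 4)                # 3 rooms, 4 DoF each
sol = least_squares(residuals, x0)
print(sol.x.reshape(-1, 4))

The gauge prior pins room 0 so the solution is unique, and the deliberately inconsistent loop-closure edge is what the optimizer distributes across rooms, which is the role global alignment plays here.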

If this is right

  • Trajectory estimation improves on indoor sequences relative to existing feed-forward SLAM methods.
  • Dense reconstruction quality rises because sliding-window scale inconsistencies are eliminated.
  • Open-vocabulary object segmentation and tracking operate inside each reconstructed room.
  • The semantic scene-graph hierarchy emerges directly from the room alignment process.
  • Scalable mapping on longer sequences becomes possible with only RGB input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots could perform semantic navigation in homes and offices using only a single inexpensive camera.
  • The same room-deferral idea could be tested in outdoor environments if analogous spatial partitions can be defined.
  • Newer or larger foundation models could be swapped in to raise segmentation accuracy without redesigning the mapping pipeline.
  • Global consistency across rooms may enable reliable loop closure even when visual overlap between rooms is low.

Load-bearing premise

Open-vocabulary foundation models can reliably identify consistent room boundaries so that per-room deferred reconstruction avoids creating new drift or alignment errors.
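
As a hedged illustration of what this premise demands, the sketch below flags candidate room transitions by watching per-frame foundation-model embeddings drift away from the recent room context. The window size and cosine threshold are hypothetical, and the paper's actual DINO-based segmentation is certainly more involved.

# `feats` is any (N, D) array of per-frame embeddings (the paper uses
# DINO features). Window and threshold are illustrative guesses.
import numpy as np

def room_transitions(feats, window=5, thresh=0.85):
    """Return frame indices where similarity to the recent room context
    drops, suggesting a doorway or corridor transition."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cuts = []
    for i in range(window, len(f)):
        context = f[i - window:i].mean(axis=0)
        context /= np.linalg.norm(context)
        if float(f[i] @ context) < thresh:   # low cosine similarity
            cuts.append(i)
    return cuts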

What would settle it

A long indoor sequence in which room partitions are ambiguous or incorrect, producing visible scale drift or misalignment once the room maps are joined in the global factor graph.

Figures

Figures reproduced from arXiv: 2605.13741 by Ayoung Kim, Christina Kassab, Hyeonjae Gil, Matías Mattamala, Maurice Fallon.

Figure 1. LEXI-SG is the first dense monocular mapping system to build open-vocabulary 3D scene graphs from RGB input alone. We first partition the incoming image stream room by room. Within each room, we jointly estimate camera trajectories and dense geometry using feed-forward reconstruction models, amortizing expensive model queries while ensuring local scale consistency. The room graph is then expanded to a full… view at source ↗
Figure 2. LEXI-SG System Overview. RGB frames are segmented into rooms using DINO features. Upon detecting a room transition, the accumulated batch is passed through a feed-forward model (MapAnything, labeled MapA in the figure) to produce per-frame depths and poses in a local room frame. Transition edges are estimated by feeding transition frame pairs through the same model. New rooms are checked for loop closures, with any… view at source ↗
Figure 3. Transition edge estimation. The relative transform T_{r_i r_j} between adjacent rooms is estimated by retrieving transition image pairs (p, q) and computing T_{pq} via a feed-forward reconstruction model. The edge set decomposes as E = E_RR ∪ E_RO: room-to-room edges e_{r_i r_j} ∈ E_RR connect neighboring room nodes and encode the relative transformation T_{r_i r_j} between their local reference frames, and room-to-object edges e_{r_i o_i} ∈ E_RO connect each obje… view at source ↗
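
If the caption's chaining is read literally, the transition edge composes three transforms. A minimal sketch under that assumption, using 4x4 homogeneous matrices with p posed in room i and q posed in room j:

# T_{r_i p}: pose of frame p in room i; T_{r_j q}: pose of frame q in
# room j; T_{pq}: the model-estimated relative transform between the
# transition pair. This composition is an assumption consistent with
# the caption, not the paper's exact formulation.
import numpy as np

def transition_edge(T_ri_p, T_pq, T_rj_q):
    """T_{r_i r_j} = T_{r_i p} @ T_{pq} @ inv(T_{r_j q})."""
    return T_ri_p @ T_pq @ np.linalg.inv(T_rj_q)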
Figure 4. (Full-page figure; no caption text extracted.) view at source ↗
Figure 5. Qualitative room segmentation results on the AOD and HM3D sequences. Our approach reliably delineates room boundaries by detecting transitional structures (such as doorways and corridors) from RGB input alone, without relying on a depth sensor or geometric priors. On the AOD sequences, MASt3R-SLAM achieves the second-best performance. The performance of LEXI-SG on ground floor sequences 1 and 2 is reduced b… view at source ↗
Figure 6. Visualization of the OpenLex3D Benchmark on the 00824 sequence of the HM3D dataset using ground-truth poses and depth. Our object segmentation module shows better performance in the synonyms category (green) and fewer incorrect labels compared to other baselines, reflecting local as well as global geometric consistency. These improvements stem from LEXI-SG reconstructing each room in a single pass. In contrast, sliding-window a… view at source ↗
Original abstract

Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D and self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. LEXI-SG presents the first dense monocular RGB-only system for open-vocabulary 3D scene graph mapping. It uses open-vocabulary foundation models to partition the scene into rooms, defers feed-forward reconstruction until each room is fully observed to avoid sliding-window scale drift, and employs a room-based factor-graph formulation for global alignment that preserves local consistency and encodes the semantic hierarchy. Within rooms it adds open-vocabulary object segmentation and tracking. The method is evaluated on Habitat-Matterport 3D and self-collected egocentric office sequences, reporting improved trajectory estimation and dense reconstruction relative to feed-forward SLAM and scene-graph baselines.

Significance. If the room-guided deferral and factor-graph alignment prove robust, the work would constitute a meaningful step toward scalable, sensor-light semantic mapping for robotics. The explicit use of foundation-model priors for hierarchical decomposition and the avoidance of depth sensors address practical constraints in many indoor deployments; reproducible code and project data further strengthen the contribution.

major comments (3)
  1. [§3.2] Room Partitioning: The central claim that deferring reconstruction until rooms are 'fully observed' eliminates scale inconsistencies rests on the unvalidated assumption that open-vocabulary foundation models produce reliable room boundaries. No quantitative segmentation accuracy, failure-mode analysis, or sensitivity study for ambiguous walls, doorways, or partial views is reported, even though such failures would propagate directly into the factor-graph alignment (see the load-bearing premise above).
  2. [§4] Room-Based Factor Graph: The global alignment formulation assumes clean room boundaries and complete observations; the manuscript provides no experiments quantifying inter-room misalignment or residual drift when segmentation is imperfect. This directly affects the scalability and 'parameter-free' claims for monocular input.
  3. [Evaluation] While improved trajectory and reconstruction metrics are asserted versus baselines, the paper supplies no ablation isolating the contribution of room-deferred reconstruction versus standard sliding-window or global optimization, nor statistical significance across multiple runs, making it impossible to confirm that gains arise from the proposed mechanism rather than implementation details.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the qualitative results are occasionally underspecified (e.g., units for reconstruction error).
  2. [Related Work] Related-work discussion omits several recent monocular scene-graph and open-vocabulary mapping papers from 2023-2024; a brief comparison table would clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our contributions. Below, we provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: [§3.2] Room Partitioning: The central claim that deferring reconstruction until rooms are 'fully observed' eliminates scale inconsistencies rests on the unvalidated assumption that open-vocabulary foundation models produce reliable room boundaries. No quantitative segmentation accuracy, failure-mode analysis, or sensitivity study for ambiguous walls, doorways, or partial views is reported, even though such failures would propagate directly into the factor-graph alignment.

    Authors: We acknowledge that the reliability of room partitioning is crucial to our approach. While we rely on state-of-the-art open-vocabulary models such as those based on CLIP and SAM for semantic segmentation, we agree that additional validation would strengthen the paper. In the revised manuscript, we will include quantitative metrics for room segmentation accuracy on the evaluation datasets, along with a failure-mode analysis for cases involving ambiguous boundaries. This will demonstrate the robustness of the partitioning step and its impact on the overall system. revision: partial

  2. Referee: [§4] Room-Based Factor Graph: The global alignment formulation assumes clean room boundaries and complete observations; the manuscript provides no experiments quantifying inter-room misalignment or residual drift when segmentation is imperfect. This directly affects the scalability and 'parameter-free' claims for monocular input.

    Authors: The room-based factor graph is designed to handle potential imperfections by optimizing over the semantic hierarchy and using relative constraints between rooms. However, we agree that explicit quantification of misalignment under imperfect segmentation would be valuable. We will add experiments in the revision that simulate or analyze cases with noisy room boundaries, reporting inter-room drift metrics to support the scalability claims. revision: yes

  3. Referee: [Evaluation] While improved trajectory and reconstruction metrics are asserted versus baselines, the paper supplies no ablation isolating the contribution of room-deferred reconstruction versus standard sliding-window or global optimization, nor statistical significance across multiple runs, making it impossible to confirm that gains arise from the proposed mechanism rather than implementation details.

    Authors: We appreciate this point regarding the need for ablations. The current evaluation compares against feed-forward SLAM and scene-graph baselines, showing improvements attributable to our room-guided approach. To isolate the contribution, we will include an ablation study in the revised version comparing room-deferred reconstruction against sliding-window variants. Additionally, we will report results with standard deviations across multiple runs to provide statistical context. revision: yes
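
For concreteness, the multi-run statistics promised in response 3 amount to something like the following sketch: absolute trajectory error (ATE RMSE) per run, summarized as mean and sample standard deviation. Trajectory alignment (e.g. Umeyama) is omitted for brevity, and the helper names are hypothetical.

# Trajectories are assumed pre-aligned (N, 3) arrays of positions.
import numpy as np

def ate_rmse(est, gt):
    """RMSE of translational error between aligned trajectories."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

def report(runs, gt):
    """Mean and sample std of ATE RMSE across repeated runs."""
    errs = np.array([ate_rmse(est, gt) for est in runs])
    return errs.mean(), errs.std(ddof=1)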

Circularity Check

0 steps flagged

No significant circularity in LEXI-SG system description

full rationale

The paper presents an engineering system for monocular scene graph mapping that partitions scenes using external open-vocabulary foundation models, defers per-room feed-forward reconstruction, and applies a standard room-based factor-graph optimizer. No equations, parameter fits, or derivations are shown that reduce the claimed trajectory or reconstruction improvements to quantities defined or fitted inside the same pipeline. The approach relies on independent external models and established SLAM techniques rather than self-referential loops, self-citation chains, or ansatzes smuggled from prior author work. This is the common honest outcome for a systems paper whose central claims rest on empirical validation against external benchmarks rather than internal re-derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the reliability of pre-trained open-vocabulary models for room segmentation and on standard factor-graph optimization; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Open-vocabulary foundation models supply sufficiently accurate semantic priors to partition indoor scenes into rooms.
    Invoked to defer reconstruction until each room is fully observed.
  • standard math: Standard factor-graph optimization can globally align per-room reconstructions while preserving local consistency.
    Used to impose the semantic scene-graph hierarchy.

pith-pipeline@v0.9.0 · 5567 in / 1361 out tokens · 40530 ms · 2026-05-14T18:08:21.980310+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear

    Paper passage matched against the cited Recognition theorem:

    Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed—enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Paper passage matched against the cited Recognition theorem:

    A Sim(3) room-level factor graph that globally aligns per-room reconstructions while preserving local consistency and correcting monocular scale ambiguity

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] A. Werby et al., "Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation," Robot.: Sci. Syst., 2024
  2. [2] Q. Gu et al., "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning," in IEEE Int. Conf. Robot. Autom. (ICRA), 2024
  3. [3] A. Rosinol et al., "Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs," Int. J. Robot. Res., 2021
  4. [4] A. Takmaz et al., "OpenMask3D: Open-Vocabulary 3D Instance Segmentation," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2023
  5. [5] D. Maggio et al., "Clio: Real-time Task-Driven Open-Set 3D Scene Graphs," IEEE Robot. Autom. Lett., vol. 9, no. 10, pp. 8921–8928, 2024
  6. [6] C. Campos et al., "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, 2021
  7. [7] T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018
  8. [8] A. J. Davison, I. D. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-Time Single Camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, 2007
  9. [9] J. McCormac et al., "Fusion++: Volumetric Object-Level SLAM," in Intl. Conf. 3D Vision (3DV), 2018, pp. 32–41
  10. [10] R. F. Salas-Moreno et al., "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013
  11. [11] J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025
  12. [12] V. Leroy, Y. Cabon, and J. Revaud, "Grounding Image Matching in 3D with MASt3R," in Eur. Conf. Comput. Vis. (ECCV), 2024
  13. [13] N. Keetha et al., "MapAnything: Universal Feed-Forward Metric 3D Reconstruction," in Intl. Conf. 3D Vision (3DV), IEEE, 2026
  14. [14] R. Murai, E. Dexheimer, and A. J. Davison, "MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025
  15. [15] D. Maggio, H. Lim, and L. Carlone, "VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold," Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 39, 2025
  16. [16] G. Zhang, S. Qian, X. Wang, and D. Cremers, "ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association," CoRR, vol. abs/2509.01584, 2025
  17. [17] A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," Int. Conf. Mach. Learn. (ICML), 2021
  18. [18] N. Ravi et al., "SAM 2: Segment Anything in Images and Videos," in Intl. Conf. on Learning Representations (ICLR), 2025
  19. [19] O. Siméoni et al., "DINOv3," CoRR, vol. abs/2508.10104, 2025
  20. [20] S. Lu et al., "OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data," in Conf. on Robot Learning (CoRL), PMLR, 2023, pp. 1610–1620
  21. [21] S. Peng et al., "OpenScene: 3D Scene Understanding with Open Vocabularies," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023
  22. [22] K. Jatavallabhula et al., "ConceptFusion: Open-set Multimodal 3D Mapping," in Robot.: Sci. Syst., 2023
  23. [23] J. Kerr et al., "LERF: Language Embedded Radiance Fields," in IEEE Int. Conf. Comput. Vis. (ICCV), 2023, pp. 19672–19682
  24. [24] O. Alama et al., "RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration," in IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2025
  25. [25] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM," in Eur. Conf. Comput. Vis. (ECCV), 2014
  26. [26] K. Deng et al., "VGGT-Long: Chunk it, Loop it, Align it - Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences," CoRR, vol. abs/2507.16443, 2025
  27. [27] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D Object SLAM," IEEE Trans. Robot., vol. 35, no. 4, pp. 925–938, 2019
  28. [28] Y. Zhang et al., "Recognize Anything: A Strong Image Tagging Model," in IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 1724–1732
  29. [29] S. Liu et al., "Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection," in Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 38–55
  30. [30] D. Maggio and L. Carlone, "VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction," CoRR, vol. abs/2601.19887, 2026
  31. [31] N. Hughes, Y. Chang, and L. Carlone, "Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization," in Robot.: Sci. Syst., 2022
  32. [32] J. Sturm et al., "A Benchmark for the Evaluation of RGB-D SLAM Systems," in IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2012
  33. [33] S. K. Ramakrishnan et al., "Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2021
  34. [34] A. Krishnan et al., "Benchmarking Egocentric Visual-Inertial SLAM at City Scale," in IEEE Int. Conf. Comput. Vis. (ICCV), 2025
  35. [35] C. Kassab et al., "OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations," Adv. Neural Inf. Process. Syst. (NeurIPS), 2025