{"paper":{"title":"LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Monocular RGB alone can build accurate dense open-vocabulary 3D scene graphs when rooms guide reconstruction order and global alignment.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Ayoung Kim, Christina Kassab, Hyeonjae Gil, Matías Mattamala, Maurice Fallon","submitted_at":"2026-05-13T16:19:02Z","abstract_excerpt":"Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LEXI-SG is the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input, achieving improved trajectory estimation and dense reconstruction on indoor scenes.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The method assumes that open-vocabulary foundation models can reliably partition the scene into rooms and that deferring reconstruction until each room is fully observed will eliminate scale inconsistencies without introducing other drift or alignment errors.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LEXI-SG is the first monocular RGB system for dense 
open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Monocular RGB alone can build accurate dense open-vocabulary 3D scene graphs when rooms guide reconstruction order and global alignment.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ac14111f065b8c30de59f5f8451a71731f0efb20b0ea96a33539e001e1cbc49e"},"source":{"id":"2605.13741","kind":"arxiv","version":1},"verdict":{"id":"d0a932c1-a143-45df-b5ca-55fe923f6eec","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:08:21.980310Z","strongest_claim":"LEXI-SG is the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input, achieving improved trajectory estimation and dense reconstruction on indoor scenes.","one_line_summary":"LEXI-SG is the first monocular RGB system for dense open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The method assumes that open-vocabulary foundation models can reliably partition the scene into rooms and that deferring reconstruction until each room is fully observed will eliminate scale inconsistencies without introducing other drift or alignment errors.","pith_extraction_headline":"Monocular RGB alone can build accurate dense open-vocabulary 3D scene graphs when rooms guide reconstruction order and global alignment."},"references":{"count":35,"sample":[{"doi":"","year":2024,"title":"Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot 
Navigation,","work_id":"2700e251-2beb-4f7a-ae16-12b54daa4d8c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning,","work_id":"7ac6b4ae-5d8b-48e7-b7b7-ed0d5da9c3fb","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs,","work_id":"0a5aa2a8-65ea-4a40-8681-15b13f10c857","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"OpenMask3D: Open-Vocabulary 3D Instance Segmentation,","work_id":"9b99ba4e-64b1-46d4-acd4-66a339e46f39","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Clio: Real-time Task-Driven Open-Set 3D Scene Graphs,","work_id":"f54519f9-378c-4443-b9bf-1e31e458d47f","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":35,"snapshot_sha256":"674d5b11982d299c2f8555fae6f411a777308446dc5f297996cbed32245eef9c","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d120db0c5c2eec90f5b632bdea1b0d70db372e7ed0065327c63c3b4a52178d9d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}