Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Andreea Bobu; Aryan Naveen; Jason Xinyu Liu; Luca Carlone

arxiv: 2606.08666 · v1 · pith:F5UXZDKHnew · submitted 2026-06-07 · 💻 cs.RO

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Aryan Naveen , Jason Xinyu Liu , Luca Carlone , Andreea Bobu This is my paper

Pith reviewed 2026-06-27 18:24 UTC · model grok-4.3

classification 💻 cs.RO

keywords language sensor modelspatial belief estimation3D scene mappingcalibrated uncertaintyvision-language fusionVL-Mapmultimodal distributionsrobot localization

0 comments

The pith

Language descriptions can be converted into calibrated multimodal spatial distributions and fused with robot perception to improve 3D target localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a Language Sensor Model that takes an utterance plus scene-graph context and outputs a mixture of Gaussians whose weights capture which object is meant and whose covariances capture where it lies. These outputs are treated as noisy observations inside the VL-Map framework and combined with camera and other onboard data inside a single probabilistic map. On the VLA-3D benchmark and a physical robot, only this model keeps its reported covariances calibrated; the resulting maps place roughly 70 percent more probability mass on the true target than the best foundation-model baseline. A reader would care because robots routinely receive verbal references to areas they cannot yet see, and treating those references as properly quantified sensor data lets the robot update its beliefs without waiting for direct observation.

Core claim

By training LSM to map language and scene context to mixture distributions that separately encode referential ambiguity and spatial uncertainty, then fusing those distributions as independent stochastic observations inside VL-Map, language becomes a usable sensor modality whose calibrated covariances allow more accurate recovery of target object locations than vision-only or uncalibrated language baselines.

What carries the argument

Language Sensor Model (LSM), a network that produces mixture weights for referential ambiguity and component covariances for spatial uncertainty, enabling probabilistic fusion inside VL-Map.

If this is right

Robots can incorporate out-of-view spatial references from human speech directly into their metric-semantic maps.
Fused language predictions place approximately 70 percent more probability mass on the correct target than the strongest uncalibrated baseline.
Among tested language predictors, only LSM produces covariance estimates that remain inside the calibrated regime on VLA-3D and real-robot data.
The same fusion framework unifies language with any other sensor whose uncertainty can be expressed as a spatial distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same calibration approach could be applied to other verbal or textual sources such as written notes or overheard conversations if they can be grounded to the same scene graph.
Treating language as one more sensor modality suggests that future mapping systems might optimize sensor selection or exploration policies by comparing expected information gain across vision, language, and touch.
If the independence assumption holds across multiple speakers, repeated descriptions of the same scene could be fused to reduce uncertainty without additional robot motion.

Load-bearing premise

The multimodal distributions output by LSM act as independent observations whose uncertainty is correctly quantified and can be added to the robot's perception model without systematic bias.

What would settle it

A held-out set of utterances where the true object location falls outside the reported 95 percent covariance contours at a rate far higher or lower than 5 percent, or where adding the language observations decreases rather than increases probability mass on the ground-truth target.

Figures

Figures reproduced from arXiv: 2606.08666 by Andreea Bobu, Aryan Naveen, Jason Xinyu Liu, Luca Carlone.

**Figure 1.** Figure 1: Language as a complementary sensing modality. A robot navigating a scene receives the utterance “I left my backpack on the table.” The Language Sensor Model (LSM) maps this utterance and the prior scene graph into a calibrated spatial distribution, which VL-Map fuses with streaming visual observations. By treating the utterance as a stochastic spatial observation, VLMap immediately places probabilistic ma… view at source ↗

**Figure 2.** Figure 2: The Language Sensor Model (LSM) instantiates Eq. 1 as a learned Gaussian mixture over the target’s 3D position. Top: the prior scene graph m− and a language observation Lj are passed to a hypothesis proposer, a Large Language Model, that enumerates anchor candidates (each scoped to an anchor object and region) and assigns each a confidence; for every hypothesis hi , the Language Sensor Transformer fuses ob… view at source ↗

**Figure 3.** Figure 3: Grounding performance as semantic ambiguity grows. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Belief evolution when each language predictor is fused with streaming visual observations. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: LSM component ablations on val seen after 50 epochs. Removing any single component degrades performance relative to the full LSM. D Hypothesis Generation LLM Prompt The system prompt used for the hypothesis-proposer LLM call in Section 4: You are a spatial reasoning module. You are given: A scene graph describing a region and its objects. A natural language utterance introducing a new object. Your task is… view at source ↗

**Figure 6.** Figure 6: Example BEV prompt: point cloud coloured by semantics, anchor object in magenta. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Per-relation grounding accuracy versus calibration on unambiguous utterances. LSM is the only method that clusters near the calibrated optimum (ANEES = 3, RMSE = 0) across all six spatial relations, while the foundation-model baselines collapse on extent-heavy relations such as near, between, and above, where the language most underdetermines the valid region. Each method contributes six points (one per r… view at source ↗

read the original abstract

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LSM turns language into calibrated multimodal distributions that fuse usefully into VL-Map and improve target localization on VLA-3D and a real robot.

read the letter

The main point is that this work gives a practical route for folding natural-language spatial descriptions into metric-semantic maps as calibrated observations. LSM maps an utterance plus scene-graph context to a Gaussian mixture whose weights capture referential ambiguity and whose covariances capture spatial uncertainty. VL-Map then treats those outputs as stochastic observations and fuses them with onboard perception in one belief map.

What the paper does well is the calibration step and the end-to-end empirical result. LSM is the only language model they test whose predicted covariances stay consistent with observed errors; when fused, it puts roughly 70 percent more probability mass on the true target location than the strongest foundation-model baseline. They also run the system on a physical mobile robot, which adds a useful reality check.

The construction itself looks new: a dedicated language-to-distribution model rather than direct use of off-the-shelf VLMs, plus explicit probabilistic fusion inside an existing mapping framework. The citation pattern is standard and appropriate for the sub-area.

The soft spots are mostly about missing detail rather than outright flaws. The abstract does not spell out the exact calibration metric, the training objective, or whether the data splits were locked before looking at results, so the strength of the calibration claim rests on the methods section. The fusion step assumes the language observations are sufficiently independent of the visual ones; that may hold in their experiments but could be sensitive in other settings. Neither issue appears load-bearing on the information given.

This is for people working on semantic mapping, language grounding, or human-robot interaction in indoor environments. A reader who needs a concrete way to turn vague spatial language into map updates will find usable pieces here. The combination of model, calibration evidence, and real-robot demo is solid enough to merit a serious referee.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Language Sensor Model (LSM), which maps natural-language utterances plus scene-graph context to multimodal spatial distributions (mixture weights for referential ambiguity, component covariances for spatial uncertainty). It further defines VL-Map, a probabilistic fusion framework that treats LSM outputs as stochastic observations to be combined with onboard perception inside a unified belief map. On the VLA-3D benchmark and a real mobile robot, LSM is reported as the only language predictor whose covariances stay calibrated, yielding approximately 70% more probability mass on the true target location than the strongest foundation-model baseline.

Significance. If the calibration and fusion claims are substantiated, the work offers a concrete route to treating language as a calibrated sensor modality that can be fused with metric-semantic maps, addressing a recognized gap when targets lie outside the robot’s field of view. The explicit modeling of ambiguity via mixture weights and the probabilistic observation model are strengths that, if validated with proper uncertainty quantification, could influence downstream belief-update pipelines.

major comments (2)

[Abstract] Abstract: the central quantitative claim (~70% more probability mass on the true target) is presented without error bars, confidence intervals, or a description of the statistical test used; because this figure is the primary empirical support for the superiority of VL-Map + LSM, the absence of uncertainty quantification is load-bearing.
[Abstract] Abstract: the claim that LSM covariances “remain within the calibrated regime” is not accompanied by any definition or measurement procedure (e.g., expected calibration error, reliability diagrams, or binning method); without this information the calibration assertion cannot be verified and directly underpins the assertion that language predictions can be treated as unbiased stochastic observations.

minor comments (1)

[Abstract] The abstract does not indicate whether the VLA-3D data splits were pre-specified or whether any hyper-parameter tuning was performed on the evaluation set; a single sentence clarifying the evaluation protocol would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. Both major comments concern the abstract's presentation of key claims; we agree these points merit clarification and will revise the abstract text accordingly while preserving its brevity.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claim (~70% more probability mass on the true target) is presented without error bars, confidence intervals, or a description of the statistical test used; because this figure is the primary empirical support for the superiority of VL-Map + LSM, the absence of uncertainty quantification is load-bearing.

Authors: We agree that the abstract should reference the supporting analysis. The full results, including error bars computed across multiple trials on the VLA-3D benchmark and the associated statistical comparison, appear in Section 5.2. We will revise the abstract to append a parenthetical directing readers to these details (e.g., "approximately 70% more probability mass on the true target (Section 5.2)"). revision: yes
Referee: [Abstract] Abstract: the claim that LSM covariances “remain within the calibrated regime” is not accompanied by any definition or measurement procedure (e.g., expected calibration error, reliability diagrams, or binning method); without this information the calibration assertion cannot be verified and directly underpins the assertion that language predictions can be treated as unbiased stochastic observations.

Authors: We acknowledge that the abstract does not define the calibration metric. Section 4.3 of the manuscript specifies the procedure via expected calibration error (ECE) together with reliability diagrams and binning details. We will revise the abstract to include a concise qualifier (e.g., "whose covariance estimates remain calibrated (low ECE, Section 4.3)") so that the claim is traceable to the reported methodology. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper trains LSM on data to produce multimodal distributions from language and scene graphs, then fuses them probabilistically into VL-Map with onboard perception; claims rest on held-out evaluation on VLA-3D and real-robot experiments showing calibration and improved localization versus baselines. No equations, self-citations, or derivations are shown that reduce any prediction or uniqueness claim to a fitted input or prior author result by construction. The pipeline is self-contained against external benchmarks with no load-bearing self-referential steps visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the central claim rests on unstated modeling assumptions about how language observations are generated and how their uncertainty can be quantified independently of the fusion step. No free parameters, axioms, or invented entities can be audited from the abstract alone.

pith-pipeline@v0.9.1-grok · 5763 in / 1204 out tokens · 13559 ms · 2026-06-27T18:24:33.210079+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 17 canonical work pages · 1 internal anchor

[1]

C. E. Peacock and A. D. Ekstrom. Verbal cues flexibly transform spatial representations in human memory.Memory, 27(4):465–479, 2019

2019
[2]

Munnich, B

E. Munnich, B. Landau, and B. A. Dosher. Spatial language and spatial representa- tion: a cross-linguistic comparison.Cognition, 81(3):171–208, 2001. ISSN 0010-0277. doi:https://doi.org/10.1016/S0010-0277(01)00127-5. URLhttps://www.sciencedirect. com/science/article/pii/S0010027701001275

work page doi:10.1016/s0010-0277(01)00127-5 2001
[3]

Hermer-Vazquez, E

L. Hermer-Vazquez, E. S. Spelke, and A. S. Katsnelson. Sources of flexibility in hu- man cognition: Dual-task studies of space and language.Cognitive Psychology, 39(1):3– 36, 1999. ISSN 0010-0285. doi:https://doi.org/10.1006/cogp.1998.0713. URLhttps: //www.sciencedirect.com/science/article/pii/S0010028598907137

work page doi:10.1006/cogp.1998.0713 1999
[4]

Rosinol, M

A. Rosinol, M. Abate, Y . Chang, and L. Carlone. Kimera: an open-source library for real- time metric-semantic localization and mapping. In2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, pages 1689–

2020
[5]

doi:10.1109/ICRA40945.2020.9196885

IEEE, 2020. doi:10.1109/ICRA40945.2020.9196885. URLhttps://doi.org/10. 1109/ICRA40945.2020.9196885

work page doi:10.1109/icra40945.2020.9196885 2020
[6]

Hughes, Y

N. Hughes, Y . Chang, S. Hu, R. Talak, R. Abdulhai, J. Strader, and L. Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems.The International Journal of Robotics Research, 2024. doi:10.1177/02783649241229725. URL https://doi.org/10.1177/02783649241229725

work page doi:10.1177/02783649241229725 2024
[7]

S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas. Probabilistic data association for semantic SLAM. In2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, May 29 - June 3, 2017, pages 1722–1729. IEEE, 2017. doi: 10.1109/ICRA.2017.7989203. URLhttps://doi.org/10.1109/ICRA.2017.7989203

work page doi:10.1109/icra.2017.7989203 2017
[8]

Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 2899–2909. IEEE, 2023. doi:10.1109/ ICCV51070.2023.00272. URLhttps://doi.org/10.1109/ICCV51070.2023.00272

work page doi:10.1109/iccv51070.2023.00272 2023
[9]

Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3d- llm: Injecting the 3d world into large language models. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Informa- tion Processing Systems 2023, NeurIPS 2023, New Orleans, ...

2023
[10]

D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in RGB- D scans using natural language. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, ed- itors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XX, Lecture Notes in Computer Science, pages 202–221. Springer, 2020. doi:1...

work page doi:10.1007/978-3-030-58565-5 2020
[11]

A ConvNet for the 2020s , booktitle =

S. Huang, Y . Chen, J. Jia, and L. Wang. Multi-view transformer for 3d visual ground- ing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15503–15512. IEEE, 2022. doi:10.1109/ CVPR52688.2022.01508. URLhttps://doi.org/10.1109/CVPR52688.2022.01508. 9

work page doi:10.1109/cvpr52688.2022.01508 2022
[12]

J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. D. Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 3772–

2025
[13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

Computer Vision Foundation / IEEE, 2025. doi:10.1109/CVPR52734.2025.00357. URLhttps://openaccess.thecvf.com/content/CVPR2025/html/Deng_3D-LLaVA_ Towards_Generalist_3D_LMMs_with_Omni_Superpoint_Transformer_CVPR_2025_ paper.html

work page doi:10.1109/cvpr52734.2025.00357 2025
[14]

B. Jia, Y . Chen, H. Yu, Y . Wang, X. Niu, T. Liu, Q. Li, and S. Huang. Sceneverse: Scal- ing 3d vision-language learning for grounded scene understanding. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, editors,Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Pa...

work page doi:10.1007/978-3-031-72673-6_16 2024
[15]

Conditional convolutions for instance segmentation,

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas. Referit3d: Neu- ral listeners for fine-grained 3d object identification in real-world scenes. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, Lecture Notes in Com- pu...

work page doi:10.1007/978-3-030-58452-8 2020
[16]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield- Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. E. Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y . Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2022
[17]

Xiong, Z

M. Xiong, Z. Hu, X. Lu, Y . Li, J. Fu, J. He, and B. Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024
[18]

URLhttps://openreview.net/forum?id=gjeQKFxFpZ
[19]

Kumar, C

B. Kumar, C. Lu, G. Gupta, A. Palepu, D. Bellamy, R. Raskar, and A. Beam. Conformal pre- diction with large language models for multi-choice question answering. InICML Workshop on Neural Conversational AI, 2023

2023
[20]

Mohri and T

C. Mohri and T. Hashimoto. Language models with conformal factuality guarantees. In R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceedings of Machine Learning Research, pages 36029– 36047. PM...

2024
[21]

M. R. Walter, S. Patki, A. F. Daniele, E. Fahnestock, F. Duvallet, S. Hemachandra, J. Oh, A. Stentz, N. Roy, and T. M. Howard. Language understanding for field and service robots in a priori unknown environments.Field Robotics, 2(1):1191–1231, 2022. doi:10.55417/FR. 2022040. URLhttps://doi.org/10.55417/fr.2022040

work page doi:10.55417/fr 2022
[22]

D. Kim, N. Oh, D. Hwang, and D. Park. Lingo-space: Language-conditioned incremen- tal grounding for space. In M. J. Wooldridge, J. G. Dy, and S. Natarajan, editors,Thirty- Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Ed- ucati...

work page doi:10.1609/aaai.v38i9.28898 2024
[23]

J. A. Nader, D. Lee, N. Dennler, and A. Bobu. Quicklap: Quick language-action preference learning for semi-autonomous agents, 2026. URLhttps://arxiv.org/abs/2511.17855

Pith/arXiv arXiv 2026
[24]

Shirwatkar, N

H. Zhang, N. Zantout, P. Kachana, J. Zhang, and W. Wang. Iref-vla: A benchmark for interactive referential grounding with imperfect language in 3d scenes. InIEEE Interna- tional Conference on Robotics and Automation, ICRA 2025, Atlanta, GA, USA, May 19- 23, 2025, pages 1677–1683. IEEE, 2025. doi:10.1109/ICRA55743.2025.11127464. URL https://doi.org/10.1109...

work page doi:10.1109/icra55743.2025.11127464 2025
[25]

A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3d: Learning from RGB-D data in indoor environments. In2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pages 667–676. IEEE Computer Society, 2017. doi:10.1109/3DV .2017.00081. URLhttps: //doi.org/10....

work page doi:10.1109/3dv 2017
[26]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2432–2443. IEEE Computer Society, 2017. doi:10.1109/CVPR.2017.261. URLhttps:// doi.org/10.1109/CV...

work page doi:10.1109/cvpr.2017.261 2017
[27]

there is a <target object> <relation> <existing object(s)>

A. Dehghan, G. Baruch, Z. Chen, Y . Feigin, P. Fu, T. Gebauer, D. Kurz, T. Dimry, B. Joffe, A. Schwartz, and E. Shulman. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...

2021

[1] [1]

C. E. Peacock and A. D. Ekstrom. Verbal cues flexibly transform spatial representations in human memory.Memory, 27(4):465–479, 2019

2019

[2] [2]

Munnich, B

E. Munnich, B. Landau, and B. A. Dosher. Spatial language and spatial representa- tion: a cross-linguistic comparison.Cognition, 81(3):171–208, 2001. ISSN 0010-0277. doi:https://doi.org/10.1016/S0010-0277(01)00127-5. URLhttps://www.sciencedirect. com/science/article/pii/S0010027701001275

work page doi:10.1016/s0010-0277(01)00127-5 2001

[3] [3]

Hermer-Vazquez, E

L. Hermer-Vazquez, E. S. Spelke, and A. S. Katsnelson. Sources of flexibility in hu- man cognition: Dual-task studies of space and language.Cognitive Psychology, 39(1):3– 36, 1999. ISSN 0010-0285. doi:https://doi.org/10.1006/cogp.1998.0713. URLhttps: //www.sciencedirect.com/science/article/pii/S0010028598907137

work page doi:10.1006/cogp.1998.0713 1999

[4] [4]

Rosinol, M

A. Rosinol, M. Abate, Y . Chang, and L. Carlone. Kimera: an open-source library for real- time metric-semantic localization and mapping. In2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, pages 1689–

2020

[5] [5]

doi:10.1109/ICRA40945.2020.9196885

IEEE, 2020. doi:10.1109/ICRA40945.2020.9196885. URLhttps://doi.org/10. 1109/ICRA40945.2020.9196885

work page doi:10.1109/icra40945.2020.9196885 2020

[6] [6]

Hughes, Y

N. Hughes, Y . Chang, S. Hu, R. Talak, R. Abdulhai, J. Strader, and L. Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems.The International Journal of Robotics Research, 2024. doi:10.1177/02783649241229725. URL https://doi.org/10.1177/02783649241229725

work page doi:10.1177/02783649241229725 2024

[7] [7]

S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas. Probabilistic data association for semantic SLAM. In2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, May 29 - June 3, 2017, pages 1722–1729. IEEE, 2017. doi: 10.1109/ICRA.2017.7989203. URLhttps://doi.org/10.1109/ICRA.2017.7989203

work page doi:10.1109/icra.2017.7989203 2017

[8] [8]

Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 2899–2909. IEEE, 2023. doi:10.1109/ ICCV51070.2023.00272. URLhttps://doi.org/10.1109/ICCV51070.2023.00272

work page doi:10.1109/iccv51070.2023.00272 2023

[9] [9]

Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan. 3d- llm: Injecting the 3d world into large language models. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Informa- tion Processing Systems 2023, NeurIPS 2023, New Orleans, ...

2023

[10] [10]

D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in RGB- D scans using natural language. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, ed- itors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XX, Lecture Notes in Computer Science, pages 202–221. Springer, 2020. doi:1...

work page doi:10.1007/978-3-030-58565-5 2020

[11] [11]

A ConvNet for the 2020s , booktitle =

S. Huang, Y . Chen, J. Jia, and L. Wang. Multi-view transformer for 3d visual ground- ing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15503–15512. IEEE, 2022. doi:10.1109/ CVPR52688.2022.01508. URLhttps://doi.org/10.1109/CVPR52688.2022.01508. 9

work page doi:10.1109/cvpr52688.2022.01508 2022

[12] [12]

J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. D. Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 3772–

2025

[13] [13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

Computer Vision Foundation / IEEE, 2025. doi:10.1109/CVPR52734.2025.00357. URLhttps://openaccess.thecvf.com/content/CVPR2025/html/Deng_3D-LLaVA_ Towards_Generalist_3D_LMMs_with_Omni_Superpoint_Transformer_CVPR_2025_ paper.html

work page doi:10.1109/cvpr52734.2025.00357 2025

[14] [14]

B. Jia, Y . Chen, H. Yu, Y . Wang, X. Niu, T. Liu, Q. Li, and S. Huang. Sceneverse: Scal- ing 3d vision-language learning for grounded scene understanding. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, editors,Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Pa...

work page doi:10.1007/978-3-031-72673-6_16 2024

[15] [15]

Conditional convolutions for instance segmentation,

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas. Referit3d: Neu- ral listeners for fine-grained 3d object identification in real-world scenes. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, Lecture Notes in Com- pu...

work page doi:10.1007/978-3-030-58452-8 2020

[16] [16]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield- Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. E. Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y . Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2022

[17] [17]

Xiong, Z

M. Xiong, Z. Hu, X. Lu, Y . Li, J. Fu, J. He, and B. Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024

[18] [18]

URLhttps://openreview.net/forum?id=gjeQKFxFpZ

[19] [19]

Kumar, C

B. Kumar, C. Lu, G. Gupta, A. Palepu, D. Bellamy, R. Raskar, and A. Beam. Conformal pre- diction with large language models for multi-choice question answering. InICML Workshop on Neural Conversational AI, 2023

2023

[20] [20]

Mohri and T

C. Mohri and T. Hashimoto. Language models with conformal factuality guarantees. In R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceedings of Machine Learning Research, pages 36029– 36047. PM...

2024

[21] [21]

M. R. Walter, S. Patki, A. F. Daniele, E. Fahnestock, F. Duvallet, S. Hemachandra, J. Oh, A. Stentz, N. Roy, and T. M. Howard. Language understanding for field and service robots in a priori unknown environments.Field Robotics, 2(1):1191–1231, 2022. doi:10.55417/FR. 2022040. URLhttps://doi.org/10.55417/fr.2022040

work page doi:10.55417/fr 2022

[22] [22]

D. Kim, N. Oh, D. Hwang, and D. Park. Lingo-space: Language-conditioned incremen- tal grounding for space. In M. J. Wooldridge, J. G. Dy, and S. Natarajan, editors,Thirty- Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Ed- ucati...

work page doi:10.1609/aaai.v38i9.28898 2024

[23] [23]

J. A. Nader, D. Lee, N. Dennler, and A. Bobu. Quicklap: Quick language-action preference learning for semi-autonomous agents, 2026. URLhttps://arxiv.org/abs/2511.17855

Pith/arXiv arXiv 2026

[24] [24]

Shirwatkar, N

H. Zhang, N. Zantout, P. Kachana, J. Zhang, and W. Wang. Iref-vla: A benchmark for interactive referential grounding with imperfect language in 3d scenes. InIEEE Interna- tional Conference on Robotics and Automation, ICRA 2025, Atlanta, GA, USA, May 19- 23, 2025, pages 1677–1683. IEEE, 2025. doi:10.1109/ICRA55743.2025.11127464. URL https://doi.org/10.1109...

work page doi:10.1109/icra55743.2025.11127464 2025

[25] [25]

A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3d: Learning from RGB-D data in indoor environments. In2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pages 667–676. IEEE Computer Society, 2017. doi:10.1109/3DV .2017.00081. URLhttps: //doi.org/10....

work page doi:10.1109/3dv 2017

[26] [26]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2432–2443. IEEE Computer Society, 2017. doi:10.1109/CVPR.2017.261. URLhttps:// doi.org/10.1109/CV...

work page doi:10.1109/cvpr.2017.261 2017

[27] [27]

there is a <target object> <relation> <existing object(s)>

A. Dehghan, G. Baruch, Z. Chen, Y . Feigin, P. Fu, T. Gebauer, D. Kurz, T. Dimry, B. Joffe, A. Schwartz, and E. Shulman. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In J. Vanschoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Neu...

2021