pith. machine review for the scientific record.

arxiv: 2511.12676 · v2 · submitted 2025-11-16 · 💻 cs.CV · cs.AI

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Pith reviewed 2026-05-17 21:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords BridgeEQA · embodied question answering · vision-language models · bridge inspection · episodic memory · Markov decision process · visual reasoning · infrastructure inspection

The pith

Embodied Memory Visual Reasoning outperforms baselines on the BridgeEQA benchmark for real bridge inspections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs drawn from professional inspection reports across 200 real-world bridge scenes. It frames inspection EQA as a testbed for episodic memory and multi-scale spatial reasoning in embodied agents. The authors develop Embodied Memory Visual Reasoning (EMVR), which treats the task as a Markov decision process. The method achieves stronger results than state-of-the-art vision-language models on the benchmark and on Image Citation Relevance, a new metric for how well a model cites relevant images. The work shows how domain-specific benchmarks grounded in real inspection data can expose and address gaps in current models for practical infrastructure tasks.

Core claim

The paper establishes that formulating inspection EQA as a Markov decision process enables the Embodied Memory Visual Reasoning method to deliver stronger performance than baselines on the BridgeEQA benchmark of 2,200 questions and answers from 200 real bridge scenes with linked professional reports.

What carries the argument

Embodied Memory Visual Reasoning (EMVR), which casts the inspection EQA task as a Markov decision process to support sequential memory use, image citation, and multi-scale visual reasoning.
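
To make the Markov decision process framing concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a state made of the question plus notes on the images seen so far, and two action types: inspect an image, or answer with citations. All names (`State`, `Inspect`, `Answer`, `step`, `run_episode`, `toy_policy`) are hypothetical, and the text notes stand in for whatever a vision-language model would extract from each image.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Inspect:
    """Action: look at one image of the scene (hypothetical action type)."""
    image_id: str


@dataclass(frozen=True)
class Answer:
    """Terminal action: answer the question and cite supporting images."""
    text: str
    cited: Tuple[str, ...]


Action = Union[Inspect, Answer]


@dataclass
class State:
    """Agent state: the question plus episodic memory of what was seen."""
    question: str
    memory: Dict[str, str] = field(default_factory=dict)  # image_id -> note
    done: bool = False
    answer: Optional[Answer] = None


def step(state: State, action: Action, scene: Dict[str, str]) -> State:
    """Deterministic transition: inspecting adds a note, answering terminates."""
    if state.done:
        raise ValueError("episode already finished")
    if isinstance(action, Inspect):
        memory = dict(state.memory)
        memory[action.image_id] = scene[action.image_id]
        return State(state.question, memory)
    return State(state.question, dict(state.memory), done=True, answer=action)


def run_episode(question: str, scene: Dict[str, str],
                policy: Callable[[State, Dict[str, str]], Action],
                max_steps: int = 50) -> State:
    """Roll the policy forward until it answers or the step budget runs out."""
    state = State(question)
    for _ in range(max_steps):
        state = step(state, policy(state, scene), scene)
        if state.done:
            break
    return state


def toy_policy(state: State, scene: Dict[str, str]) -> Action:
    """Inspect unseen images in order; answer once a note mentions 'crack'."""
    hits = tuple(i for i, note in state.memory.items() if "crack" in note)
    if hits:
        return Answer("Cracking observed.", hits)
    unseen = [i for i in scene if i not in state.memory]
    if unseen:
        return Inspect(unseen[0])
    return Answer("No cracking observed.", ())


# Toy scene: image_id -> text note (an assumed stand-in for real imagery).
scene = {
    "img_01": "clean deck surface",
    "img_02": "crack in abutment",
    "img_03": "bearing, no visible damage",
}
final = run_episode("Is there cracking?", scene, toy_policy)
print(final.answer.text, final.answer.cited)
```

In this toy run the agent stops after two inspections and cites only the image whose note supports the answer, which is the behaviour a citation-aware metric would reward.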

If this is right

  • State-of-the-art vision-language models show substantial performance gaps on tasks that require long-range spatial understanding and episodic memory.
  • Professional inspection reports can ground open-vocabulary questions for standardized evaluation in embodied settings.
  • The Image Citation Relevance metric measures a model's ability to select and reference relevant images during question answering.
  • Bridge inspection provides a repeatable domain for testing multi-scale reasoning in embodied agents.
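
The list above names the Image Citation Relevance metric, but this page does not give its formula. As an illustration only, one common set-based way to score citations is precision, recall, and F1 between the images a model cites and the images annotated as relevant. This is an assumed stand-in, not the paper's definition, and `citation_scores` and the image IDs are hypothetical.

```python
from typing import Dict, Iterable


def citation_scores(cited: Iterable[str], relevant: Iterable[str]) -> Dict[str, float]:
    """Set-based precision/recall/F1 over image IDs (assumed proxy metric)."""
    cited_set, relevant_set = set(cited), set(relevant)
    overlap = len(cited_set & relevant_set)
    # Guard against empty sets so the function never divides by zero.
    precision = overlap / len(cited_set) if cited_set else 0.0
    recall = overlap / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: 3 cited images, 4 annotated as relevant, 2 in common.
scores = citation_scores(
    cited=["img_02", "img_07", "img_11"],
    relevant=["img_02", "img_11", "img_30", "img_31"],
)
print(scores)
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7. A graded or judge-based relevance score would need a different formulation.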

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same benchmark style could apply to inspections of tunnels, dams, or buildings to test embodied agents in other infrastructure domains.
  • Adding depth or thermal sensors to the EMVR pipeline could improve accuracy in low-visibility or material-damage scenarios.
  • Successful agents trained on BridgeEQA might support robotic systems that reduce human time spent in hazardous bridge environments.
  • Extending the benchmark with time-series imagery or changing conditions would test memory retention across multiple visits.

Load-bearing premise

The 200 real-world bridge scenes and their linked professional reports sufficiently capture the episodic memory and multi-scale reasoning challenges of actual infrastructure inspections.

What would settle it

Evaluating EMVR on a held-out set of new bridge scenes or against human inspectors performing the same questions would show whether the performance gains hold beyond the original 200 scenes.
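
A held-out evaluation of this kind is only informative if the split is made by scene, so that no question about a test bridge appears during development. The sketch below works under that assumption. The record layout (`scene_id`, `question`) and the function `split_by_scene` are hypothetical and not taken from the released dataset, and the toy data assumes a uniform 11 questions per scene (2,200 / 200), which the source does not state.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_by_scene(records: List[Record], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Hold out whole scenes so dev and test never share a bridge."""
    scene_ids = sorted({r["scene_id"] for r in records})
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(scene_ids)
    n_test = max(1, round(len(scene_ids) * test_fraction))
    test_scenes = set(scene_ids[:n_test])
    dev = [r for r in records if r["scene_id"] not in test_scenes]
    test = [r for r in records if r["scene_id"] in test_scenes]
    return dev, test


# Synthetic stand-in for the benchmark: 200 scenes x 11 questions = 2,200.
records = [
    {"scene_id": f"bridge_{s:03d}", "question": f"q{q}"}
    for s in range(200)
    for q in range(11)
]
dev, test = split_by_scene(records)
print(len(dev), len(test))
```

Splitting by question instead would let images of the same bridge inform both sides of the split, which would overstate generalization to new structures.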

Figures

Figures reproduced from arXiv: 2511.12676 by Asad Ur Rahman, Joshua Gao, Subin Varghese, Vedhus Hoskere.

Figure 1: BridgeEQA: Open-Vocabulary Embodied Question Answering for bridge inspection. Two example scenes from our benchmark showing questions that require synthesizing visual evidence across multiple egocentric images to assess bridges.
Figure 2: Illustration of how EMVR mitigates the “lost in the mid…
Figure 3: Scene graph structure for bridge inspection. Nodes rep…
Figure 4: Overview of Embodied Memory Visual Reasoning. An…
Figure 6: Pipeline for constructing the BridgeEQA dataset from…
Figure 7: Representative sample images from BridgeEQA across…
Figure 8: Distribution of NBI condition ratings for bridge compo…
Figure 9: Distribution of question types in the BridgeEQA dataset…
Figure 10: We observed several qualitative performance patterns. We provide a correct example and contrast this with two commonly…
Figure 11: Condition rating prediction accuracy comparison…
Original abstract

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA. It demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BridgeEQA, a benchmark for embodied question answering in bridge inspections consisting of 200 real-world scenes with professional reports and 2,200 QA pairs. It proposes EMVR, an MDP-based approach for visual reasoning with memory, which reportedly outperforms baselines on this benchmark. It also introduces a new metric, Image Citation Relevance.

Significance. Should the empirical results hold under scrutiny, this work offers a valuable new benchmark for EQA research grounded in real infrastructure inspection data rather than simulated environments. The release of code and dataset is a positive aspect. The MDP formulation for EMVR could provide a useful framework for handling episodic memory in embodied agents. However, the overall significance depends on addressing whether the benchmark scale adequately tests the targeted challenges.

major comments (2)
  1. The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.
  2. The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.
minor comments (2)
  1. The average image count is reported with two decimal places (47.93); consider if this precision is necessary or if it should be explained how it was calculated.
  2. Ensure that the citation for OpenEQA is included when referencing the style of the QA pairs.
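
On minor comment 1, the reported mean can be sanity-checked with simple arithmetic. The sketch below assumes the mean is the total image count divided by the 200 scenes. The total of 9,586 images is inferred from 47.93 × 200 and is not stated in the source.

```python
import math

NUM_SCENES = 200        # from the abstract
REPORTED_MEAN = 47.93   # images per scene, from the abstract
NUM_QA_PAIRS = 2200     # from the abstract

# Inferred, not reported: the integer total consistent with the mean.
implied_total = round(REPORTED_MEAN * NUM_SCENES)
mean = implied_total / NUM_SCENES
qa_per_scene = NUM_QA_PAIRS / NUM_SCENES

# With 200 scenes the mean can only change in steps of 1/200 = 0.005.
step = 1 / NUM_SCENES

print(implied_total, mean, qa_per_scene, step)
```

Because the divisor is 200, the mean moves in increments of 0.005, so reporting two decimal places discards almost no information. The same division gives an average of 11 QA pairs per scene.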

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.

    Authors: We agree that the abstract would benefit from quantitative support for the central claims. In the revised version, we will update the abstract to include specific metrics (e.g., EMVR accuracy of X% vs. baseline Y% on the BridgeEQA benchmark, with standard deviations and p-values from statistical tests). We will also expand the experiments section with a dedicated table comparing EMVR against the full set of baselines (including model names, hyperparameters, and ablation results) to make the superiority claims fully transparent and reproducible. revision: yes

  2. Referee: The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.

    Authors: We acknowledge that our benchmark uses static scenes rather than dynamic multi-visit trajectories. However, each scene contains an average of 47.93 egocentric images sampled along realistic inspection paths, requiring agents to perform sequential visual reasoning and memory retrieval across long spatial ranges within a single episode. The MDP formulation in EMVR explicitly models state transitions and memory updates to handle this. We will add a new subsection in the discussion clarifying these design choices, including an analysis of scene complexity (e.g., average path length and number of relevant images per question), and note the limitation regarding multi-visit scenarios as an avenue for future work while arguing that the current scale still meaningfully advances episodic memory EQA beyond simulated environments. revision: partial
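
The statistical tests promised in response 1 are not specified in the source. One standard option for comparing two systems on the same questions is an exact McNemar test on per-question correctness. The sketch below assumes binary correct/incorrect judgments for each question. The function name `exact_mcnemar` and the toy data are hypothetical and are not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def exact_mcnemar(a_correct: Sequence[bool],
                  b_correct: Sequence[bool]) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (questions only A got right, questions only B got right, p-value).
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same questions")
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only                  # discordant pairs carry the signal
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy data for 20 questions: 8 where only A is right, 1 where only B is right.
a = [True] * 8 + [False] * 1 + [True] * 6 + [False] * 5
b = [False] * 8 + [True] * 1 + [True] * 6 + [False] * 5
a_only, b_only, p_value = exact_mcnemar(a, b)
print(a_only, b_only, p_value)
```

For this toy input the p-value is 2 × (1 + 9) / 512 = 0.0390625. If answers are scored on a graded scale instead of pass/fail, a paired bootstrap or a Wilcoxon signed-rank test would be the closer fit.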

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

Full rationale

The paper constructs BridgeEQA as a new benchmark directly from external professional inspection reports and real-world scenes, then evaluates EMVR (formulated as an MDP) empirically against baselines on this benchmark. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the performance results are independent empirical measurements on externally grounded data. This is a standard honest benchmark-plus-method paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the collected bridge scenes and reports plus standard assumptions in vision-language model evaluation; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Professional inspection reports provide reliable ground truth for open-vocabulary questions about bridge condition.
    Invoked when grounding the 2,200 QA pairs.



