arxiv: 2511.12676 · v2 · submitted 2025-11-16 · 💻 cs.CV · cs.AI

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

Pith reviewed 2026-05-17 21:37 UTC · model grok-4.3


keywords: BridgeEQA, embodied question answering, vision-language models, bridge inspection, episodic memory, Markov decision process, visual reasoning, infrastructure inspection


The pith

Embodied Memory Visual Reasoning outperforms baselines on the BridgeEQA benchmark for real bridge inspections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs drawn from professional inspection reports across 200 real-world bridge scenes. It frames inspection EQA as a testbed for episodic memory and multi-scale spatial reasoning in embodied agents. The authors develop Embodied Memory Visual Reasoning (EMVR), which treats the task as a Markov decision process. The abstract reports that EMVR outperforms state-of-the-art vision-language model baselines on the benchmark, though it gives no figures. The paper also introduces Image Citation Relevance, a metric for whether a model cites relevant images. The work shows how domain-specific benchmarks grounded in real inspection data can expose and address gaps in current models for practical infrastructure tasks.

Core claim

The paper establishes that formulating inspection EQA as a Markov decision process enables the Embodied Memory Visual Reasoning method to deliver stronger performance than baselines on the BridgeEQA benchmark of 2,200 questions and answers from 200 real bridge scenes with linked professional reports.

What carries the argument

Embodied Memory Visual Reasoning (EMVR), which casts the inspection EQA task as a Markov decision process to support sequential memory use, image citation, and multi-scale visual reasoning.
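To make the MDP framing concrete, here is a minimal sketch of sequential navigation over an image graph. It is not the authors' implementation: the toy graph, the `State` fields, the two action types (`move`, `answer`), and the `explore_then_answer` policy are all hypothetical names chosen for illustration, and the paper's actual state, action, and reward definitions are not given in this summary.

```python
from dataclasses import dataclass, field

# Hypothetical toy scene graph: image IDs are nodes, edges link adjacent views.
SCENE_GRAPH = {
    "img_0": ["img_1"],
    "img_1": ["img_0", "img_2"],
    "img_2": ["img_1"],
}


@dataclass
class State:
    current: str                                 # image the agent is looking at
    memory: list = field(default_factory=list)   # images visited so far, in order


def actions(state):
    """Legal actions: move to a neighboring view, or stop and answer."""
    return [("move", n) for n in SCENE_GRAPH[state.current]] + [("answer", None)]


def step(state, action):
    """Deterministic transition; returns (next_state, done)."""
    kind, target = action
    if kind == "answer":
        return state, True
    if target not in SCENE_GRAPH[state.current]:
        raise ValueError(f"{target} is not adjacent to {state.current}")
    return State(current=target, memory=state.memory + [target]), False


def explore_then_answer(state):
    """Toy policy (assumed): visit any unseen neighbor, otherwise answer."""
    for kind, target in actions(state):
        if kind == "move" and target not in state.memory:
            return (kind, target)
    return ("answer", None)


def run_episode(start, policy, max_steps=10):
    """Roll out a policy until it answers or the step budget runs out."""
    state = State(current=start, memory=[start])
    for _ in range(max_steps):
        state, done = step(state, policy(state))
        if done:
            break
    return state.memory
```

In this sketch the visited-image list plays the role of episodic memory, and it is also the natural pool from which an answer would cite images. A real agent would replace the toy policy with a vision-language model that picks the next action from the question and the views seen so far.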

If this is right

State-of-the-art vision-language models show substantial performance gaps on tasks that require long-range spatial understanding and episodic memory.
Professional inspection reports can ground open-vocabulary questions for standardized evaluation in embodied settings.
The Image Citation Relevance metric measures a model's ability to select and reference relevant images during question answering.
Bridge inspection provides a repeatable domain for testing multi-scale reasoning in embodied agents.
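This summary does not give the formula for Image Citation Relevance, so the sketch below uses a plain stand-in: set-based precision and recall of the image IDs a model cites against a set of IDs judged relevant. The function name and the choice of precision/recall are assumptions, not the paper's definition.

```python
def citation_precision_recall(cited, relevant):
    """Stand-in citation score (assumed, not the paper's metric).

    cited:    image IDs the model referenced in its answer
    relevant: image IDs judged relevant to the question
    Returns (precision, recall); each is 0.0 when its denominator is empty.
    """
    cited_set, relevant_set = set(cited), set(relevant)
    hits = len(cited_set & relevant_set)
    precision = hits / len(cited_set) if cited_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, a model that cites four images of which two are relevant, when exactly two relevant images exist, scores precision 0.5 and recall 1.0 under this stand-in. Any metric of this kind rewards answers that point to the right evidence rather than to every image in the scene.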

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same benchmark style could apply to inspections of tunnels, dams, or buildings to test embodied agents in other infrastructure domains.
Adding depth or thermal sensors to the EMVR pipeline could improve accuracy in low-visibility or material-damage scenarios.
Successful agents trained on BridgeEQA might support robotic systems that reduce human time spent in hazardous bridge environments.
Extending the benchmark with time-series imagery or changing conditions would test memory retention across multiple visits.

Load-bearing premise

The 200 real-world bridge scenes and their linked professional reports sufficiently capture the episodic memory and multi-scale reasoning challenges of actual infrastructure inspections.

What would settle it

Evaluating EMVR on a held-out set of new bridge scenes or against human inspectors performing the same questions would show whether the performance gains hold beyond the original 200 scenes.

Figures

Figures reproduced from arXiv: 2511.12676 by Asad Ur Rahman, Joshua Gao, Subin Varghese, Vedhus Hoskere.

**Figure 1.** BridgeEQA: Open-Vocabulary Embodied Question Answering for bridge inspection. Two example scenes from our benchmark showing questions that require synthesizing visual evidence across multiple egocentric images to assess bridges.

**Figure 2.** Illustration of how EMVR mitigates the “lost in the mid…

**Figure 3.** Scene graph structure for bridge inspection. Nodes rep…

**Figure 4.** Overview of Embodied Memory Visual Reasoning. An…

**Figure 6.** Pipeline for constructing the BridgeEQA dataset from…

**Figure 7.** Representative sample images from BridgeEQA across…

**Figure 8.** Distribution of NBI condition ratings for bridge compo…

**Figure 9.** Distribution of question types in the BridgeEQA dataset…

**Figure 10.** We observed several qualitative performance patterns. We provide a correct example and contrast this with two commonly…

**Figure 11.** Condition rating prediction accuracy comparison…

Original abstract

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA. It demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce BridgeEQA, a benchmark for embodied question answering in bridge inspections consisting of 200 real-world scenes with professional reports and 2,200 QA pairs. It proposes EMVR, an MDP-based approach for visual reasoning with memory, which reportedly outperforms baselines on this benchmark and introduces a new metric for image citation relevance.

Significance. Should the empirical results hold under scrutiny, this work offers a valuable new benchmark for EQA research grounded in real infrastructure inspection data rather than simulated environments. The release of code and dataset is a positive aspect. The MDP formulation for EMVR could provide a useful framework for handling episodic memory in embodied agents. However, the overall significance depends on addressing whether the benchmark scale adequately tests the targeted challenges.

major comments (2)

The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.
The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.

minor comments (2)

The average image count is reported to two decimal places (47.93); consider whether this precision is necessary, or explain how the figure was calculated.
Ensure that the citation for OpenEQA is included when referencing the style of the QA pairs.
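As a quick check on the reported scale, the arithmetic below derives two quantities from the abstract's numbers: QA pairs per scene and the total image count implied by the per-scene mean. The inputs come from the abstract; the derived totals are inferences, and the image total assumes the reported mean is taken over all 200 scenes.

```python
# Figures reported in the abstract.
NUM_SCENES = 200
NUM_QA_PAIRS = 2200
MEAN_IMAGES_PER_SCENE = 47.93

# Derived quantities (inferred, not stated in the source).
qa_per_scene = NUM_QA_PAIRS / NUM_SCENES
implied_total_images = round(MEAN_IMAGES_PER_SCENE * NUM_SCENES)

# Dividing the implied total back out reproduces the reported mean.
recovered_mean = round(implied_total_images / NUM_SCENES, 2)
```

Under that assumption the benchmark averages 11 QA pairs per scene, and a total of about 9,586 images would account for the two-decimal mean.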

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.


Referee: The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.

Authors: We agree that the abstract would benefit from quantitative support for the central claims. In the revised version, we will update the abstract to include specific metrics (e.g., EMVR accuracy of X% vs. baseline Y% on the BridgeEQA benchmark, with standard deviations and p-values from statistical tests). We will also expand the experiments section with a dedicated table comparing EMVR against the full set of baselines (including model names, hyperparameters, and ablation results) to make the superiority claims fully transparent and reproducible.

Revision: yes

Referee: The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.

Authors: We acknowledge that our benchmark uses static scenes rather than dynamic multi-visit trajectories. However, each scene contains an average of 47.93 egocentric images sampled along realistic inspection paths, requiring agents to perform sequential visual reasoning and memory retrieval across long spatial ranges within a single episode. The MDP formulation in EMVR explicitly models state transitions and memory updates to handle this. We will add a new subsection in the discussion clarifying these design choices, including an analysis of scene complexity (e.g., average path length and number of relevant images per question), and note the limitation regarding multi-visit scenarios as an avenue for future work while arguing that the current scale still meaningfully advances episodic memory EQA beyond simulated environments.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation or claims


The paper constructs BridgeEQA as a new benchmark directly from external professional inspection reports and real-world scenes, then evaluates EMVR (formulated as an MDP) empirically against baselines on this benchmark. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the performance results are independent empirical measurements on externally grounded data. This is a standard honest benchmark-plus-method paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the collected bridge scenes and reports plus standard assumptions in vision-language model evaluation; no free parameters or invented entities are described.

axioms (1)

domain assumption: Professional inspection reports provide reliable ground truth for open-vocabulary questions about bridge condition. Invoked when grounding the 2,200 QA pairs.

pith-pipeline@v0.9.0 · 5514 in / 1051 out tokens · 25903 ms · 2026-05-17T21:37:04.260693+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "EMVR formulates the inspection EQA task as a Markov decision process. ... images are nodes, and an agent takes actions to traverse views"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
