BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
Pith reviewed 2026-05-17 21:37 UTC · model grok-4.3
The pith
Embodied Memory Visual Reasoning outperforms baselines on the BridgeEQA benchmark for real bridge inspections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that formulating inspection EQA as a Markov decision process lets the Embodied Memory Visual Reasoning (EMVR) method outperform baselines on BridgeEQA, a benchmark of 2,200 question-answer pairs drawn from 200 real bridge scenes with linked professional inspection reports.
What carries the argument
Embodied Memory Visual Reasoning (EMVR), which casts the inspection EQA task as a Markov decision process to support sequential memory use, image citation, and multi-scale visual reasoning.
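To make the formulation concrete, the sketch below treats a scene as a graph whose nodes are images, with an agent that moves between neighbouring views, keeps the visited views in memory, and stops by citing one image as its answer. It is a minimal illustration under stated assumptions: the action names (`move`, `answer`), the `NEIGHBOURS` and `CAPTIONS` tables, and the keyword-matching `toy_policy` are all hypothetical and are not taken from the paper, where a vision-language model would presumably play the role of the policy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class State:
    """Agent state: the current view plus the ordered list of views visited."""
    current: str
    memory: list[str] = field(default_factory=list)


# Hypothetical scene graph: image id -> neighbouring image ids.
NEIGHBOURS = {
    "overview": ["deck", "pier_1"],
    "deck": ["overview", "joint_closeup"],
    "pier_1": ["overview", "bearing_closeup"],
    "joint_closeup": ["deck"],
    "bearing_closeup": ["pier_1"],
}
# Hypothetical captions standing in for what a model would see in each image.
CAPTIONS = {
    "overview": "full bridge elevation",
    "deck": "deck surface with expansion joint",
    "pier_1": "pier with bearing on top",
    "joint_closeup": "expansion joint with debris",
    "bearing_closeup": "bearing with corrosion",
}


def step(state: State, action: tuple[str, str]) -> State:
    """Transition function: a legal ('move', target) yields the next state."""
    kind, target = action
    if kind != "move" or target not in NEIGHBOURS[state.current]:
        raise ValueError(f"illegal action {action} from {state.current}")
    return State(current=target, memory=state.memory + [target])


def toy_policy(state: State, keyword: str) -> tuple[str, str]:
    """Stand-in policy: answer if the keyword is in view, otherwise explore."""
    if keyword in CAPTIONS[state.current]:
        return ("answer", state.current)
    unseen = [n for n in NEIGHBOURS[state.current] if n not in state.memory]
    # Prefer an unseen neighbour; otherwise backtrack to the first neighbour.
    return ("move", (unseen or NEIGHBOURS[state.current])[0])


def run_episode(keyword: str, start: str = "overview", max_steps: int = 10):
    """Roll out the policy; return (cited image or None, visited path)."""
    state = State(current=start, memory=[start])
    for _ in range(max_steps):
        action = toy_policy(state, keyword)
        if action[0] == "answer":
            return action[1], state.memory
        state = step(state, action)
    return None, state.memory
```

Calling `run_episode("corrosion")` walks from the overview through the deck views, backtracks, and ends by citing `bearing_closeup`. The step budget (`max_steps`) is what keeps an episode finite when no view matches.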
If this is right
- State-of-the-art vision-language models show substantial performance gaps on tasks that require long-range spatial understanding and episodic memory.
- Professional inspection reports can ground open-vocabulary questions for standardized evaluation in embodied settings.
- The Image Citation Relevance metric measures a model's ability to select and reference relevant images during question answering.
- Bridge inspection provides a repeatable domain for testing multi-scale reasoning in embodied agents.
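The Image Citation Relevance bullet above can be illustrated with a set-overlap score. This summary does not give the paper's definition of the metric, so the sketch below is only one plausible formulation (precision, recall, and F1 between the images a model cites and the images annotated as relevant). The function name `citation_overlap` and the image ids are hypothetical.

```python
from __future__ import annotations


def citation_overlap(cited: set[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of cited image ids against relevant ids.

    Assumption: relevance is binary and annotated per question. The paper's
    actual Image Citation Relevance metric may be defined differently.
    """
    hits = len(cited & relevant)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: three cited images, two of which are relevant.
scores = citation_overlap(
    cited={"img_03", "img_17", "img_20"},
    relevant={"img_17", "img_20", "img_41", "img_42"},
)
```

A score of this kind rewards a model for citing the right views and penalises both missed evidence and padding the answer with irrelevant images.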
Where Pith is reading between the lines
- The same benchmark style could apply to inspections of tunnels, dams, or buildings to test embodied agents in other infrastructure domains.
- Adding depth or thermal sensors to the EMVR pipeline could improve accuracy in low-visibility or material-damage scenarios.
- Successful agents trained on BridgeEQA might support robotic systems that reduce human time spent in hazardous bridge environments.
- Extending the benchmark with time-series imagery or changing conditions would test memory retention across multiple visits.
Load-bearing premise
The 200 real-world bridge scenes and their linked professional reports sufficiently capture the episodic memory and multi-scale reasoning challenges of actual infrastructure inspections.
What would settle it
Evaluating EMVR on a held-out set of new bridge scenes or against human inspectors performing the same questions would show whether the performance gains hold beyond the original 200 scenes.
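A held-out evaluation of this kind only settles the question if no scene contributes questions to both sides of the split. The sketch below shows one way to enforce that by assigning whole scenes, rather than individual questions, to a split using a stable hash of the scene id. The record layout (`scene_id`, `question`), the function names, and the 20% held-out fraction are assumptions for illustration, not details from the paper or its released dataset.

```python
import hashlib


def split_of(scene_id: str, held_out_fraction: float = 0.2) -> str:
    """Deterministically map a scene id to 'dev' or 'held_out'."""
    digest = hashlib.sha256(scene_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "dev"


def split_questions(questions, held_out_fraction: float = 0.2):
    """Group question records by the split of the scene they belong to."""
    splits = {"dev": [], "held_out": []}
    for q in questions:
        splits[split_of(q["scene_id"], held_out_fraction)].append(q)
    return splits


# Hypothetical records: 200 scenes with 11 questions each (2,200 in total).
questions = [
    {"scene_id": f"bridge_{i:03d}", "question": f"q{j}"}
    for i in range(200)
    for j in range(11)
]
splits = split_questions(questions)
```

Because the assignment depends only on the scene id, the split is reproducible across runs and stays stable when new scenes are added.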
Original abstract
Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA. It demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce BridgeEQA, a benchmark for embodied question answering in bridge inspections consisting of 200 real-world scenes with professional reports and 2,200 QA pairs. It proposes EMVR, an MDP-based approach for visual reasoning with memory, which reportedly outperforms baselines on this benchmark and introduces a new metric for image citation relevance.
Significance. Should the empirical results hold under scrutiny, this work offers a valuable new benchmark for EQA research grounded in real infrastructure inspection data rather than simulated environments. The release of code and dataset is a positive aspect. The MDP formulation for EMVR could provide a useful framework for handling episodic memory in embodied agents. However, the overall significance depends on addressing whether the benchmark scale adequately tests the targeted challenges.
major comments (2)
- The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.
- The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.
minor comments (2)
- The average image count is reported to two decimal places (47.93); consider whether this precision is necessary, or explain how the figure was calculated.
- Ensure that the citation for OpenEQA is included when referencing the style of the QA pairs.
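On the precision question raised in the first minor comment, a short calculation shows what the two decimal places imply. The total number of images is not stated in this summary, so the sketch below back-calculates which integer totals are consistent with a mean of 47.93 over 200 scenes. It assumes the mean was rounded half-up to two decimals, and the function name and search range are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def consistent_totals(mean_str: str, n_scenes: int, lo: int, hi: int) -> list[int]:
    """Integer image totals in [lo, hi] whose rounded mean equals mean_str."""
    target = Decimal(mean_str)
    matches = []
    for total in range(lo, hi + 1):
        mean = (Decimal(total) / Decimal(n_scenes)).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        if mean == target:
            matches.append(total)
    return matches


candidates = consistent_totals("47.93", n_scenes=200, lo=9500, hi=9700)
print(candidates)  # [9585, 9586]
```

Under that rounding assumption only two totals fit, so stating the total image count alongside the mean would remove the ambiguity.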
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
- Referee: The abstract reports 'substantial performance gaps' and that 'EMVR shows strong performance over the baselines' but does not include any quantitative numbers, specific baseline models, or statistical details. This omission is load-bearing for the central claim of EMVR's superiority and requires detailed reporting in the experiments section.
  Authors: We agree that the abstract would benefit from quantitative support for the central claims. In the revised version, we will update the abstract to include specific metrics (e.g., EMVR accuracy of X% vs. baseline Y% on the BridgeEQA benchmark, with standard deviations and p-values from statistical tests). We will also expand the experiments section with a dedicated table comparing EMVR against the full set of baselines (including model names, hyperparameters, and ablation results) to make the superiority claims fully transparent and reproducible.
  revision: yes
- Referee: The use of 200 scenes with an average of 47.93 images per scene from egocentric sampling is presented as capturing multi-scale reasoning and long-range spatial understanding. However, without explicit tests for long-horizon or multi-visit scenarios, it is unclear if this setup sufficiently stresses the episodic memory aspects central to the paper's motivation, as the static nature may not fully represent real inspection workflows.
  Authors: We acknowledge that our benchmark uses static scenes rather than dynamic multi-visit trajectories. However, each scene contains an average of 47.93 egocentric images sampled along realistic inspection paths, requiring agents to perform sequential visual reasoning and memory retrieval across long spatial ranges within a single episode. The MDP formulation in EMVR explicitly models state transitions and memory updates to handle this. We will add a new subsection in the discussion clarifying these design choices, including an analysis of scene complexity (e.g., average path length and number of relevant images per question), and note the limitation regarding multi-visit scenarios as an avenue for future work while arguing that the current scale still meaningfully advances episodic memory EQA beyond simulated environments.
  revision: partial
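The statistical reporting promised in the first response could take several forms, and neither the referee nor the authors specify one. As a hedged illustration, the sketch below computes a paired bootstrap confidence interval for the mean per-question score difference between two systems evaluated on the same questions. The function name, the score lists, and the choice of bootstrap over a p-value-based test are assumptions and do not describe the paper's procedure.

```python
from __future__ import annotations

import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Mean of (a - b) with a percentile bootstrap confidence interval.

    Scores must be aligned: index i in both lists refers to the same question.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-question scores for two systems on the same 8 questions.
emvr_scores = [1.0, 0.5, 1.0, 0.0, 1.0, 1.0, 0.5, 1.0]
baseline_scores = [0.5, 0.5, 0.0, 0.0, 1.0, 0.5, 0.5, 0.0]
diff, ci_low, ci_high = paired_bootstrap_ci(emvr_scores, baseline_scores)
```

If the interval excludes zero, the observed gap is unlikely to be an artefact of which questions happened to be sampled. Resampling scenes instead of questions would be the more conservative variant, since questions within one scene are not independent.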
Circularity Check
No significant circularity in derivation or claims
The paper constructs BridgeEQA as a new benchmark directly from external professional inspection reports and real-world scenes, then evaluates EMVR (formulated as an MDP) empirically against baselines on this benchmark. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the performance results are independent empirical measurements on externally grounded data. This is a standard honest benchmark-plus-method paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Professional inspection reports provide reliable ground truth for open-vocabulary questions about bridge condition.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "EMVR formulates the inspection EQA task as a Markov decision process. ... images are nodes, and an agent takes actions to traverse views"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "We propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.