WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

An Zhang; Bingnan Liu; Chenhang Cui; Fei Shen; Jiani Luo; Lingbei Meng; Rui Huang; Tinghao Wang; Xiande Huang; Zhirong Shen

arxiv: 2605.20306 · v1 · pith:3VNTA5L3new · submitted 2026-05-19 · 💻 cs.CV · cs.LG

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Bingnan Liu , Chenhang Cui , Rui Huang , Jiani Luo , Zhirong Shen , Tinghao Wang , Xiande Huang , Lingbei Meng

show 2 more authors

Fei Shen An Zhang

This is my paper

Pith reviewed 2026-05-21 07:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords WildRoadBenchaerial road damageUAV imageryvision-language modelsautonomous agentsvisual groundingAP_50 metricbenchmark

0 comments

The pith

WildRoadBench shows both VLMs and LLM agents fall far short of reliable road-damage grounding on wild UAV imagery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WildRoadBench as a single professionally annotated UAV corpus that tests two separate capabilities under the same per-class AP_50 metric. In the VLM Track a fixed model must localize damage from one image and one prompt through a standardized pipeline. In the Agent Track an autonomous LLM agent receives only a task brief and a small data slice, then must search the web, adapt models, write code, and submit results within a fixed interaction budget. Results indicate that closed-source frontier VLMs lead the leaderboard yet still leave more than half the metric unachieved, open-source grounders remain lower and collapse on small targets, and agents lag the best VLM despite greater affordances.

Core claim

On the same wild aerial road-damage UAV images and the same per-class AP_50 metric, neither fixed vision-language models nor budget-constrained LLM-driven agents reach reliable localization performance; closed-source models lead but substantial gaps remain, open-source models plateau lower especially on small instances, and agents do not surpass the strongest VLM.

What carries the argument

The dual-track WildRoadBench protocol that applies identical UAV images and AP_50 scoring to both direct VLM prompting and autonomous agent adaptation under unified prompting, decoding and parsing rules.

If this is right

Closed-source frontier VLMs set the current ceiling but still miss more than half the AP_50 points.
Open-source VLMs plateau well below closed-source ones and fail especially on small damage instances.
Newer generations and reasoning-style open-source variants do not deliver consistent grounding gains.
LLM agents with web search and code-writing affordances still lag the best fixed VLM within the given budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Domain-specific fine-tuning or aerial-aware pretraining may be required before general VLMs become practical for UAV road inspection.
The benchmark could be extended with multi-frame or temporal inputs to test whether agents improve when given richer context.
Real deployment of autonomous UAV fleets for infrastructure monitoring would need grounding accuracy well above the levels shown here.

Load-bearing premise

The professionally annotated UAV corpus together with per-class AP_50 under one fixed prompting and parsing pipeline gives a fair measure of real-world wild aerial road-damage grounding ability.

What would settle it

A new VLM or agent that achieves AP_50 scores approaching the theoretical maximum on the hidden holdout set while following the benchmark's exact prompting, decoding and submission rules.

Figures

Figures reproduced from arXiv: 2605.20306 by An Zhang, Bingnan Liu, Chenhang Cui, Fei Shen, Jiani Luo, Lingbei Meng, Rui Huang, Tinghao Wang, Xiande Huang, Zhirong Shen.

**Figure 1.** Figure 1: WILDROADBENCH at a glance. A single 1,061-image corpus of professionally-annotated UAV road-damage imagery is graded under two protocols: the VLM TRACK evaluates grounding VLMs in a single forward pass against a schema (released with images and labels), while the AGENT TRACK evaluates autonomous LLM-driven agents against the underlying schema as a hidden holdout; agents may browse the web, train detectors … view at source ↗

**Figure 2.** Figure 2: AGENT TRACK agent pipeline. Each agent runs inside an isolation namespace in which sibling workspaces, the organiser archive and the holdout directory are not visible. An out-ofnamespace grader holds the holdout in process memory and exposes a single submission endpoint that returns a scalar mAP50; per-class and per-image diagnostics are never visible to the agent. Per-workspace caches for packages, model… view at source ↗

**Figure 3.** Figure 3: breaks open-source VLM performance by COCO scale bucket: small (< 322 ), medium (322–962 ), and large (≥ 962 ). Small aerial objects remain largely unsolved: no model exceeds 7.1 % recall. Qwen2.5-VL-32B-AWQ is strongest on medium and large objects, reaching 32.2 % and 39.7 % recall, respectively, while Qwen3-VL variants are roughly five times lower on the same buckets. This suggests that high-resolution t… view at source ↗

**Figure 4.** Figure 4: Eight representative VLM failure modes on the VLM TRACK corpus. Each panel shows one image from the public VLM TRACK set with the ground-truth box (green, solid) and the model’s prediction (red, dashed) overlaid. Empty-prediction panels (Qwen3-VL-8B-Thinking, KimiVL-A3B-Thinking, InternVL2-8B) correspond to runs where the model emitted natural-language reasoning or a <think> preamble but never produced a … view at source ↗

**Figure 5.** Figure 5: Per-class samples from the WildRoadBench dataset. One representative aerial UAV image per scene; ground-truth boxes drawn in the class color. All eight scenes are professionally annotated by road-maintenance experts following highway-maintenance specifications, and instances frequently co-occur within a single frame (e.g. multiple debris items on a single shoulder). NeurIPS Paper Checklist 1. Claims Questi… view at source ↗

read the original abstract

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WildRoadBench, a benchmark coupling VLM direct grounding and LLM-agent engineering on a single professionally annotated UAV corpus of wild aerial road damages. The VLM Track evaluates fixed models under a unified prompting/decoding/parsing pipeline using per-class AP_50; the Agent Track evaluates autonomous agents that must search the web, adapt components, write code, and submit to a hidden holdout under a fixed interaction budget and scalar feedback. Results show closed-source frontier VLMs lead but leave more than half the metric unrealized, open-source models plateau lower (especially on small targets), and agents lag the best VLM or fail to submit.

Significance. If the benchmark is representative, the work provides a reproducible, dual-protocol testbed that quantifies current limitations in domain-specific aerial grounding and agentic adaptation for real-world conditions. The release of code, data, and the hidden-holdout protocol with scalar feedback are concrete strengths that enable follow-up research.

major comments (1)

[Benchmark definition and data sections] Benchmark definition and data sections: the central claim that both tracks remain far from reliable performance rests on the corpus providing a fair, unbiased measure of wild aerial road-damage grounding. The manuscript states the data are 'professionally annotated' and evaluations use 'the same per-class AP_50 metric' under a 'unified prompting, decoding and parsing pipeline,' yet supplies no details on damage class definitions, annotation guidelines, inter-annotator agreement, image selection criteria (altitude, weather, occlusion), or class balance. This is load-bearing; without these, the reported gaps (frontier models leaving >50% on the table, open-source collapse on small targets) cannot be confidently attributed to model limitations rather than testbed artifacts.

minor comments (2)

[Abstract] Abstract: the anonymous link should be replaced with a permanent repository identifier in the camera-ready version to ensure long-term reproducibility.
[Figures and tables] Figure and table captions: several captions are terse; expanding them to explicitly state what each subplot or row measures would improve readability for readers unfamiliar with the dual-track protocol.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on benchmark transparency. We agree that additional details on data construction are warranted to support the central claims and will incorporate them in the revision.

read point-by-point responses

Referee: [Benchmark definition and data sections] Benchmark definition and data sections: the central claim that both tracks remain far from reliable performance rests on the corpus providing a fair, unbiased measure of wild aerial road-damage grounding. The manuscript states the data are 'professionally annotated' and evaluations use 'the same per-class AP_50 metric' under a 'unified prompting, decoding and parsing pipeline,' yet supplies no details on damage class definitions, annotation guidelines, inter-annotator agreement, image selection criteria (altitude, weather, occlusion), or class balance. This is load-bearing; without these, the reported gaps (frontier models leaving >50% on the table, open-source collapse on small targets) cannot be confidently attributed to model limitations rather than testbed artifacts.

Authors: We agree that these details are load-bearing for interpreting the reported performance gaps. The original manuscript emphasized the dual-track protocols and results while keeping the data section concise; the full annotation protocol was documented internally but not expanded in the main text. In the revised manuscript we will add a dedicated subsection (or appendix) that specifies: (1) exact definitions and visual examples for each damage class, (2) the annotation guidelines given to the professional annotators, (3) inter-annotator agreement statistics (e.g., mean IoU and Cohen’s kappa on a double-annotated subset), (4) image selection criteria including altitude ranges, weather and lighting conditions, and occlusion levels chosen to capture realistic “wild” variability, and (5) per-class instance counts and balance statistics. These additions will make explicit that the corpus was constructed to reflect domain-specific challenges rather than to favor any particular model class. We have retained all original annotation records and can integrate this material without altering the reported numbers or conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark introduction with external evaluations

full rationale

This is a benchmark paper that collects a UAV corpus, defines evaluation protocols (VLM Track and Agent Track), and reports empirical results from existing models and agents. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The central claims rest on measured performance gaps against a held-out set under a fixed pipeline, with no self-citation load-bearing steps or ansatz smuggling. The work is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on new professional annotations of a UAV corpus and the choice of AP_50 as the unified metric; no free parameters are fitted to produce the headline results, and no new entities are postulated.

free parameters (1)

interaction budget for agents
Fixed limit on agent actions chosen to constrain exploration and code-writing attempts.

axioms (1)

domain assumption Professional annotations constitute reliable ground truth for per-class damage localization.
Invoked when defining the benchmark corpus and AP_50 evaluation.

pith-pipeline@v0.9.0 · 5827 in / 1402 out tokens · 25837 ms · 2026-05-21T07:51:27.637676+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce WILDROADBENCH, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 10 internal anchors

[1]

Introducing Claude Haiku 4.5

Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/ claude-haiku-4-5 , October 2025. System card available at https://anthropic.com/ claude-haiku-4-5-system-card

work page 2025
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report, 2025. URL https: //arxiv.org/abs/2502.13923. Qwen Team, Alibaba Group

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-bench: Eval- uating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024. doi: 10.48550/ARXIV .2410.07095

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
[7]

A pavement crack detection and evaluation framework for a uav inspection system based on deep learning.Applied Sciences, 14(3):1157, 2024

Xinbao Chen, Chang Liu, Long Chen, Xiaodong Zhu, Yaohui Zhang, and Chenxi Wang. A pavement crack detection and evaluation framework for a uav inspection system based on deep learning.Applied Sciences, 14(3):1157, 2024. doi: 10.3390/app14031157

work page doi:10.3390/app14031157 2024
[8]

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zis- serman. The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88(2):303–338, 2010. doi: 10.1007/s11263-009-0275-4

work page doi:10.1007/s11263-009-0275-4 2010
[9]

Inês Feitosa, Bruno Santos, and Paulo J. G. Almeida. Pavement inspection in transport in- frastructures using unmanned aerial vehicles (uavs).Sustainability, 16(5):2207, 2024. doi: 10.3390/su16052207

work page doi:10.3390/su16052207 2024
[10]

Gemini 3: Introducing the latest Gemini AI model from Google

Google DeepMind. Gemini 3: Introducing the latest Gemini AI model from Google. https: //blog.google/products-and-platforms/products/gemini/gemini-3/ , November 2025

work page 2025
[11]

Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications.arXiv preprint arXiv:2509.26490, 2025

Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, and Yunke Zhao. Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications.arXiv preprint arXiv:2509.26490, 2025

work page arXiv 2025
[12]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

work page 2024
[13]

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014. doi: 10.3115/v1/D14-1086. URLhttps://aclanthology.org/D14-1086/. 10

work page doi:10.3115/v1/d14-1086 2014
[14]

Kimi-VL Technical Report

Kimi Team. Kimi-VL technical report, 2025. URLhttps://arxiv.org/abs/2504.07491

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Shamma, et al

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International Journal of Computer Vision, 123(1):32–73, 2017. doi: 10.1007/s11263-016-0981-7

work page doi:10.1007/s11263-016-0981-7 2017
[16]

LLaV A-NeXT: Improved reasoning, OCR, and world knowledge.https://llava-vl

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaV A-NeXT: Improved reasoning, OCR, and world knowledge.https://llava-vl. github.io/blog/2024-01-30-llava-next/, January 2024

work page 2024
[17]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean Conference on Computer Vision, 2024

work page 2024
[18]

Nano Claude Code: A lightweight python agent loop

Xinhao Ma. Nano Claude Code: A lightweight python agent loop. https://github.com/ mxh1999/nano-claude-code, 2024. Open-source reference implementation

work page 2024
[19]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv. org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026. Public model idgpt-5.4-2026-03-05

work page 2026
[22]

Kosmos-2: Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. In International Conference on Learning Representations, 2024

work page 2024
[23]

Qwen3-VL technical report, 2025

Qwen Team. Qwen3-VL technical report, 2025. URL https://arxiv.org/abs/2511. 21631

work page 2025
[24]

Nazia Tasnim, Keanu Nichols, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, and Bryan A. Plummer. Seeing isn’t orienting: A cognitively grounded benchmark reveals systematic orientation failures in MLLMs (DORI).arXiv preprint arXiv:2505.21649, 2025. URLhttps://arxiv.org/abs/2505.21649

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Thinking in structures: Evaluating spatial intelligence through reasoning on constrained manifolds (SSI-Bench).arXiv preprint arXiv:2602.07864, 2026

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, and Jiansheng Fan. Thinking in structures: Evaluating spatial intelligence through reasoning on constrained manifolds (SSI-Bench).arXiv preprint arXiv:2602.07864, 2026. URL https://arxiv.org/ abs/2602.07864

work page arXiv 2026
[26]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-Agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[27]

Mmsi-bench: A benchmark for multi-image spatial intelligence.arXiv:2505.23764, 2025

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. MMSI-Bench: A benchmark for multi-image spatial intelligence.arXiv preprint arXiv:2505.23764, 2025. URL https://arxiv.org/abs/2505. 23764

work page arXiv 2025
[28]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

work page 2023
[29]

URLhttps://openreview.net/forum?id=WE_vluYUL-X

OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

work page 2023
[30]

Ferret: Refer and ground anything anywhere at any granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. InInternational Conference on Learning Representations, 2024. 11

work page 2024
[31]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. InEuropean Conference on Computer Vision (ECCV), pages 69–85, 2016. doi: 10.1007/978-3-319-46475-6_5

work page doi:10.1007/978-3-319-46475-6_5 2016
[32]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. URL https://arxiv.org/abs/2504.10479

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Pavement distress detection using convolutional neural networks with images captured via uav

Junqing Zhu, Jingtao Zhong, Tao Ma, Xiaoming Huang, Weiguang Zhang, and Yang Zhou. Pavement distress detection using convolutional neural networks with images captured via uav. Automation in Construction, 133:103991, 2022. doi: 10.1016/j.autcon.2021.103991

work page doi:10.1016/j.autcon.2021.103991 2022
[35]

bbox": [x1, y1, x2, y2],

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Qinqin Nie, Hao Cheng, et al. VisDrone-DET2018: The vision meets drone object detection in image challenge results. InEuropean Conference on Computer Vision (ECCV) Workshops, pages 437–468, 2018. doi: 10.1007/978-3-030-11021-5_27. Challenge results paper; full author list (60+) in the...

work page doi:10.1007/978-3-030-11021-5_27 2018
[36]

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

Introducing Claude Haiku 4.5

Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/ claude-haiku-4-5 , October 2025. System card available at https://anthropic.com/ claude-haiku-4-5-system-card

work page 2025

[2] [3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report, 2025. URL https: //arxiv.org/abs/2502.13923. Qwen Team, Alibaba Group

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [5]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-bench: Eval- uating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [6]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024. doi: 10.48550/ARXIV .2410.07095

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[6] [7]

A pavement crack detection and evaluation framework for a uav inspection system based on deep learning.Applied Sciences, 14(3):1157, 2024

Xinbao Chen, Chang Liu, Long Chen, Xiaodong Zhu, Yaohui Zhang, and Chenxi Wang. A pavement crack detection and evaluation framework for a uav inspection system based on deep learning.Applied Sciences, 14(3):1157, 2024. doi: 10.3390/app14031157

work page doi:10.3390/app14031157 2024

[7] [8]

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zis- serman. The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88(2):303–338, 2010. doi: 10.1007/s11263-009-0275-4

work page doi:10.1007/s11263-009-0275-4 2010

[8] [9]

Inês Feitosa, Bruno Santos, and Paulo J. G. Almeida. Pavement inspection in transport in- frastructures using unmanned aerial vehicles (uavs).Sustainability, 16(5):2207, 2024. doi: 10.3390/su16052207

work page doi:10.3390/su16052207 2024

[9] [10]

Gemini 3: Introducing the latest Gemini AI model from Google

Google DeepMind. Gemini 3: Introducing the latest Gemini AI model from Google. https: //blog.google/products-and-platforms/products/gemini/gemini-3/ , November 2025

work page 2025

[10] [11]

Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications.arXiv preprint arXiv:2509.26490, 2025

Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, and Yunke Zhao. Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications.arXiv preprint arXiv:2509.26490, 2025

work page arXiv 2025

[11] [12]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

work page 2024

[12] [13]

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014. doi: 10.3115/v1/D14-1086. URLhttps://aclanthology.org/D14-1086/. 10

work page doi:10.3115/v1/d14-1086 2014

[13] [14]

Kimi-VL Technical Report

Kimi Team. Kimi-VL technical report, 2025. URLhttps://arxiv.org/abs/2504.07491

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [15]

Shamma, et al

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International Journal of Computer Vision, 123(1):32–73, 2017. doi: 10.1007/s11263-016-0981-7

work page doi:10.1007/s11263-016-0981-7 2017

[15] [16]

LLaV A-NeXT: Improved reasoning, OCR, and world knowledge.https://llava-vl

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaV A-NeXT: Improved reasoning, OCR, and world knowledge.https://llava-vl. github.io/blog/2024-01-30-llava-next/, January 2024

work page 2024

[16] [17]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean Conference on Computer Vision, 2024

work page 2024

[17] [18]

Nano Claude Code: A lightweight python agent loop

Xinhao Ma. Nano Claude Code: A lightweight python agent loop. https://github.com/ mxh1999/nano-claude-code, 2024. Open-source reference implementation

work page 2024

[18] [19]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [20]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv. org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [21]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026. Public model idgpt-5.4-2026-03-05

work page 2026

[21] [22]

Kosmos-2: Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. In International Conference on Learning Representations, 2024

work page 2024

[22] [23]

Qwen3-VL technical report, 2025

Qwen Team. Qwen3-VL technical report, 2025. URL https://arxiv.org/abs/2511. 21631

work page 2025

[23] [24]

Nazia Tasnim, Keanu Nichols, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, and Bryan A. Plummer. Seeing isn’t orienting: A cognitively grounded benchmark reveals systematic orientation failures in MLLMs (DORI).arXiv preprint arXiv:2505.21649, 2025. URLhttps://arxiv.org/abs/2505.21649

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [25]

Thinking in structures: Evaluating spatial intelligence through reasoning on constrained manifolds (SSI-Bench).arXiv preprint arXiv:2602.07864, 2026

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, and Jiansheng Fan. Thinking in structures: Evaluating spatial intelligence through reasoning on constrained manifolds (SSI-Bench).arXiv preprint arXiv:2602.07864, 2026. URL https://arxiv.org/ abs/2602.07864

work page arXiv 2026

[25] [26]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-Agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[26] [27]

Mmsi-bench: A benchmark for multi-image spatial intelligence.arXiv:2505.23764, 2025

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. MMSI-Bench: A benchmark for multi-image spatial intelligence.arXiv preprint arXiv:2505.23764, 2025. URL https://arxiv.org/abs/2505. 23764

work page arXiv 2025

[27] [28]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

work page 2023

[28] [29]

URLhttps://openreview.net/forum?id=WE_vluYUL-X

OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

work page 2023

[29] [30]

Ferret: Refer and ground anything anywhere at any granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. InInternational Conference on Learning Representations, 2024. 11

work page 2024

[30] [31]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. InEuropean Conference on Computer Vision (ECCV), pages 69–85, 2016. doi: 10.1007/978-3-319-46475-6_5

work page doi:10.1007/978-3-319-46475-6_5 2016

[31] [32]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [33]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. URL https://arxiv.org/abs/2504.10479

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [34]

Pavement distress detection using convolutional neural networks with images captured via uav

Junqing Zhu, Jingtao Zhong, Tao Ma, Xiaoming Huang, Weiguang Zhang, and Yang Zhou. Pavement distress detection using convolutional neural networks with images captured via uav. Automation in Construction, 133:103991, 2022. doi: 10.1016/j.autcon.2021.103991

work page doi:10.1016/j.autcon.2021.103991 2022

[34] [35]

bbox": [x1, y1, x2, y2],

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Qinqin Nie, Hao Cheng, et al. VisDrone-DET2018: The vision meets drone object detection in image challenge results. InEuropean Conference on Computer Vision (ECCV) Workshops, pages 437–468, 2018. doi: 10.1007/978-3-030-11021-5_27. Challenge results paper; full author list (60+) in the...

work page doi:10.1007/978-3-030-11021-5_27 2018

[35] [36]

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page