HLL: Can Agents Cross Humanity's Last Line of Verification?

Dongrui Liu; Gongshen Liu; Hongliang Wu; Linfeng Zhang; Sirui Song; Su Su; Wen Shen; Xinhao Song; Zhihua Wei

arxiv: 2606.02449 · v1 · pith:O4HMDTSEnew · submitted 2026-06-01 · 💻 cs.AI · cs.CL· cs.CV· cs.LG· cs.MM

HLL: Can Agents Cross Humanity's Last Line of Verification?

Xinhao Song , Su Su , Sirui Song , Hongliang Wu , Wen Shen , Zhihua Wei , Gongshen Liu , Linfeng Zhang

show 1 more author

Dongrui Liu

This is my paper

Pith reviewed 2026-06-28 14:55 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.LGcs.MM

keywords multimodal agentsCAPTCHA verificationhuman substitutionGUI agentsbenchmark evaluationaction tracesautomation boundaries

0 comments

The pith

Current multimodal agents remain brittle at the human-substitution boundary in CAPTCHA verifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new benchmark called HLL to test whether multimodal agents can perform the interactive CAPTCHA verifications that services use to block automation before protected actions such as account creation or form submission. It places eight frontier agents into a closed-loop GUI setting and applies controlled realism stressors including cluttered pages, harder variants, and the requirement that correct answers be backed by valid action traces. Results show inconsistent performance across verification types, clear degradation when interfaces become realistic, and further drops when traces must be produced. A sympathetic reader would care because crossing this boundary is a concrete precondition for agents to act as full human substitutes in real workflows. The benchmark isolates specific failure modes in how agents locate elements, calibrate actions, track states, and keep processes consistent.

Core claim

HLL is a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross the human-verification boundary through grounded, human-like interaction rather than recognition alone. When eight frontier multimodal agents are tested in a closed-loop GUI environment, performance varies sharply across verification types, degrades under realistic interface conditions such as cluttered webpages and harder task variants, and drops further when correct answers must be supported by valid action traces. The benchmark thereby exposes concrete gaps in localization, action calibration, state tracking, and process consistency.

What carries the argument

The HLL benchmark, which applies controlled realism stressors to diverse CAPTCHA interactions and requires trace-conditioned validation of the solving process.

If this is right

Performance varies sharply across different verification types.
Success rates degrade when agents encounter cluttered webpages or harder task variants.
Performance drops further when agents must support correct answers with valid action traces.
Gaps appear specifically in localization, action calibration, state tracking, and process consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents that close these gaps could automate additional protected online actions that currently require human presence.
Service providers may need to layer new verification methods on top of current CAPTCHAs if agent performance improves.
The benchmark offers a repeatable testbed for measuring progress toward grounded GUI interaction beyond recognition tasks.

Load-bearing premise

The controlled CAPTCHA interactions and realism stressors in HLL sufficiently represent the actual human-verification boundaries that services place before protected actions in real deployments.

What would settle it

A trial in which one or more agents achieve consistently high success rates on every HLL verification type under all listed realism stressors while also generating action traces that correctly document the solving process.

Figures

Figures reproduced from arXiv: 2606.02449 by Dongrui Liu, Gongshen Liu, Hongliang Wu, Linfeng Zhang, Sirui Song, Su Su, Wen Shen, Xinhao Song, Zhihua Wei.

**Figure 1.** Figure 1: CAPTCHA as the final frontier: securing web services by testing interactive, humanlevel reasoning against automated agents. This last-mile verification barrier is largely missing from current agent evaluation. Existing web and GUI benchmarks measure progress on navigation and application control [38, 15], browsing and web-game capabilities [48, 45], or general task completion [29], but verification ste… view at source ↗

**Figure 2.** Figure 2: Limitations of existing benchmarks: current static tasks fail to capture the complex, grounded interactions required in realistic environments. To address this gap, we introduce Humanity’s Last Line of Verification (HLL), a controlled benchmark for evaluating interactive verification as a core capability of general multimodal agents. HLL spans ten CAPTCHA families with heterogeneous interaction requireme… view at source ↗

**Figure 3.** Figure 3: Overview of the HLL benchmark structure. The benchmark combines heterogeneous [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of benchmark instances across CAPTCHA task families and realismaxis settings [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Static performance across the ten CAPTCHA task families. Evaluated Agents. We evaluate a diverse set of frontier multimodal agents spanning multiple model families. The proprietary systems include OpenAI’s GPT-5.4 [37], Google’s Gemini-3.1-Pro [19], Anthropic’s Claude-Sonnet-4.6 [6] and Claude-Opus4.6 [5], xAI’s Grok-4 [51], GLM-5V [22], MiniMaxM2.7 [35], and Qwen-Max [8]. All models are deployed as cl… view at source ↗

**Figure 6.** Figure 6: Dynamic performance across the CAPTCHA task families. We finally evaluate dynamic variants, where a semantically correct answer must additionally satisfy trace-conditioned validation rules over simulatorobservable interaction traces and committed page states [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Representative static failure case for perceptual decoding errors. [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Representative static failure case for target localization and candidate filtering errors. [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Representative static failure case for spatial grounding and action mapping errors. [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Representative static failure case for geometric calibration failures. [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Representative static failure case for visual structure reconstruction failures. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Representative static failure case for UI affordance and interactive-region misunderstand [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Representative static failure case for state tracking and completion-judgment failures. [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗

**Figure 14.** Figure 14: Representative dynamic failure case for Trajectory-continuity and action-authenticity [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗

read the original abstract

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbf{Humanity's Last Line of Verification (HLL)}, a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HLL gives a new interactive CAPTCHA benchmark with trace checks but its claim to represent real service boundaries rests on an untested assumption.

read the letter

The paper's core offering is HLL, a benchmark that puts multimodal agents into closed-loop GUI interactions with CAPTCHAs while adding controlled stressors like page clutter, harder variants, and trace-conditioned validation.

This is new relative to the cited literature because it moves past static recognition to require grounded action sequences that must be verified. The evaluation of eight agents shows clear patterns: performance varies by verification type, drops under the added conditions, and falls further when traces are required. The code release supports anyone who wants to reproduce or extend the setup.

The results highlight concrete gaps in localization, action calibration, and state tracking. That part of the work is straightforward empirical reporting.

The soft spot is the jump from benchmark to deployment claim. Real services combine visual puzzles with behavioral signals, fingerprints, rate limits, and adaptive challenges that evolve. The controlled GUI environment abstracts those away, and the paper supplies no evidence that HLL scores correlate with live service outcomes. The stress-test concern holds up on the text given.

The paper is aimed at people building or evaluating agents for web interfaces and protected workflows. A reader focused on robustness testing will find the degradation numbers and the released code useful.

It should go to peer review. The benchmark design is concrete enough to benefit from referee input on validation and scope.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Humanity's Last Line of Verification (HLL), a controlled benchmark using interactive CAPTCHA tasks in a closed-loop GUI environment to test whether multimodal agents can substitute for humans at service-protected verification boundaries. It evaluates eight frontier agents, reporting sharp performance variation across verification types, degradation under realism stressors (cluttered pages, harder variants), and further drops when correct answers must be supported by valid action traces. The work claims this exposes gaps in localization, action calibration, state tracking, and process consistency, positioning HLL as a testbed for measuring progress toward human-like substitution in protected workflows. Code is released at the provided GitHub link.

Significance. If the experimental results are reproducible and the benchmark stressors adequately capture real deployment boundaries, the work provides a concrete, reproducible empirical testbed for a practically relevant capability gap in multimodal agents. The public code release is a clear strength that supports verification and extension by others. The contribution is primarily evaluative rather than theoretical, with no parameter-free derivations or machine-checked proofs.

major comments (2)

[Abstract and §1] Abstract and §1: The central claim that HLL performance indicates closeness to human substitution in 'protected real-world workflows' is load-bearing on the untested assumption that the controlled CAPTCHA interactions, cluttered pages, harder variants, and trace-conditioned validation form a sufficient proxy. No correlation with live service outcomes, behavioral signals, device fingerprinting, or adaptive challenges is reported, leaving the extrapolation from benchmark brittleness to deployment boundaries unsupported.
[§4 (Evaluation) and methods] §4 (Evaluation) and methods: The reported degradation patterns and trace-conditioned validation results rest on an experimental setup whose implementation details (GUI environment closure, action trace logging, and stressor application) are not described at a level that allows independent verification of the soundness 5.0 rating. This directly affects whether the observed brittleness can be taken as evidence for the human-substitution boundary claim.

minor comments (2)

[Methods] Notation for 'trace-conditioned validation' is introduced without a formal definition or pseudocode; a short methods subsection would improve clarity.
[Figures and Tables] Figure captions and table headers should explicitly state the number of trials per agent and per condition to allow readers to assess statistical reliability of the degradation patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and reproducibility.

read point-by-point responses

Referee: [Abstract and §1] Abstract and §1: The central claim that HLL performance indicates closeness to human substitution in 'protected real-world workflows' is load-bearing on the untested assumption that the controlled CAPTCHA interactions, cluttered pages, harder variants, and trace-conditioned validation form a sufficient proxy. No correlation with live service outcomes, behavioral signals, device fingerprinting, or adaptive challenges is reported, leaving the extrapolation from benchmark brittleness to deployment boundaries unsupported.

Authors: We agree that the manuscript reports no direct correlation between HLL results and live service outcomes or other real-world signals such as device fingerprinting. HLL is presented as a controlled proxy benchmark to isolate capabilities like localization and state tracking under verifiable conditions. To address the concern, we will revise the abstract and §1 to frame the contribution more narrowly as a testbed exposing specific gaps rather than a direct indicator of substitution readiness in production workflows. A new limitations subsection will explicitly note the absence of live-service validation and the scope of the proxy. revision: partial
Referee: [§4 (Evaluation) and methods] §4 (Evaluation) and methods: The reported degradation patterns and trace-conditioned validation results rest on an experimental setup whose implementation details (GUI environment closure, action trace logging, and stressor application) are not described at a level that allows independent verification of the soundness 5.0 rating. This directly affects whether the observed brittleness can be taken as evidence for the human-substitution boundary claim.

Authors: We acknowledge that the current methods description is insufficient for independent verification of the experimental setup. Although the full implementation is released at the provided GitHub repository, we will expand §4 and the methods section in the revision to include precise specifications of GUI environment closure, action trace logging format and validation procedure, and the exact mechanisms for applying each realism stressor. Additional pseudocode and environment configuration details will be added to support reproducibility. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain

full rationale

The paper presents HLL as an empirical benchmark for testing multimodal agents on interactive CAPTCHA tasks in a closed-loop GUI setting. It evaluates eight agents under controlled stressors and reports observed performance patterns without any equations, fitted parameters, predictions derived from prior results, or mathematical derivations. Claims rest on direct measurement of agent behavior rather than reduction to self-referential inputs or self-citations. No load-bearing steps match the enumerated circularity patterns; the work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the HLL benchmark tasks and stressors capture real protected workflows; no free parameters or invented entities are introduced beyond the benchmark definition itself.

axioms (1)

domain assumption CAPTCHA verification constitutes a human-verification boundary that services deliberately protect against automation.
Stated in the abstract as the motivation for the benchmark.

pith-pipeline@v0.9.1-grok · 5798 in / 1157 out tokens · 17039 ms · 2026-06-28T14:55:47.872978+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 20 canonical work pages · 10 internal anchors

[1]

Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey.IEEE Internet of Things Journal, 8(1):65–84, 2020

Mohammed Abuhamad, Ahmed Abusnaina, DaeHun Nyang, and David Mohaisen. Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey.IEEE Internet of Things Journal, 8(1):65–84, 2020

2020
[2]

Becaptcha: Detecting human behavior in smartphone interaction using multiple inbuilt sensors

Alejandro Acien, Aythami Morales, Julian Fierrez, Ruben Vera-Rodriguez, and Ivan Bartolome. Becaptcha: Detecting human behavior in smartphone interaction using multiple inbuilt sensors. arXiv preprint arXiv:2002.00918, 2020

work page arXiv 2002
[3]

Becaptcha: Behavioral bot detection using touchscreen and mobile sensors bench- marked on humidb.Engineering Applications of Artificial Intelligence, 98:104058, 2021

Alejandro Acien, Aythami Morales, Julian Fierrez, Ruben Vera-Rodriguez, and Oscar Delgado- Mohatar. Becaptcha: Behavioral bot detection using touchscreen and mobile sensors bench- marked on humidb.Engineering Applications of Artificial Intelligence, 98:104058, 2021

2021
[4]

A survey on captcha: Origin, applications and classification

Abdalnaser Muhammad Algwil. A survey on captcha: Origin, applications and classification. Journal of Basic Sciences, 36(1):1–37, 2023

2023
[5]

Claude opus 4.6 system card

Anthropic. Claude opus 4.6 system card. https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, 2025

2025
[6]

Claude sonnet 4.6 system card

Anthropic. Claude sonnet 4.6 system card. https://www-cdn.anthropic.com/ bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf, 2026

2026
[7]

Abusing images and sounds for indirect instruction injection in multi-modal llms.arXiv preprint arXiv:2307.10490, 2023

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. Abusing images and sounds for indirect instruction injection in multi-modal llms.arXiv preprint arXiv:2307.10490, 2023

work page arXiv 2023
[8]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Text-based captcha strengths and weak- nesses

Elie Bursztein, Matthieu Martin, and John Mitchell. Text-based captcha strengths and weak- nesses. InProceedings of the 18th ACM conference on Computer and communications security, pages 125–138, 2011

2011
[10]

Building segmen- tation based human-friendly human interaction proofs (hips)

Kumar Chellapilla, Kevin Larson, Patrice Y Simard, and Mary Czerwinski. Building segmen- tation based human-friendly human interaction proofs (hips). InInternational Workshop on Human Interactive Proofs, pages 1–26. Springer, 2005

2005
[11]

Evaluating the robustness of multimodal agents against active environmental injection attacks

Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, and Shengyu Zhang. Evaluating the robustness of multimodal agents against active environmental injection attacks. InProceedings of the 33rd ACM International Conference on Multimedia, pages 11648–11656, 2025

2025
[12]

Os-kairos: Adaptive interaction for mllm-powered gui agents

Pengzhou Cheng, Zheng Wu, Zongru Wu, Tianjie Ju, Aston Zhang, Zhuosheng Zhang, and Gongshen Liu. Os-kairos: Adaptive interaction for mllm-powered gui agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6701–6725, 2025

2025
[13]

On the Measure of Intelligence

François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[14]

Oedipus: Llm- enchanced reasoning captcha solver

Gelei Deng, Haoran Ou, Yi Liu, Jie Zhang, Tianwei Zhang, and Yang Liu. Oedipus: Llm- enchanced reasoning captcha solver. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 6–20, 2025

2025
[15]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[16]

Illusioncaptcha: A captcha based on visual illusion

Ziqi Ding, Gelei Deng, Yi Liu, Junchen Ding, Jieshan Chen, Yulei Sui, and Yuekang Li. Illusioncaptcha: A captcha based on visual illusion. InProceedings of the ACM on Web Conference 2025, pages 3683–3691, 2025

2025
[17]

Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication

Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE transactions on information forensics and security, 8(1):136–148, 2012. 11

2012
[18]

Research on the security of visual reasoning CAPTCHA

Yipeng Gao, Haichang Gao, Sainan Luo, Yang Zi, Shudong Zhang, Wenjie Mao, Ping Wang, Yulong Shen, and Jeff Yan. Research on the security of visual reasoning CAPTCHA. In30th USENIX Security Symposium (USENIX Security 21), pages 3291–3308, 2021

2021
[19]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026

2026
[20]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

2017
[21]

Capture the bot: Using adversarial examples to improve captcha robustness to bot attacks.IEEE Intelligent Systems, 36(5):104–112, 2020

Dorjan Hitaj, Briland Hitaj, Sushil Jajodia, and Luigi V Mancini. Capture the bot: Using adversarial examples to improve captcha robustness to bot attacks.IEEE Intelligent Systems, 36(5):104–112, 2020

2020
[22]

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents.arXiv preprint arXiv:2604.26752, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

2024
[24]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

2017
[25]

Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

Simon Khan, Charles Devlen, Michael Manno, and Daqing Hou. Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

2024
[26]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

2024
[27]

EIA: Environmental injection attack on generalist web agents for privacy leakage

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

work page arXiv 2024
[28]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

2023
[29]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

2024
[31]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Open captchaworld: A comprehensive web-based platform for testing and benchmarking multimodal llm agents.arXiv preprint arXiv:2505.24878, 2025

Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, and Zhiqiang Shen. Open captchaworld: A comprehensive web-based platform for testing and benchmarking multimodal llm agents.arXiv preprint arXiv:2505.24878, 2025

work page arXiv 2025
[33]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 12

2025
[34]

Caution for the environment: Multimodal llm agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal llm agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024

work page arXiv 2024
[35]

Minimax m2.7: Early echoes of self-evolution

MiniMax. Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/ minimax-m27-en, 2026. Accessed: 2026-04-30

2026
[36]

Deep-captcha: a deep learning based captcha solver for vulnerability assessment.arXiv preprint arXiv:2006.08296, 2020

Zahra Noury and Mahdi Rezaei. Deep-captcha: a deep learning based captcha solver for vulnerability assessment.arXiv preprint arXiv:2006.08296, 2020

work page arXiv 2006
[37]

Gpt-5.4 thinking system card

OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/ gpt-5-4-thinking-system-card/, 2026. Accessed: 2026-04-30

2026
[38]

WebCanvas: Benchmarking Web Agents in Online Environments

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Breaking recaptchav2

Andreas Plesner, Tobias V ontobel, and Roger Wattenhofer. Breaking recaptchav2. In2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pages 1047–1056. IEEE, 2024

2024
[40]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

An empirical study & evaluation of modern {CAPTCHAs}

Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, and Ai Enkoji. An empirical study & evaluation of modern {CAPTCHAs}. In32nd usenix security symposium (usenix security 23), pages 3081–3097, 2023

2023
[42]

Adversarial captchas.IEEE transactions on cybernetics, 52(7):6095–6108, 2021

Chenghui Shi, Xiaogang Xu, Shouling Ji, Kai Bu, Jianhai Chen, Raheem Beyah, and Ting Wang. Adversarial captchas.IEEE transactions on cybernetics, 52(7):6095–6108, 2021

2021
[43]

I am robot:(deep) learning to break semantic image captchas

Suphannee Sivakorn, Iasonas Polakis, and Angelos D Keromytis. I am robot:(deep) learning to break semantic image captchas. In2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 388–403. IEEE, 2016

2016
[44]

Are {CAPTCHAs} still bot-hard? generalized visual {CAPTCHA} solving with agentic vision language model

Xiwen Teoh, Yun Lin, Siqi Li, Ruofan Liu, Avi Sollomoni, Yaniv Harel, and Jin Song Dong. Are {CAPTCHAs} still bot-hard? generalized visual {CAPTCHA} solving with agentic vision language model. In34th USENIX Security Symposium (USENIX Security 25), pages 3747–3766, 2025

2025
[45]

Webgames: Challenging general-purpose web-browsing ai agents.arXiv preprint arXiv:2502.18356, 2025

George Thomas, Alex J Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, and Marvin Purtorab. Webgames: Challenging general-purpose web-browsing ai agents.arXiv preprint arXiv:2502.18356, 2025

work page arXiv 2025
[46]

Hopper, and John Langford

Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. Captcha: Using hard ai problems for security. InAdvances in Cryptology—EUROCRYPT 2003, volume 2656 of Lecture Notes in Computer Science, pages 294–311. Springer, 2003

2003
[47]

A captcha design based on visual reasoning

Haipeng Wang, Feng Zheng, Zhuoming Chen, Yi Lu, Jing Gao, and Renjia Wei. A captcha design based on visual reasoning. In2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1967–1971. IEEE, 2018

1967
[48]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814, 2024

Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814, 2024

work page arXiv 2024
[50]

Mca-bench: A multi- modal benchmark for evaluating captcha robustness against vlm-based attacks

Zonglin Wu, Yule Xue, Yaoyao Feng, Xiaolong Wang, and Yiren Song. Mca-bench: A multi- modal benchmark for evaluating captcha robustness against vlm-based attacks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38039–38047, 2026. 13

2026
[51]

Grok-4 model card

xAI. Grok-4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf , 2025

2025
[52]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[53]

Advagent: Controllable blackbox red-teaming on web agents.arXiv preprint arXiv:2410.17401, 2024

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advagent: Controllable blackbox red-teaming on web agents.arXiv preprint arXiv:2410.17401, 2024

work page arXiv 2024
[54]

An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

work page arXiv 2025
[55]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

2024
[57]

Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportu- nities.Journal of Network and Computer Applications, 191:103162, 2021

Ahmad Zairi Zaidi, Chun Yong Chong, Zhe Jin, Rajendran Parthiban, and Ali Safaa Sadiq. Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportu- nities.Journal of Network and Computer Applications, 191:103162, 2021

2021
[58]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

2025
[59]

Attacking vision-language computer agents via pop-ups

Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8387–8401, 2025

2025
[60]

Egotextvqa: Towards egocentric scene-text aware video question answering

Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. Egotextvqa: Towards egocentric scene-text aware video question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3363–3373, 2025

2025
[61]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 14 A Detailed Benchmark Specification This appendix provides detailed benchmark information complementary to Secti...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey.IEEE Internet of Things Journal, 8(1):65–84, 2020

Mohammed Abuhamad, Ahmed Abusnaina, DaeHun Nyang, and David Mohaisen. Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey.IEEE Internet of Things Journal, 8(1):65–84, 2020

2020

[2] [2]

Becaptcha: Detecting human behavior in smartphone interaction using multiple inbuilt sensors

Alejandro Acien, Aythami Morales, Julian Fierrez, Ruben Vera-Rodriguez, and Ivan Bartolome. Becaptcha: Detecting human behavior in smartphone interaction using multiple inbuilt sensors. arXiv preprint arXiv:2002.00918, 2020

work page arXiv 2002

[3] [3]

Becaptcha: Behavioral bot detection using touchscreen and mobile sensors bench- marked on humidb.Engineering Applications of Artificial Intelligence, 98:104058, 2021

Alejandro Acien, Aythami Morales, Julian Fierrez, Ruben Vera-Rodriguez, and Oscar Delgado- Mohatar. Becaptcha: Behavioral bot detection using touchscreen and mobile sensors bench- marked on humidb.Engineering Applications of Artificial Intelligence, 98:104058, 2021

2021

[4] [4]

A survey on captcha: Origin, applications and classification

Abdalnaser Muhammad Algwil. A survey on captcha: Origin, applications and classification. Journal of Basic Sciences, 36(1):1–37, 2023

2023

[5] [5]

Claude opus 4.6 system card

Anthropic. Claude opus 4.6 system card. https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, 2025

2025

[6] [6]

Claude sonnet 4.6 system card

Anthropic. Claude sonnet 4.6 system card. https://www-cdn.anthropic.com/ bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf, 2026

2026

[7] [7]

Abusing images and sounds for indirect instruction injection in multi-modal llms.arXiv preprint arXiv:2307.10490, 2023

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. Abusing images and sounds for indirect instruction injection in multi-modal llms.arXiv preprint arXiv:2307.10490, 2023

work page arXiv 2023

[8] [8]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Text-based captcha strengths and weak- nesses

Elie Bursztein, Matthieu Martin, and John Mitchell. Text-based captcha strengths and weak- nesses. InProceedings of the 18th ACM conference on Computer and communications security, pages 125–138, 2011

2011

[10] [10]

Building segmen- tation based human-friendly human interaction proofs (hips)

Kumar Chellapilla, Kevin Larson, Patrice Y Simard, and Mary Czerwinski. Building segmen- tation based human-friendly human interaction proofs (hips). InInternational Workshop on Human Interactive Proofs, pages 1–26. Springer, 2005

2005

[11] [11]

Evaluating the robustness of multimodal agents against active environmental injection attacks

Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, and Shengyu Zhang. Evaluating the robustness of multimodal agents against active environmental injection attacks. InProceedings of the 33rd ACM International Conference on Multimedia, pages 11648–11656, 2025

2025

[12] [12]

Os-kairos: Adaptive interaction for mllm-powered gui agents

Pengzhou Cheng, Zheng Wu, Zongru Wu, Tianjie Ju, Aston Zhang, Zhuosheng Zhang, and Gongshen Liu. Os-kairos: Adaptive interaction for mllm-powered gui agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6701–6725, 2025

2025

[13] [13]

On the Measure of Intelligence

François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[14] [14]

Oedipus: Llm- enchanced reasoning captcha solver

Gelei Deng, Haoran Ou, Yi Liu, Jie Zhang, Tianwei Zhang, and Yang Liu. Oedipus: Llm- enchanced reasoning captcha solver. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 6–20, 2025

2025

[15] [15]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023

[16] [16]

Illusioncaptcha: A captcha based on visual illusion

Ziqi Ding, Gelei Deng, Yi Liu, Junchen Ding, Jieshan Chen, Yulei Sui, and Yuekang Li. Illusioncaptcha: A captcha based on visual illusion. InProceedings of the ACM on Web Conference 2025, pages 3683–3691, 2025

2025

[17] [17]

Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication

Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE transactions on information forensics and security, 8(1):136–148, 2012. 11

2012

[18] [18]

Research on the security of visual reasoning CAPTCHA

Yipeng Gao, Haichang Gao, Sainan Luo, Yang Zi, Shudong Zhang, Wenjie Mao, Ping Wang, Yulong Shen, and Jeff Yan. Research on the security of visual reasoning CAPTCHA. In30th USENIX Security Symposium (USENIX Security 21), pages 3291–3308, 2021

2021

[19] [19]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026

2026

[20] [20]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

2017

[21] [21]

Capture the bot: Using adversarial examples to improve captcha robustness to bot attacks.IEEE Intelligent Systems, 36(5):104–112, 2020

Dorjan Hitaj, Briland Hitaj, Sushil Jajodia, and Luigi V Mancini. Capture the bot: Using adversarial examples to improve captcha robustness to bot attacks.IEEE Intelligent Systems, 36(5):104–112, 2020

2020

[22] [22]

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, et al. Glm-5v-turbo: Toward a native foundation model for multimodal agents.arXiv preprint arXiv:2604.26752, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

2024

[24] [24]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

2017

[25] [25]

Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

Simon Khan, Charles Devlen, Michael Manno, and Daqing Hou. Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

2024

[26] [26]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

2024

[27] [27]

EIA: Environmental injection attack on generalist web agents for privacy leakage

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage.arXiv preprint arXiv:2409.11295, 2024

work page arXiv 2024

[28] [28]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

2023

[29] [29]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

2024

[31] [31]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Open captchaworld: A comprehensive web-based platform for testing and benchmarking multimodal llm agents.arXiv preprint arXiv:2505.24878, 2025

Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, and Zhiqiang Shen. Open captchaworld: A comprehensive web-based platform for testing and benchmarking multimodal llm agents.arXiv preprint arXiv:2505.24878, 2025

work page arXiv 2025

[33] [33]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025. 12

2025

[34] [34]

Caution for the environment: Multimodal llm agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal llm agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024

work page arXiv 2024

[35] [35]

Minimax m2.7: Early echoes of self-evolution

MiniMax. Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/ minimax-m27-en, 2026. Accessed: 2026-04-30

2026

[36] [36]

Deep-captcha: a deep learning based captcha solver for vulnerability assessment.arXiv preprint arXiv:2006.08296, 2020

Zahra Noury and Mahdi Rezaei. Deep-captcha: a deep learning based captcha solver for vulnerability assessment.arXiv preprint arXiv:2006.08296, 2020

work page arXiv 2006

[37] [37]

Gpt-5.4 thinking system card

OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/ gpt-5-4-thinking-system-card/, 2026. Accessed: 2026-04-30

2026

[38] [38]

WebCanvas: Benchmarking Web Agents in Online Environments

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Breaking recaptchav2

Andreas Plesner, Tobias V ontobel, and Roger Wattenhofer. Breaking recaptchav2. In2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pages 1047–1056. IEEE, 2024

2024

[40] [40]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

An empirical study & evaluation of modern {CAPTCHAs}

Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, and Ai Enkoji. An empirical study & evaluation of modern {CAPTCHAs}. In32nd usenix security symposium (usenix security 23), pages 3081–3097, 2023

2023

[42] [42]

Adversarial captchas.IEEE transactions on cybernetics, 52(7):6095–6108, 2021

Chenghui Shi, Xiaogang Xu, Shouling Ji, Kai Bu, Jianhai Chen, Raheem Beyah, and Ting Wang. Adversarial captchas.IEEE transactions on cybernetics, 52(7):6095–6108, 2021

2021

[43] [43]

I am robot:(deep) learning to break semantic image captchas

Suphannee Sivakorn, Iasonas Polakis, and Angelos D Keromytis. I am robot:(deep) learning to break semantic image captchas. In2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 388–403. IEEE, 2016

2016

[44] [44]

Are {CAPTCHAs} still bot-hard? generalized visual {CAPTCHA} solving with agentic vision language model

Xiwen Teoh, Yun Lin, Siqi Li, Ruofan Liu, Avi Sollomoni, Yaniv Harel, and Jin Song Dong. Are {CAPTCHAs} still bot-hard? generalized visual {CAPTCHA} solving with agentic vision language model. In34th USENIX Security Symposium (USENIX Security 25), pages 3747–3766, 2025

2025

[45] [45]

Webgames: Challenging general-purpose web-browsing ai agents.arXiv preprint arXiv:2502.18356, 2025

George Thomas, Alex J Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, and Marvin Purtorab. Webgames: Challenging general-purpose web-browsing ai agents.arXiv preprint arXiv:2502.18356, 2025

work page arXiv 2025

[46] [46]

Hopper, and John Langford

Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. Captcha: Using hard ai problems for security. InAdvances in Cryptology—EUROCRYPT 2003, volume 2656 of Lecture Notes in Computer Science, pages 294–311. Springer, 2003

2003

[47] [47]

A captcha design based on visual reasoning

Haipeng Wang, Feng Zheng, Zhuoming Chen, Yi Lu, Jing Gao, and Renjia Wei. A captcha design based on visual reasoning. In2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1967–1971. IEEE, 2018

1967

[48] [48]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814, 2024

Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents.arXiv preprint arXiv:2406.12814, 2024

work page arXiv 2024

[50] [50]

Mca-bench: A multi- modal benchmark for evaluating captcha robustness against vlm-based attacks

Zonglin Wu, Yule Xue, Yaoyao Feng, Xiaolong Wang, and Yiren Song. Mca-bench: A multi- modal benchmark for evaluating captcha robustness against vlm-based attacks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38039–38047, 2026. 13

2026

[51] [51]

Grok-4 model card

xAI. Grok-4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf , 2025

2025

[52] [52]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024

[53] [53]

Advagent: Controllable blackbox red-teaming on web agents.arXiv preprint arXiv:2410.17401, 2024

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advagent: Controllable blackbox red-teaming on web agents.arXiv preprint arXiv:2410.17401, 2024

work page arXiv 2024

[54] [54]

An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

work page arXiv 2025

[55] [55]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

2024

[57] [57]

Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportu- nities.Journal of Network and Computer Applications, 191:103162, 2021

Ahmad Zairi Zaidi, Chun Yong Chong, Zhe Jin, Rajendran Parthiban, and Ali Safaa Sadiq. Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportu- nities.Journal of Network and Computer Applications, 191:103162, 2021

2021

[58] [58]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

2025

[59] [59]

Attacking vision-language computer agents via pop-ups

Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8387–8401, 2025

2025

[60] [60]

Egotextvqa: Towards egocentric scene-text aware video question answering

Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. Egotextvqa: Towards egocentric scene-text aware video question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3363–3373, 2025

2025

[61] [61]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 14 A Detailed Benchmark Specification This appendix provides detailed benchmark information complementary to Secti...

work page internal anchor Pith review Pith/arXiv arXiv 2023