arxiv: 2512.02318 · v3 · submitted 2025-12-02 · 💻 cs.CR · cs.AI

COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Pith reviewed 2026-05-17 03:19 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords CAPTCHA security, multimodal LLMs, visual puzzles, automated solvers, defense mechanisms, localization tasks

The pith

Certain visual CAPTCHA designs using fine-grained localization and implicit counting reduce state-of-the-art MLLM solver success rates from over 95% to zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how multimodal large language models undermine visual CAPTCHA security by evaluating seven leading MLLMs across eighteen real-world task types. Models solve recognition and low-interaction tasks reliably at human-like cost and speed, but struggle with tasks demanding fine-grained localization, multi-step spatial reasoning, or cross-frame consistency. The authors analyze reasoning traces to derive defense guidelines and validate them through a case study where hardening a CAPTCHA with localization and implicit counting drops success rates to zero. This shows that targeted structural changes can restore CAPTCHA effectiveness against current automated solvers.

Core claim

Multimodal LLMs can solve many visual CAPTCHA tasks effectively, yet incorporating fine-grained localization and implicit counting into task design reduces their success rate from over 95% to 0%, providing a concrete way to strengthen defenses.

What carries the argument

Fine-grained localization and implicit counting, which require models to perform precise spatial analysis and enumeration within the CAPTCHA puzzle.
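
To make the two ingredients concrete, the following is a minimal, hypothetical sketch of a server-side check that combines them. It is not the authors' implementation: the function name `verify_clicks`, the pixel tolerance, and the click-based response format are all assumptions made for illustration. The count is implicit because the prompt never states how many targets exist; the response is accepted only if every target receives exactly one sufficiently precise click.

```python
from math import hypot

Point = tuple[float, float]


def verify_clicks(clicks: list[Point], targets: list[Point], tolerance_px: float) -> bool:
    """Accept only if each hidden target is hit by exactly one click.

    Hypothetical checker. Assumes targets are spaced more than
    2 * tolerance_px apart, so a click can match at most one target
    and the greedy matching below is unambiguous.
    """
    if not targets:
        raise ValueError("a challenge needs at least one target")
    # Implicit counting: the solver must infer how many targets exist.
    if len(clicks) != len(targets):
        return False
    unmatched = list(targets)
    for cx, cy in clicks:
        # Fine-grained localization: the nearest remaining target must be within tolerance.
        nearest = min(unmatched, key=lambda t: hypot(t[0] - cx, t[1] - cy))
        if hypot(nearest[0] - cx, nearest[1] - cy) > tolerance_px:
            return False
        unmatched.remove(nearest)
    return True
```

Under this sketch, a solver that recognizes the object category but miscounts instances, or places one click a few pixels outside the tolerance, fails the whole challenge.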

If this is right

  • Platform operators should prioritize CAPTCHA tasks that demand localization and counting to counter MLLM threats.
  • Analysis of model reasoning traces can guide the selection and strengthening of specific CAPTCHA types.
  • Current MLLMs remain limited on tasks involving multi-step spatial reasoning or cross-frame consistency.
  • Prompt engineering and few-shot examples boost solver performance on vulnerable tasks but not on hardened ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future advancements in MLLMs could eventually overcome these defenses, suggesting the need for continuous CAPTCHA evolution.
  • These defense principles might extend to other visual security challenges beyond CAPTCHAs.
  • Testing against a broader range of models and real-world deployments would further validate the approach.

Load-bearing premise

The seven evaluated MLLMs and eighteen task types represent the threat surface for visual CAPTCHAs deployed in the wild.

What would settle it

Demonstrating that an advanced MLLM or a new prompting technique can solve the hardened CAPTCHA with a high success rate would falsify the claim of effective defense.

Figures

Figures reproduced from arXiv: 2512.02318 by Changjia Zhu, Junjie Xiong, Junyu Wang, Lingyao Li, Mingkui Wei, Xu He, Yuanbo Zhou.

Figure 1: CAPTCHA robustness evaluation framework against MLLMs.
Figure 3: Cross-model Pass@1 distributions per task type in …
Figure 4: Heatmap of CAPTCHA task difficulty in Exp2 (opti…
Figure 5: Cross-model Pass@1 distributions per task type …
Figure 6: Per-task Pass@1 for GPT-5 (Medium) in Exp1 (origi…
Figure 8: Expected number of API calls until the first success …
Figure 9: Cost and latency trade-offs for GPT-5 (Medium) across CAPTCHA task types.
original abstract

This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. To validate these principles, we perform a case study by hardening a vulnerable CAPTCHA type using our guidelines. We demonstrate that incorporating fine-grained localization and implicit counting reduces the success rate of state-of-the-art MLLMs from over 95% to 0%, confirming that structural changes can effectively mitigate the threat. We conclude by discussing the implications for platform operators who deploy CAPTCHA as part of their abuse-mitigation pipeline. Code Availability (https://anonymous.4open.science/r/Captcha-465E/).
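
The metrics named in the abstract (single-shot accuracy, success under limited retries, and per-solve cost) are related by simple arithmetic if one assumes that attempts are independent and share the same success probability. That independence assumption is ours, not a claim from the paper, and the class and function names below are hypothetical. A minimal sketch:

```python
from dataclasses import dataclass


@dataclass
class SolverStats:
    """Hypothetical aggregate results for one (model, task type) pair."""
    attempts: int         # independent single-shot attempts
    successes: int        # attempts that solved the CAPTCHA
    cost_per_call: float  # average API cost of one attempt, e.g. in USD


def pass_at_1(stats: SolverStats) -> float:
    """Single-shot accuracy: the fraction of attempts that succeed."""
    if stats.attempts <= 0:
        raise ValueError("attempts must be positive")
    return stats.successes / stats.attempts


def success_within(stats: SolverStats, retries: int) -> float:
    """Probability of at least one success in `retries` independent attempts."""
    return 1.0 - (1.0 - pass_at_1(stats)) ** retries


def expected_calls(stats: SolverStats) -> float:
    """Expected attempts until the first success (geometric distribution)."""
    p = pass_at_1(stats)
    return float("inf") if p == 0 else 1.0 / p


def expected_cost_per_solve(stats: SolverStats) -> float:
    """Expected spend per solved CAPTCHA under the same assumption."""
    return expected_calls(stats) * stats.cost_per_call
```

Under this model a task with zero observed successes has an unbounded expected number of calls, which is why a drop to 0% matters economically and not only as an accuracy figure.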

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates seven commercial and open-source MLLMs across 18 real-world visual CAPTCHA task types, reporting single-shot accuracy, retry success, latency, and cost. It examines the effects of task-specific prompt engineering and few-shot demonstrations, identifies why models succeed or fail via reasoning traces, derives defense guidelines, and validates them in a case study showing that adding fine-grained localization and implicit counting reduces SOTA MLLM success from >95% to 0%.

Significance. If the central empirical results hold under consistent attack conditions, the work is significant for abuse-mitigation practice: it supplies concrete measurements of MLLM threat levels on recognition versus reasoning-heavy tasks and demonstrates that modest structural hardening can neutralize current solvers at human-like cost. The multi-model, multi-task design and explicit cost/latency data strengthen its utility for platform operators.

major comments (1)
  1. [Case study section] Case study / abstract claim: the headline result that fine-grained localization plus implicit counting drops success from >95% to 0% is load-bearing for the defense contribution. The manuscript separately demonstrates that task-specific prompt engineering and few-shot demonstrations materially raise solver accuracy on recognition-oriented tasks. It is not stated whether the same optimized prompting regime was applied when evaluating the hardened variant. If the 0% figure reflects only default or weaker prompts, the structural defense has not been stress-tested against the attack surface the authors themselves document.
minor comments (1)
  1. The code-availability statement points to an anonymous repository; the manuscript would benefit from a brief reproducibility note on prompt templates, retry protocols, and statistical controls even if the repository remains anonymous.
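
One statistical control that would fit the reproducibility note is an upper confidence bound on the true success rate behind a reported 0%. If a solver succeeds with probability p on each independent trial, the chance of seeing no successes in n trials is (1 - p)^n, so the largest p still compatible with that outcome at a given confidence level is 1 - (1 - confidence)^(1/n). The sketch below computes that one-sided bound; the number of trials is a hypothetical input, since the trial count is not given here.

```python
def zero_success_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on the success rate after 0 successes.

    Assumes independent trials with a common success probability.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Solve (1 - p) ** trials = alpha for p.
    return 1.0 - alpha ** (1.0 / trials)
```

For example, zero successes in 100 hypothetical trials bounds the true rate at roughly 3% with 95% confidence, so the strength of a "0%" claim depends directly on how many attempts were made.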

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on ensuring the case study is evaluated under the strongest documented attack conditions is well taken, and we address it directly below.

point-by-point responses
  1. Referee: [Case study section] Case study / abstract claim: the headline result that fine-grained localization plus implicit counting drops success from >95% to 0% is load-bearing for the defense contribution. The manuscript separately demonstrates that task-specific prompt engineering and few-shot demonstrations materially raise solver accuracy on recognition-oriented tasks. It is not stated whether the same optimized prompting regime was applied when evaluating the hardened variant. If the 0% figure reflects only default or weaker prompts, the structural defense has not been stress-tested against the attack surface the authors themselves document.

    Authors: We thank the referee for identifying this important clarification point. The case study evaluations of the hardened variant were performed using the same task-specific prompt engineering and few-shot demonstrations that maximized solver accuracy on the corresponding recognition-oriented tasks in the main evaluation. This choice was made precisely to stress-test the structural defense against the strongest attack surface we document. We agree, however, that the manuscript does not explicitly state the prompting regime used for the hardened variant. We will revise the case study section (and the corresponding abstract claim) to make this explicit, including a direct reference to the optimized prompting results from the earlier analysis. No changes to the reported numbers or experimental data are required. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation and case study with direct measurements

full rationale

The paper is an empirical study measuring MLLM success rates on 18 CAPTCHA task types across 7 models, analyzing prompt engineering effects, and validating defense guidelines via a single case study that hardens one task type. No equations, closed-form derivations, fitted parameters, or self-citation chains are present in the provided text. Reported accuracies (e.g., >95% to 0%) are direct experimental outcomes from the described evaluations and hardening, not reductions of predictions to inputs by construction. The work is self-contained against external benchmarks of MLLM performance on visual tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on empirical measurement rather than theoretical derivation. The paper assumes off-the-shelf MLLMs can be prompted effectively for CAPTCHA solving and that the chosen 18 task types cover the relevant attack surface.

axioms (1)
  • domain assumption Off-the-shelf MLLMs with standard prompting can be treated as representative automated CAPTCHA solvers
    Invoked throughout the evaluation of 7 models and the analysis of reasoning traces

pith-pipeline@v0.9.0 · 5580 in / 1317 out tokens · 41206 ms · 2026-05-17T03:19:47.730555+00:00 · methodology
