COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers
Pith reviewed 2026-05-17 03:19 UTC · model grok-4.3
The pith
Certain visual CAPTCHA designs using fine-grained localization and implicit counting reduce state-of-the-art MLLM solver success rates from over 95% to zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multimodal LLMs can solve many visual CAPTCHA tasks effectively, yet incorporating fine-grained localization and implicit counting into task design reduces their success rate from over 95% to 0%, providing a concrete way to strengthen defenses.
What carries the argument
Two task-design features carry the argument: fine-grained localization and implicit counting. Both force a model to perform precise spatial analysis and enumeration within the CAPTCHA puzzle.
If this is right
- Platform operators should prioritize CAPTCHA tasks that demand localization and counting to counter MLLM threats.
- Analysis of model reasoning traces can guide the selection and strengthening of specific CAPTCHA types.
- Current MLLMs remain limited on tasks involving multi-step spatial reasoning or cross-frame consistency.
- Prompt engineering and few-shot examples boost solver performance on vulnerable tasks but not on hardened ones.
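To make the two ingredients concrete, here is a minimal sketch of a server-side check that combines them. It is an illustration under stated assumptions, not the hardened design from the paper, which this page does not describe in detail. The `Item` record, the "click every object whose kind appears exactly N times" prompt, and the 6-pixel tolerance are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Item:
    """One rendered object in a hypothetical challenge image."""
    kind: str   # e.g. "star" or "circle"
    x: float    # center x, in pixels
    y: float    # center y, in pixels


def target_items(items: list[Item], required_count: int) -> list[Item]:
    """Implicit counting: the target kind is never named, only its multiplicity."""
    counts = Counter(item.kind for item in items)
    kinds = [kind for kind, count in counts.items() if count == required_count]
    if len(kinds) != 1:
        raise ValueError("exactly one kind must appear required_count times")
    return [item for item in items if item.kind == kinds[0]]


def verify(items: list[Item], required_count: int,
           clicks: list[tuple[float, float]], tolerance_px: float = 6.0) -> bool:
    """Fine-grained localization: each target needs its own click within tolerance_px.

    Assumes targets are spaced farther apart than 2 * tolerance_px, so the
    greedy matching below cannot assign a click to the wrong target.
    """
    targets = target_items(items, required_count)
    if len(clicks) != len(targets):
        return False
    unmatched = list(targets)
    for cx, cy in clicks:
        hit = next(
            (t for t in unmatched if hypot(t.x - cx, t.y - cy) <= tolerance_px),
            None,
        )
        if hit is None:
            return False
        unmatched.remove(hit)
    return True
```

A solver has to get both steps right in this sketch: it must count to work out which kind is the target, then place each click within a few pixels of a target center. A correct label with a coarse click still fails.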
Where Pith is reading between the lines
- Future advancements in MLLMs could eventually overcome these defenses, suggesting the need for continuous CAPTCHA evolution.
- These defense principles might extend to other visual security challenges beyond CAPTCHAs.
- Testing against a broader range of models and real-world deployments would further validate the approach.
Load-bearing premise
The seven evaluated MLLMs and eighteen task types are representative of the threat surface for visual CAPTCHAs deployed in the wild.
What would settle it
Demonstrating that an advanced MLLM or a new prompting technique can solve the hardened CAPTCHA with a high success rate would falsify the claim of effective defense.
Original abstract
This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. To validate these principles, we perform a case study by hardening a vulnerable CAPTCHA type using our guidelines. We demonstrate that incorporating fine-grained localization and implicit counting reduces the success rate of state-of-the-art MLLMs from over 95% to 0%, confirming that structural changes can effectively mitigate the threat. We conclude by discussing the implications for platform operators who deploy CAPTCHA as part of their abuse-mitigation pipeline.
Code availability: https://anonymous.4open.science/r/Captcha-465E/
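The four measurements named in the abstract (single-shot accuracy, success under limited retries, end-to-end latency, per-solve cost) can be aggregated from per-attempt logs roughly as follows. This is a sketch only: the `Attempt` record and its field names are hypothetical, and the paper's exact metric definitions may differ. It assumes a non-empty log in which no further attempts are recorded for a challenge once it has been solved.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Attempt:
    """One solver attempt on one challenge (hypothetical log record)."""
    challenge_id: str
    attempt_index: int   # 0 for the first try, 1 for the first retry, ...
    solved: bool
    latency_s: float     # wall-clock time for this attempt
    cost_usd: float      # API cost for this attempt


def summarize(attempts: list[Attempt], max_retries: int) -> dict[str, float]:
    """Aggregate per-attempt logs into the four headline metrics."""
    by_challenge: dict[str, list[Attempt]] = {}
    for a in attempts:
        by_challenge.setdefault(a.challenge_id, []).append(a)

    first_try_hits = 0
    within_budget_hits = 0
    for tries in by_challenge.values():
        tries.sort(key=lambda a: a.attempt_index)
        budget = tries[: max_retries + 1]   # first try plus allowed retries
        first_try_hits += tries[0].solved
        within_budget_hits += any(a.solved for a in budget)

    n = len(by_challenge)
    solved_count = sum(a.solved for a in attempts)
    total_cost = sum(a.cost_usd for a in attempts)
    return {
        "single_shot_accuracy": first_try_hits / n,
        "success_within_retries": within_budget_hits / n,
        "mean_latency_s": mean(a.latency_s for a in attempts),
        # Failed attempts are paid for too, so they count toward the cost.
        "cost_per_solve_usd": total_cost / solved_count if solved_count else float("inf"),
    }
```

Charging failed attempts to the cost of each successful solve is a design choice made here because it reflects what an attacker actually pays. A per-attempt average would understate that cost on hard tasks.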
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates seven commercial and open-source MLLMs across 18 real-world visual CAPTCHA task types, reporting single-shot accuracy, retry success, latency, and cost. It examines the effects of task-specific prompt engineering and few-shot demonstrations, identifies why models succeed or fail via reasoning traces, derives defense guidelines, and validates them in a case study showing that adding fine-grained localization and implicit counting reduces SOTA MLLM success from >95% to 0%.
Significance. If the central empirical results hold under consistent attack conditions, the work is significant for abuse-mitigation practice: it supplies concrete measurements of MLLM threat levels on recognition versus reasoning-heavy tasks and demonstrates that modest structural hardening can neutralize current solvers at human-like cost. The multi-model, multi-task design and explicit cost/latency data strengthen its utility for platform operators.
major comments (1)
- [Case study section] Case study / abstract claim: the headline result that fine-grained localization plus implicit counting drops success from >95% to 0% is load-bearing for the defense contribution. The manuscript separately demonstrates that task-specific prompt engineering and few-shot demonstrations materially raise solver accuracy on recognition-oriented tasks. It is not stated whether the same optimized prompting regime was applied when evaluating the hardened variant. If the 0% figure reflects only default or weaker prompts, the structural defense has not been stress-tested against the attack surface the authors themselves document.
minor comments (1)
- The code-availability statement points to an anonymous repository; the manuscript would benefit from a brief reproducibility note on prompt templates, retry protocols, and statistical controls even if the repository remains anonymous.
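One statistical control that fits the 0% headline figure is an upper confidence bound on the true success rate, since zero observed successes does not imply a true rate of zero. The sketch below uses the exact one-sided binomial bound for zero successes. The trial counts a reader might plug in are assumptions, because this page does not state how many attempts were run against the hardened variant. The second helper assumes attempts are independent.

```python
def zero_success_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the success rate after 0 successes.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


def success_within_attempts(p_single: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts
```

For large `n_trials` the bound is close to the familiar "rule of three" approximation, 3 / n. Feeding the bound into `success_within_attempts` gives a conservative estimate of what an attacker with a retry budget could still achieve against the hardened task.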
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The feedback on ensuring the case study is evaluated under the strongest documented attack conditions is well taken, and we address it directly below.
Referee: [Case study section] Case study / abstract claim: the headline result that fine-grained localization plus implicit counting drops success from >95% to 0% is load-bearing for the defense contribution. The manuscript separately demonstrates that task-specific prompt engineering and few-shot demonstrations materially raise solver accuracy on recognition-oriented tasks. It is not stated whether the same optimized prompting regime was applied when evaluating the hardened variant. If the 0% figure reflects only default or weaker prompts, the structural defense has not been stress-tested against the attack surface the authors themselves document.
Authors: We thank the referee for identifying this important clarification point. The case study evaluations of the hardened variant were performed using the same task-specific prompt engineering and few-shot demonstrations that maximized solver accuracy on the corresponding recognition-oriented tasks in the main evaluation. This choice was made precisely to stress-test the structural defense against the strongest attack surface we document. We agree, however, that the manuscript does not explicitly state the prompting regime used for the hardened variant. We will revise the case study section (and the corresponding abstract claim) to make this explicit, including a direct reference to the optimized prompting results from the earlier analysis. No changes to the reported numbers or experimental data are required.
revision: yes
Circularity Check
No circularity: empirical evaluation and case study with direct measurements
The paper is an empirical study measuring MLLM success rates on 18 CAPTCHA task types across 7 models, analyzing prompt engineering effects, and validating defense guidelines via a single case study that hardens one task type. No equations, closed-form derivations, fitted parameters, or self-citation chains are present in the provided text. Reported accuracies (e.g., >95% to 0%) are direct experimental outcomes from the described evaluations and hardening, not reductions of predictions to inputs by construction. The work is self-contained against external benchmarks of MLLM performance on visual tasks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Off-the-shelf MLLMs with standard prompting can be treated as representative automated CAPTCHA solvers
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "incorporating fine-grained localization and implicit counting reduces the success rate of state-of-the-art MLLMs from over 95% to 0%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.