pith. sign in

arxiv: 2603.23559 · v2 · pith:TWOLITLOnew · submitted 2026-03-23 · 💻 cs.CR · cs.AI· cs.CV

CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training

classification 💻 cs.CR cs.AIcs.CV
keywords captchasolvingagentsgeneraldatanativeautomateddevelop
0
0 comments X
read the original abstract

GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. On the other hand, although specialized CAPTCHA solving pipelines exist, they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that solves modern, interactive CAPTCHA challenges while retaining general GUI-agent performance. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving. Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across synthetic and real-world test sets, ReCAP substantially improves CAPTCHA-solving success over its base agents, while maintaining strong performance on general GUI-agent benchmarks.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.