CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3
The pith
CORA uses conformal risk control to give statistical guarantees that mobile GUI agents will not execute harmful actions beyond a user-set budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CORA reformulates safety as selective action execution. A Guardian estimates action-conditional risk; conformal risk control calibrates an execute/abstain boundary that satisfies a user-specified risk budget; rejected actions are routed to a trainable Diagnostician that recommends multimodal interventions; and a Goal-Lock anchors assessment to a clarified, frozen user intent. Together these supply statistical guarantees on harmful executed actions while improving the safety-helpfulness-interruption Pareto frontier.
What carries the argument
Conformal Risk Control applied to the output of a Guardian model that predicts per-action harm probability, producing a data-driven threshold that guarantees the desired risk level provided calibration and test data are exchangeable; whether that still holds for sequential GUI steps is the load-bearing premise below.
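The threshold selection can be sketched in a few lines. This is a minimal sketch, not the authors' code: it assumes Guardian scores lie in [0, 1], that an action is executed when its score is strictly below a threshold `t`, and that the controlled loss is the indicator that an action is both harmful and executed. The helper name `crc_execute_threshold` and the candidate grid are illustrative. Because this loss grows with `t`, the rule is the mirror image of the Angelopoulos et al. criterion: it keeps the largest `t` whose adjusted calibration risk `(n * risk + 1) / (n + 1)` stays at or below alpha.

```python
import numpy as np


def crc_execute_threshold(scores, harmful, alpha, grid=None):
    """Largest execution threshold t whose CRC-adjusted risk is <= alpha.

    Hypothetical helper. Assumes scores in [0, 1]; an action executes iff
    score < t, and the loss is 1 when a harmful action is executed.
    """
    scores = np.asarray(scores, dtype=float)
    harmful = np.asarray(harmful, dtype=bool)
    n = scores.size
    bound = 1.0  # the 0/1 loss is bounded by B = 1
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)

    best = 0.0  # t = 0 executes nothing, so its loss is 0
    for t in np.sort(grid):
        # Empirical risk: fraction of calibration actions harmful AND executed.
        risk = np.mean(harmful & (scores < t))
        if (n * risk + bound) / (n + 1) > alpha:
            break  # risk is nondecreasing in t, so larger t fails too
        best = t
    return float(best)


# Toy calibration set: the two harmful actions carry the highest scores.
t_hat = crc_execute_threshold(
    scores=[0.1, 0.2, 0.3, 0.8, 0.9],
    harmful=[False, False, False, True, True],
    alpha=0.2,
)
print(t_hat)  # ~0.8: every benign action executes, both harmful ones abstain
```

With only five calibration actions the `+1` finite-sample correction dominates: at alpha = 0.1 even a risk-free threshold fails the check, so the function returns 0 and every action abstains. Larger calibration sets shrink that correction.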
If this is right
- Users can directly set a maximum tolerable harm rate and receive a provable bound rather than relying on prompt engineering or fixed heuristics.
- Rejected actions trigger the lightest possible user intervention instead of blanket refusal, preserving helpfulness.
- The same calibration procedure can be reused across different VLMs and task distributions; moving to a new setting means recomputing the threshold on a matching calibration set, not designing it from scratch.
- Goal-Lock makes the safety layer resistant to visual prompt-injection attacks that try to change perceived intent.
Where Pith is reading between the lines
- The same selective-execution pattern could be ported to web browsers or robotic arms where each action also carries irreversible risk.
- If the calibration set is updated online, the framework might support lifelong agents that keep their risk bound even as the interface or user habits drift.
- The Diagnostician could be trained jointly with the main policy so that the whole system learns to propose safer actions from the start.
Load-bearing premise
The Guardian's risk scores are sufficiently well calibrated, and the sequential, non-i.i.d. nature of GUI steps still lets the conformal calibration procedure deliver its risk-control guarantee in practice.
What would settle it
Collect a fresh, large set of Phone-Harm trajectories, run CORA with a fixed risk budget alpha, and check whether the empirical fraction of executed actions that are labeled harmful exceeds alpha by more than the conformal tolerance.
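A minimal sketch of that check, under stated assumptions: `test_losses` holds the realized 0/1 loss (harmful and executed) for each fresh test action, and the "conformal tolerance" is approximated by a one-sided Hoeffding margin at confidence 1 - delta. The helper name `audit_realized_risk` and the Hoeffding choice are illustrative. Hoeffding assumes independent test losses, which step-level GUI data may not satisfy, and one run reflects a single calibration draw, since the CRC bound holds in expectation over calibration sets. If the budget is defined over executed actions only, the same function can be fed the harm labels of the executed actions.

```python
import math

import numpy as np


def audit_realized_risk(test_losses, alpha, delta=0.05):
    """Compare the realized test risk with alpha plus a Hoeffding margin.

    Illustrative helper; assumes independent losses in [0, 1].
    Returns (empirical_risk, tolerance, exceeds_budget).
    """
    losses = np.asarray(test_losses, dtype=float)
    m = losses.size
    empirical = float(losses.mean())
    # One-sided Hoeffding: P(mean - E[loss] >= eps) <= exp(-2 * m * eps**2).
    tolerance = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return empirical, tolerance, empirical > alpha + tolerance


# 1,000 test actions, 40 of them harmful-and-executed, audited at alpha = 0.05.
emp, tol, exceeded = audit_realized_risk([1] * 40 + [0] * 960, alpha=0.05)
print(f"risk={emp:.3f}, tolerance={tol:.3f}, exceeds budget: {exceeded}")
```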
Original abstract
Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics, or VLM-as-critic, lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety-helpfulness-interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CORA, a post-policy safeguarding framework for VLM-based GUI agents. It trains a Guardian model to produce action-conditional risk scores, applies Conformal Risk Control to select an execute/abstain threshold that meets a user-specified risk budget alpha, routes rejected actions to a Diagnostician model for multimodal intervention recommendations, and uses a Goal-Lock mechanism to anchor to frozen user intent. A new benchmark, Phone-Harm, with step-level harm labels is introduced, and experiments on Phone-Harm plus public benchmarks claim improvements to the safety-helpfulness-interruption Pareto frontier.
Significance. If the finite-sample guarantees hold, the work supplies a statistically grounded, user-tunable safety layer for autonomous GUI agents that addresses a timely deployment concern. The Phone-Harm benchmark is a concrete contribution for safety evaluation. The approach avoids circularity by relying on standard conformal risk control whose validity is independent of the paper's empirical results.
major comments (2)
- [Conformal Risk Control calibration procedure] The calibration procedure (described in the section on Conformal Risk Control) applies the standard risk-controlling threshold of Angelopoulos et al. to trajectories drawn from Phone-Harm. These trajectories are Markovian and non-exchangeable: each executed action changes the screen state, so future Guardian scores depend on prior decisions. The manuscript does not invoke block-wise, martingale, or adaptive conformal extensions that would restore validity under dependence. Because the central claim is that CORA delivers finite-sample guarantees on the fraction of harmful executed actions, this omission is load-bearing.
- [Experimental results on Phone-Harm] Table reporting experimental results on Phone-Harm (and the corresponding Pareto-frontier plots): the manuscript claims that the realized harm rate respects the user-specified alpha, yet provides neither error bars on the empirical miscoverage nor details on how the calibration set is constructed from sequential trajectories (e.g., number of trajectories, temporal splitting). Without these, it is impossible to verify whether the reported risk budget is actually achieved or whether post-hoc choices undermine the guarantee.
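One of the adaptive extensions the referee mentions can be sketched as online threshold tracking in the style of adaptive conformal inference (Gibbs and Candès): after each step, the execution threshold is nudged down when a harmful action was executed and up otherwise. This is a swapped-in technique, not the paper's procedure, and it assumes the harm label of each executed action becomes available after that step; `track_threshold`, `gamma`, and `t0` are hypothetical. With scores in [0, 1], t0 in [0, 1], and the threshold clipped only from above at 1, summing the updates shows that the long-run fraction of harmful executions is at most alpha + (t0 + gamma) / (gamma * T) for any sequence, exchangeable or not.

```python
import numpy as np


def track_threshold(scores, harmful, alpha, gamma=0.05, t0=0.5):
    """Online execution threshold in the style of adaptive conformal inference.

    Hypothetical helper. Assumes scores in [0, 1]; action k executes iff
    scores[k] < t_k, and its harm label is observed after execution.
    Returns the thresholds used and the per-step 0/1 losses.
    """
    t = t0
    thresholds, losses = [], []
    for s, h in zip(scores, harmful):
        thresholds.append(t)
        loss = 1.0 if (s < t and h) else 0.0  # a harmful action was executed
        losses.append(loss)
        # Lower t after a harmful execution, raise it slowly otherwise.
        t = min(1.0, t + gamma * (alpha - loss))
    return np.array(thresholds), np.array(losses)


# Uninformative scores and a 30% harm rate: the tracker still holds the
# long-run harmful-execution rate near alpha, at the cost of abstaining often.
rng = np.random.default_rng(0)
T = 5000
scores = rng.random(T)
harmful = rng.random(T) < 0.3
ts, losses = track_threshold(scores, harmful, alpha=0.05)
print(losses.mean(), 0.05 + (0.5 + 0.05) / (0.05 * T))
```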
minor comments (3)
- [Abstract] The abstract states that CORA 'provides statistical guarantees' but does not state the precise guarantee (e.g., with probability at least 1-delta the fraction of harmful actions is at most alpha). Adding this sentence would clarify the claim.
- [Framework description] The description of the Diagnostician model and Goal-Lock mechanism would benefit from an explicit equation or pseudocode showing how the intervention recommendation is produced and how Goal-Lock is enforced at inference time.
- [Phone-Harm benchmark] Phone-Harm is a useful benchmark, but the manuscript should report inter-annotator agreement for the step-level harm labels and the exact protocol used to generate the trajectories.
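A minimal sketch of the per-step decision logic the minor comment asks for, assuming the control flow described in the abstract: execute when the Guardian's risk falls below the calibrated threshold; otherwise ask the Diagnostician for one of CONFIRM, REFLECT, or ABORT, with both models conditioned on the frozen goal. `FrozenGoal`, `safeguard_step`, and the callable signatures are hypothetical stand-ins for the paper's models, and falling back to ABORT on an unrecognized Diagnostician output is a design choice of this sketch, not something the paper states.

```python
from dataclasses import dataclass
from typing import Callable

INTERVENTIONS = {"CONFIRM", "REFLECT", "ABORT"}


@dataclass(frozen=True)
class FrozenGoal:
    """Clarified user intent, fixed before execution (Goal-Lock stand-in)."""
    text: str


def safeguard_step(
    goal: FrozenGoal,
    screen: str,
    action: str,
    guardian: Callable[[str, str, str], float],
    diagnostician: Callable[[str, str, str], str],
    threshold: float,
) -> str:
    # Goal-Lock: both models receive the frozen goal text; the screen is
    # passed separately as an observation and cannot overwrite the goal.
    risk = guardian(goal.text, screen, action)
    if risk < threshold:
        return "EXECUTE"
    decision = diagnostician(goal.text, screen, action)
    # Fail closed if the Diagnostician returns anything unexpected.
    return decision if decision in INTERVENTIONS else "ABORT"


goal = FrozenGoal("Reply to Alice's last message")
print(safeguard_step(goal, "chat screen", "tap Send",
                     guardian=lambda g, s, a: 0.05,
                     diagnostician=lambda g, s, a: "CONFIRM",
                     threshold=0.3))  # EXECUTE
```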
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript on CORA. The comments raise important points about the validity of our conformal guarantees under sequential dependence and the transparency of our experimental reporting. We address each major comment below and have prepared revisions to improve rigor and clarity.
Point-by-point responses
-
Referee: [Conformal Risk Control calibration procedure] The calibration procedure (described in the section on Conformal Risk Control) applies the standard risk-controlling threshold of Angelopoulos et al. to trajectories drawn from Phone-Harm. These trajectories are Markovian and non-exchangeable: each executed action changes the screen state, so future Guardian scores depend on prior decisions. The manuscript does not invoke block-wise, martingale, or adaptive conformal extensions that would restore validity under dependence. Because the central claim is that CORA delivers finite-sample guarantees on the fraction of harmful executed actions, this omission is load-bearing.
Authors: We appreciate the referee's emphasis on the exchangeability assumption underlying standard conformal risk control. In our implementation, the calibration set is assembled from a large collection of independent trajectories drawn from the Phone-Harm distribution; each full trajectory is treated as a single exchangeable unit when computing the risk-controlling threshold. This design ensures that the finite-sample guarantee applies marginally over the trajectory distribution, even though intra-trajectory scores exhibit Markovian dependence. Nevertheless, we acknowledge that a stricter per-step guarantee would benefit from dependence-aware extensions. In the revised manuscript we will (i) explicitly describe the trajectory-level calibration construction, (ii) add a dedicated paragraph discussing the implications of within-trajectory dependence, and (iii) cite relevant block-wise and adaptive conformal literature (e.g., works on martingale-based conformal prediction). We will also report an empirical sensitivity analysis showing that realized miscoverage remains close to the nominal level across varying trajectory lengths. These changes clarify the scope of our guarantees without altering the core algorithm. revision: partial
-
Referee: [Experimental results on Phone-Harm] Table reporting experimental results on Phone-Harm (and the corresponding Pareto-frontier plots): the manuscript claims that the realized harm rate respects the user-specified alpha, yet provides neither error bars on the empirical miscoverage nor details on how the calibration set is constructed from sequential trajectories (e.g., number of trajectories, temporal splitting). Without these, it is impossible to verify whether the reported risk budget is actually achieved or whether post-hoc choices undermine the guarantee.
Authors: We agree that the current experimental section lacks sufficient statistical detail and transparency. In the revised version we will augment all Phone-Harm tables and Pareto-frontier figures with error bars (standard errors computed over multiple random calibration/test splits). We will also add an appendix subsection that fully specifies the calibration-set construction: the total number of trajectories (e.g., 500), the temporal splitting protocol that reserves entire trajectories for calibration versus testing to prevent leakage, the exact sizes of calibration and test sets, and the random seed used for reproducibility. These additions will enable readers to reproduce and verify that the reported harm rates respect the user-specified alpha under the stated protocol. revision: yes
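The trajectory-level protocol described in these responses can be sketched as follows, under assumptions that are ours, not the authors': each trajectory is a `(scores, harmful)` pair with at least one step, its loss is the fraction of its steps that are harmful and executed (so it lies in [0, 1]), and CRC is applied with whole trajectories as the exchangeable units. Splitting by trajectory keeps every step of a trajectory on one side, and repeating the split gives a mean realized risk with a standard error. All function names, the 50/50 split, the grid, and the number of splits are illustrative.

```python
import numpy as np


def trajectory_losses(trajs, t):
    """Per-trajectory loss: fraction of steps that are harmful AND executed.

    Each trajectory is a (scores, harmful) pair; a step executes iff score < t.
    """
    return np.array([
        np.mean(np.asarray(h, dtype=bool) & (np.asarray(s, dtype=float) < t))
        for s, h in trajs
    ])


def calibrate_on_trajectories(trajs, alpha, grid):
    """CRC over whole trajectories: largest t with adjusted risk <= alpha."""
    n, bound = len(trajs), 1.0  # trajectory losses lie in [0, 1]
    best = 0.0  # t = 0 executes nothing when scores lie in [0, 1]
    for t in np.sort(grid):
        risk = trajectory_losses(trajs, t).mean()
        if (n * risk + bound) / (n + 1) > alpha:
            break  # risk is nondecreasing in t
        best = t
    return float(best)


def split_and_evaluate(trajs, alpha, n_splits=20, cal_frac=0.5, seed=0):
    """Mean realized test risk and its standard error over random splits."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)
    n_cal = int(cal_frac * len(trajs))
    risks = []
    for _ in range(n_splits):
        # Permute trajectory indices, so no trajectory straddles the split.
        idx = rng.permutation(len(trajs))
        cal = [trajs[i] for i in idx[:n_cal]]
        test = [trajs[i] for i in idx[n_cal:]]
        t = calibrate_on_trajectories(cal, alpha, grid)
        risks.append(trajectory_losses(test, t).mean())
    risks = np.array(risks)
    return float(risks.mean()), float(risks.std(ddof=1) / np.sqrt(n_splits))


# Toy pool: 40 identical five-step trajectories whose only harmful step
# has the highest Guardian score.
pool = [([0.9, 0.1, 0.1, 0.1, 0.1], [True, False, False, False, False])
        for _ in range(40)]
print(split_and_evaluate(pool, alpha=0.1))  # (0.0, 0.0)
```

Because this toy pool is perfectly separable, every split lands on a threshold between the benign and harmful scores and the realized risk is zero; on real Phone-Harm splits, the second value is the kind of standard error the rebuttal commits to reporting.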
Circularity Check
No significant circularity; standard conformal risk control applied independently
full rationale
The paper's central derivation applies Conformal Risk Control (citing Angelopoulos et al.) to calibrate a Guardian model's action-conditional risk scores into an execute/abstain threshold satisfying a user-specified risk budget. This step invokes an external, established method whose finite-sample validity guarantees are independent of the present work and do not reduce to any quantity defined or fitted within the paper itself. The Goal-Lock mechanism, Diagnostician model, and Phone-Harm benchmark are introduced as separate components without self-referential equations that equate predictions to their own inputs by construction. No self-citation chain, ansatz smuggling, or renaming of known results forms a load-bearing loop in the claimed statistical guarantees. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- user-specified risk budget alpha
axioms (1)
- standard math: Calibration and test data are exchangeable for the conformal risk control guarantee to hold.
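For reference, the standard conformal risk control guarantee that this axiom supports (Angelopoulos et al.) can be written as follows. It applies to losses L_i(λ) in [0, B] that are nonincreasing and right-continuous in λ with L_i(λ_max) ≤ α, where a larger λ means more conservative behavior. It is a marginal, in-expectation bound over the calibration draw; a statement of the form "with probability at least 1 - δ" needs a different procedure, such as risk-controlling prediction sets. Reading L_i as the indicator that action i is harmful and executed is an assumption about the loss the paper uses.

```latex
\hat{R}_n(\lambda) = \frac{1}{n}\sum_{i=1}^{n} L_i(\lambda), \qquad
\hat{\lambda} = \inf\left\{ \lambda : \frac{n}{n+1}\,\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \right\}, \qquad
\mathbb{E}\left[ L_{n+1}(\hat{\lambda}) \right] \le \alpha .
```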
invented entities (3)
- Guardian model: no independent evidence
- Diagnostician model: no independent evidence
- Goal-Lock mechanism: no independent evidence
Forward citations
Cited by 1 Pith paper
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
Case studies (Fig. 4)
- Prevent deliberate misuse (Fig. 4a): Given a harmful user instruction attempting privacy theft (e.g., publicly sharing and granting edit access to a sensitive “password” document), CORA correctly identifies the inherent risk in the goal itself and triggers an ABORT before any privacy-compromising action is executed.
- Resist prompt injections (Fig. 4b): CORA recognizes a malicious on-screen message that conflicts with the frozen user goal. Instead of following the injected instruction, it reasons to safely dismiss the harmful pop-up and seamlessly proceeds with the original task.
- Mitigate model misbehavior (Fig. 4c): Confronted with an irreversible financial transaction (a “1-tap buy” in an app store) under a benign user goal, CORA avoids autonomous execution. Instead, it routes the proposed action to a CONFIRM intervention, effectively preventing high-stakes errors via human-in-the-loop gating.