Recognition: unknown
Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
Pith reviewed 2026-05-07 07:58 UTC · model grok-4.3
The pith
A bilevel optimization framework for runtime delegation in multi-agent systems proves that higher safety weights yield safer policies, that the inner optimization converges linearly, and that responsibility bounds hold across delegation hops.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Safe Bilevel Delegation (SBD), which casts delegation as a bilevel problem with an outer meta-weight network phi that outputs lambda(s) in [0,1] and an inner loop that optimizes the policy pi subject to P(safe) >= 1-delta. A continuous parameter alpha in [0,1] smoothly controls the degree of authority transferred to sub-agents. We prove that higher outer safety weight produces a weakly safer inner policy, that projected gradient descent on the inner problem converges linearly under standard smoothness assumptions, and that an accountability propagation bound distributes responsibility across multi-hop chains with a per-agent ceiling.
What carries the argument
The Safe Bilevel Delegation (SBD) bilevel optimization, with an outer network learning safety-efficiency weights lambda(s) and an inner policy optimization under a probabilistic safety constraint, using continuous alpha to interpolate delegation authority.
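The structure described above can be written as a bilevel program. This is a reconstruction from the abstract's notation only; the loss terms L_task, L_safe and the inner objective J are placeholder names, not the paper's exact display:

```latex
% Reconstructed sketch of the SBD bilevel problem (placeholder loss names).
\begin{aligned}
\text{(outer)}\quad & \min_{\phi}\;
  \mathbb{E}_{s}\Big[\big(1-\lambda_{\phi}(s)\big)\,L_{\text{task}}\big(\pi^{*}_{\phi}(s)\big)
  + \lambda_{\phi}(s)\,L_{\text{safe}}\big(\pi^{*}_{\phi}(s)\big)\Big],\\
\text{(inner)}\quad & \pi^{*}_{\phi}(s) \in
  \operatorname*{arg\,min}_{\pi}\; J\big(\pi;\,\lambda_{\phi}(s),\,\alpha\big)
  \quad\text{s.t.}\quad \Pr(\text{safe}\mid \pi, s)\ \ge\ 1-\delta,
\end{aligned}
\qquad \lambda_{\phi}(s)\in[0,1],\ \alpha\in[0,1].
```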
Load-bearing premise
The proofs assume standard smoothness conditions for the inner optimization and that safety probabilities can be reliably estimated to enforce the probabilistic constraint.
What would settle it
Observing in a concrete implementation that increasing the outer safety weight fails to produce a weakly safer inner policy, or that the projected gradient descent does not exhibit linear convergence on the inner problem.
read the original abstract
As large language model (LLM) agents are deployed in high-stakes environments, the question of how safely to delegate subtasks to specialized sub-agents becomes critical. Existing work addresses multi-agent architecture selection at design time or provides broad empirical guidelines, but neither provides a runtime mechanism that dynamically adjusts the safety-efficiency trade-off as task context changes during execution. We propose Safe Bilevel Delegation (SBD), a formal framework for runtime delegation safety in hierarchical multi-agent systems. SBD formulates task delegation as a bilevel optimization problem: an outer meta-weight network phi learns context-dependent safety-efficiency weights lambda(s) in [0,1]; an inner loop optimizes the delegation policy pi subject to a probabilistic safety constraint P(safe) >= 1-delta. The continuous delegation degree alpha in [0, 1] controls how much decision authority is transferred to each sub-agent, interpolating smoothly between full human override (alpha=0) and fully autonomous execution (alpha=1). We establish three theoretical results: (1) Safety Monotonicity--higher outer safety weight produces a weakly safer inner policy; (2) Inner Policy Convergence--projected gradient descent on the inner problem converges linearly under standard smoothness assumptions; (3) an Accountability Propagation bound that distributes responsibility across multi-hop delegation chains with a provable per-agent ceiling. We instantiate SBD in three high-stakes domains--medical AI (MIMIC-III), financial risk control (S&P 500), and educational agent supervision (ASSISTments)--specifying datasets, safety constraint sets, baselines, and evaluation protocols. This manuscript presents the formal framework and theoretical results in full; empirical validation following the protocols described herein is planned and will be reported in a forthcoming revision.
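The abstract's continuous delegation degree admits a simple reading when the delegated decision lives in a continuous space. This is a hedged sketch of that reading only; how the paper handles discrete LLM actions is not specified in the abstract:

```python
def delegated_action(human_action, agent_action, alpha):
    """Interpolate decision authority: alpha = 0 is full human override,
    alpha = 1 is fully autonomous sub-agent execution (illustrative only)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * human_action + alpha * agent_action
```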
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Safe Bilevel Delegation (SBD), a formal bilevel optimization framework for runtime delegation safety in hierarchical multi-agent LLM systems. An outer meta-weight network phi learns context-dependent safety-efficiency weights lambda(s) in [0,1]; an inner loop optimizes the delegation policy pi subject to the probabilistic safety constraint P(safe) >= 1-delta, with continuous delegation degree alpha in [0,1] interpolating between human override and full autonomy. The authors claim three theoretical results: (1) Safety Monotonicity (higher outer safety weight yields weakly safer inner policy), (2) Inner Policy Convergence (projected gradient descent on the inner problem converges linearly under standard smoothness assumptions), and (3) an Accountability Propagation bound distributing responsibility across multi-hop chains with a per-agent ceiling. The framework is instantiated with protocols for medical (MIMIC-III), financial (S&P 500), and educational (ASSISTments) domains, but empirical validation is deferred to a forthcoming revision.
Significance. If the three theoretical results hold with verifiable proofs and the framework applies beyond the stated assumptions, SBD would offer a principled runtime mechanism for dynamically trading off safety and efficiency in multi-agent delegation, filling a gap between design-time architecture selection and purely empirical guidelines. The formal guarantees on monotonicity, convergence, and accountability could support safer high-stakes deployments. However, the deferral of all empirical results and the reliance on unspecified smoothness conditions for LLM policies reduce the demonstrated significance; the primary contribution is the formalization itself rather than validated performance.
major comments (3)
- [Abstract / Theoretical Results (2)] Abstract / Theoretical Results (2): The Inner Policy Convergence claim states that projected gradient descent converges linearly 'under standard smoothness assumptions,' but no explicit conditions (Lipschitz gradient, strong convexity, or constraint qualification for the probabilistic safety constraint) are provided. Given that inner policies involve LLM token generation (discrete and non-differentiable), projected GD is unlikely to apply directly, and the result cannot be assessed without the full proof and a discussion of how the assumptions hold or are relaxed for LLM-based policies. This is load-bearing for the second claimed result.
- [Safety Constraint Formulation] Safety Constraint Formulation: The probabilistic constraint P(safe) >= 1-delta is central to the bilevel setup, Safety Monotonicity, and all three results, yet the manuscript provides no method for estimating or enforcing P(safe), no handling of estimation error, and no constraint qualification. In the instantiated domains (MIMIC-III, S&P 500, ASSISTments), safety functions are unlikely to be smooth or easily differentiable, which risks invalidating the monotonicity and convergence claims. The full proof of Safety Monotonicity must address this.
- [Accountability Propagation Bound] Accountability Propagation Bound: The third result claims a bound that distributes responsibility across multi-hop delegation chains with a 'provable per-agent ceiling,' but the abstract and framework description give no theorem statement, derivation, or tightness analysis. Without the explicit bound or assumptions (e.g., on the delegation graph or alpha values), it is impossible to evaluate whether the bound is non-vacuous or useful for the claimed accountability propagation.
minor comments (3)
- [Preliminaries / Theoretical Results] The abstract refers to 'standard smoothness assumptions' without elaboration; this should be stated explicitly in the preliminaries or immediately after the theorem statement, including any references to standard results in bilevel optimization.
- [Instantiation / Domains] While the manuscript states that datasets, safety constraint sets, baselines, and evaluation protocols are specified for the three domains, these details appear only as high-level mentions; they should be expanded in the main text (e.g., a dedicated section or appendix) even if full experiments are deferred.
- [Framework Definition] The meta-weight network phi is introduced as an 'invented entity' without architectural details (e.g., input features for context s, output parameterization of lambda(s)); a brief description or diagram would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript on Safe Bilevel Delegation (SBD). We address each major comment point by point below. The current version establishes the formal framework and theoretical results, with empirical validation planned for a forthcoming revision as stated in the abstract. We will incorporate clarifications and expanded details in the revised manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [Abstract / Theoretical Results (2)] The Inner Policy Convergence claim states that projected gradient descent converges linearly 'under standard smoothness assumptions,' but no explicit conditions (Lipschitz gradient, strong convexity, or constraint qualification for the probabilistic safety constraint) are provided. Given that inner policies involve LLM token generation (discrete and non-differentiable), projected GD is unlikely to apply directly, and the result cannot be assessed without the full proof and a discussion of how the assumptions hold or are relaxed for LLM-based policies. This is load-bearing for the second claimed result.
Authors: We appreciate the referee highlighting the need for explicit conditions. The full manuscript states the convergence result under standard smoothness assumptions in the theoretical analysis, but we agree the abstract is too brief. The assumptions are L-smoothness and μ-strong convexity of the inner objective with respect to the continuous delegation parameter α, along with Slater's condition for the safety constraint. Linear convergence then follows from standard projected gradient descent theory. To address LLM discreteness, the inner optimization operates over the continuous α ∈ [0,1] (with the LLM serving as a fixed oracle for token generation and safety evaluation). We will revise the abstract and add a clarifying paragraph in Section 4 summarizing the assumptions and the continuous relaxation used for LLM policies. revision: yes
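The rebuttal's continuous relaxation (projected GD over alpha in [0,1] with L-smoothness and strong convexity) can be sketched directly; the toy objective below is an assumption chosen only to exhibit the geometric contraction, not the paper's inner objective:

```python
def project(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def inner_pgd(grad, alpha0=0.5, L=8.0, steps=60):
    """Projected gradient descent on the continuous delegation degree
    alpha in [0,1] with the standard 1/L step size; convergence is linear
    (geometric) when the inner objective is L-smooth and strongly convex."""
    a = alpha0
    for _ in range(steps):
        a = project(a - grad(a) / L)
    return a

# Toy strongly convex inner objective J(alpha) = 2*(alpha - 0.3)^2,
# gradient 4*(alpha - 0.3), minimizer alpha* = 0.3.
toy_grad = lambda a: 4.0 * (a - 0.3)
```

With step size 1/L the error contracts by a constant factor per iteration, which is exactly the linear-convergence claim the referee asks to see conditions for.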
-
Referee: [Safety Constraint Formulation] The probabilistic constraint P(safe) >= 1-delta is central to the bilevel setup, Safety Monotonicity, and all three results, yet the manuscript provides no method for estimating or enforcing P(safe), no handling of estimation error, and no constraint qualification. In the instantiated domains (MIMIC-III, S&P 500, ASSISTments), safety functions are unlikely to be smooth or easily differentiable, which risks invalidating the monotonicity and convergence claims. The full proof of Safety Monotonicity must address this.
Authors: We agree that additional detail on the safety constraint is warranted. The manuscript specifies domain-specific safety constraint sets in the instantiation section but does not elaborate on estimation or enforcement. In the revision we will add a dedicated subsection describing Monte Carlo estimation of P(safe) with Hoeffding concentration bounds to control estimation error, Lagrangian relaxation for enforcement in the inner loop, and Slater's condition for qualification. The Safety Monotonicity proof will be expanded to rely on monotonicity of the safety measure with respect to λ (rather than differentiability), which holds for the non-smooth oracles in the medical, financial, and educational domains. revision: yes
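The Monte Carlo plus Hoeffding scheme promised in this response can be sketched as follows; the function names and the conservative lower-bound check are assumptions for illustration, not the paper's implementation:

```python
import math

def estimate_p_safe(rollout_is_safe, n=2000, conf=0.95):
    """Monte Carlo estimate of P(safe) from n independent rollouts, with a
    one-sided Hoeffding lower bound that holds with probability >= conf:
    P(safe) >= p_hat - sqrt(log(1/(1-conf)) / (2n))."""
    p_hat = sum(1 if rollout_is_safe() else 0 for _ in range(n)) / n
    eps = math.sqrt(math.log(1.0 / (1.0 - conf)) / (2.0 * n))
    return p_hat, p_hat - eps

def constraint_satisfied(p_safe_lower, delta=0.05):
    """Conservative check of the inner constraint P(safe) >= 1 - delta,
    applied to the high-confidence lower bound rather than the raw estimate."""
    return p_safe_lower >= 1.0 - delta
```

Checking the lower bound rather than the point estimate is what turns the estimation error into a controlled, one-sided slack in the constraint.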
-
Referee: [Accountability Propagation Bound] The third result claims a bound that distributes responsibility across multi-hop delegation chains with a 'provable per-agent ceiling,' but the abstract and framework description give no theorem statement, derivation, or tightness analysis. Without the explicit bound or assumptions (e.g., on the delegation graph or alpha values), it is impossible to evaluate whether the bound is non-vacuous or useful for the claimed accountability propagation.
Authors: The Accountability Propagation bound appears as Theorem 5.1 in the theoretical results section, with a derivation based on recursive application of the safety constraint along the chain and a per-agent ceiling of δ/(1−max α_i). We acknowledge that the abstract and high-level framework description omit the explicit statement. In the revision we will include the full theorem statement, a proof sketch, and a tightness analysis (showing the bound is achieved for α_i=1 on acyclic delegation graphs) directly in the main text rather than the appendix, along with the required assumptions on the delegation graph and α values. revision: yes
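The ceiling quoted in this response, delta / (1 - max alpha_i), can be computed directly; the recursive split below is an illustrative assumption about how responsibility might propagate along a chain, not the proof of Theorem 5.1:

```python
def per_agent_ceiling(delta, alphas):
    """Per-agent responsibility ceiling delta / (1 - max alpha_i), the form
    quoted in the rebuttal; it blows up as any alpha_i -> 1 (full autonomy)."""
    m = max(alphas)
    return float("inf") if m >= 1.0 else delta / (1.0 - m)

def responsibility_shares(alphas):
    """Illustrative recursive split along a delegation chain: hop i retains
    a (1 - alpha_i) fraction of the responsibility still in play and passes
    the rest downstream. (The recursion is an assumption for illustration.)"""
    shares, remaining = [], 1.0
    for a in alphas:
        shares.append(remaining * (1.0 - a))
        remaining *= a
    return shares, remaining
```

Note that the ceiling degenerates (becomes infinite) exactly at alpha_i = 1, which is consistent with the rebuttal's tightness claim that the bound is achieved at full autonomy.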
Circularity Check
No significant circularity: the theoretical results are derived from the bilevel formulation without reducing to their own inputs or relying on self-citations.
full rationale
The paper formulates SBD as a bilevel optimization with outer meta-weight network phi producing lambda(s) and inner optimization of delegation policy pi under P(safe) >= 1-delta. Safety Monotonicity follows directly from the monotonic effect of the outer safety weight on the inner feasible set. Inner Policy Convergence is stated as a standard result for projected gradient descent under smoothness assumptions that are external to the result itself. The Accountability Propagation bound is derived from the multi-hop chain structure with per-agent ceilings. None of these reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations; the manuscript presents them as formal derivations from the stated bilevel setup. The framework is self-contained against external benchmarks and does not rename known results or smuggle ansatzes via citation.
Axiom & Free-Parameter Ledger
free parameters (1)
- delta
axioms (1)
- domain assumption: standard smoothness assumptions for linear convergence of projected gradient descent on the inner problem
invented entities (1)
- meta-weight network phi (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads
Justice Owusu Agyemang, Jerry John Kponyo, Obed Kwasi Somuah, Elliot Amponsah, Godfred Manu Addo Boakye, and Kwame Opuni-Boachie Obour Agyekum. HiveMind: OS-inspired scheduling for concurrent LLM agent workloads. arXiv:2604.17111,
-
[2]
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, and Masashi Sugiyama. Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers. arXiv:2510.00915,
-
[3]
Addressing the Assessment Challenge with an Online System that Tutors as it Assesses
Mingyu Feng, Neil Heffernan, and Kenneth Koedinger. Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19(3),
-
[4]
AgentCollab: A self-evaluation-driven collaboration paradigm for efficient LLM agents
Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, and Yaoyuan Wang. AgentCollab: A self-evaluation-driven collaboration paradigm for efficient LLM agents. arXiv:2603.26034,
-
[5]
Generalized inner loop meta-learning
Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, and Soumith Chintala. Generalized inner loop meta-learning. arXiv:1910.01727,
-
[6]
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Chenyi Huang, Haoting Zhang, Jingxu Xu, Zeyu Zheng, and Yunduan Lin. Bilevel optimization of agent skills via Monte Carlo tree search. arXiv:2604.15709,
-
[7]
When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail
Xiaoxiao Li. When single-agent with skills replace multi-agent systems and when they fail. arXiv:2601.04748,
-
[8]
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale
Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. Agent skills in the wild: An empirical study of security vulnerabilities at scale. arXiv:2601.10338,
-
[9]
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv:2509.25140,
-
[10]
LDP: An identity-aware protocol for multi-agent LLM systems
Sunil Prakash. LDP: An identity-aware protocol for multi-agent LLM systems. arXiv:2603.08852, 2026a.
Sunil Prakash. The provenance paradox in multi-agent LLM routing: Delegation contracts and attested identity in LDP. arXiv:2603.18043, 2026b.
Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv:1910.01708,
-
[11]
Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
David Schmotz, Sahar Abdelnabi, and Maksym Andriushchenko. Agent skills enable a new class of realistic and trivially simple prompt injections. arXiv:2510.26328,
-
[12]
AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems
Jianshu She. AgentRM: An OS-inspired resource manager for LLM agent systems. arXiv:2603.13110,
-
[13]
Intelligent AI Delegation
Nenad Tomašev, Matija Franklin, and Simon Osindero. Intelligent AI delegation. arXiv:2602.11865,
-
[14]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv:2404.13208,
-
[15]
Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, and Yan Liu. Adaptive collaboration with humans: Metacognitive policy optimization for multi-agent LLMs with continual learning. arXiv:2603.07972,
-
[16]
Meta Context Engineering via Agentic Skill Evolution
Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. arXiv:2601.21557,
-
[17]
Where LLM Agents Fail and How They Can Learn from Failures
Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, and Jiaxun Zhang. Where LLM agents fail and how they can learn from failures. arXiv:2509.25370,
Multi-agent architecture search via agentic supernet. arXiv:2502.04180, 2025.