Soft-Label Governance for Distributional Safety in Multi-Agent Systems

Aizierjiang Aiersilan; Raeli Savitt

arxiv: 2604.19752 · v1 · submitted 2026-03-19 · cs.MA · cs.AI · cs.CY

Pith reviewed 2026-05-15 09:15 UTC · model grok-4.3

classification: cs.MA · cs.AI · cs.CY

keywords: multi-agent systems, soft labels, governance mechanisms, distributional safety, simulation framework, welfare tradeoffs, toxicity metrics, proxy evaluation

The pith

Soft probabilistic labels show that strict governance cuts multi-agent welfare by over 40% with no safety improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a simulation framework that evaluates multi-agent AI interactions using continuous soft labels for risk instead of binary good or bad classifications. It implements a governance engine with adjustable levers including taxes, circuit breakers, and audits to measure effects on welfare and toxicity through probabilistic metrics. Experiments across seven scenarios demonstrate that aggressive governance interventions produce substantial welfare losses while leaving toxicity levels unchanged. The framework also shows that soft labels can identify cases where agents game conventional binary safety checks. This setup applies directly to systems with live LLM agents and highlights the need for calibrated tradeoffs between system value and risk control.

Core claim

SWARM replaces binary classifications with soft probabilistic labels p = P(v=+1) in [0,1] to enable continuous payoff computation, expected toxicity E[1-p | accepted], and quality gap measurements. In replicated simulations, strict governance reduces welfare by over 40% without safety gains, while full internalization of externalities drops total welfare from a baseline of +262 to -67 with invariant toxicity. Circuit breakers require specific thresholds to avoid severe value loss, and soft metrics detect self-optimizing agents that pass binary evaluations. The same governance layer operates without modification on live LLM-backed agents such as Concordia entities, Claude, and GPT-4o Mini.
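As a minimal sketch of how the two metrics named above could be computed from a log of interactions, assuming each record carries a soft label p and an accept/reject flag. The `Interaction` record and both function names are illustrative assumptions, not SWARM's actual API.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Interaction:
    p: float        # soft label P(v = +1), expected to lie in [0, 1]
    accepted: bool  # whether the interaction was accepted


def expected_toxicity(log):
    """E[1 - p | accepted]: mean probability of harm over accepted interactions."""
    harms = [1.0 - i.p for i in log if i.accepted]
    if not harms:
        raise ValueError("no accepted interactions in log")
    return fmean(harms)


def quality_gap(log):
    """E[p | accepted] - E[p | rejected]."""
    acc = [i.p for i in log if i.accepted]
    rej = [i.p for i in log if not i.accepted]
    if not acc or not rej:
        raise ValueError("need at least one accepted and one rejected interaction")
    return fmean(acc) - fmean(rej)


# Hypothetical toy log, chosen only to make the arithmetic easy to follow.
log = [
    Interaction(0.9, True),
    Interaction(0.7, True),
    Interaction(0.2, False),
    Interaction(0.4, False),
]
tox = expected_toxicity(log)  # (0.1 + 0.3) / 2, about 0.2
gap = quality_gap(log)        # 0.8 - 0.3, about 0.5
```

A negative quality gap would mean that accepted interactions score lower on average than rejected ones.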

What carries the argument

SWARM simulation framework using soft probabilistic labels p = P(v=+1) for continuous risk assessment together with a modular governance engine that applies levers including transaction taxes, circuit breakers, reputation decay, and random audits.
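The four levers can be pictured as one function applied to each interaction. The sketch below is a hypothetical design, not the SWARM engine: the names `GovernanceConfig`, `AgentState`, and `apply_governance`, the default values, the exponential form of the reputation update, and the audit penalty rule are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentState:
    reputation: float = 1.0
    frozen: bool = False


@dataclass
class GovernanceConfig:
    tax_rate: float = 0.05           # fraction of payoff taken as transaction tax
    breaker_threshold: float = 0.3   # freeze the agent when p falls below this
    reputation_decay: float = 0.9    # weight on past reputation; 1.0 means no decay
    audit_prob: float = 0.1          # chance that an interaction is audited
    audit_penalty: float = 1.0       # penalty scale applied to (1 - p) on audit


def apply_governance(state, p, payoff, cfg, rng):
    """Apply all four levers to one interaction and return the net payoff."""
    if state.frozen:
        return 0.0  # a frozen agent earns nothing (no unfreeze rule in this sketch)
    if p < cfg.breaker_threshold:
        state.frozen = True  # circuit breaker trips
        return 0.0
    net = payoff * (1.0 - cfg.tax_rate)  # transaction tax
    # Reputation decay, assumed here to be an exponentially weighted average of p.
    state.reputation = (cfg.reputation_decay * state.reputation
                        + (1.0 - cfg.reputation_decay) * p)
    # Random audit: with probability audit_prob, charge a penalty that grows with risk.
    if rng.random() < cfg.audit_prob:
        net -= cfg.audit_penalty * (1.0 - p)
    return net
```

Keeping every lever as a field on one config object makes single-parameter ablations straightforward: vary one field and hold the rest fixed.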

If this is right

Strict governance policies reduce total system welfare by more than 40% with no corresponding improvement in safety metrics.
Aggressive internalization of system externalities through governance levers collapses aggregate welfare from positive to negative values while toxicity stays unchanged.
Circuit breaker thresholds must be calibrated precisely, as overly restrictive settings sharply reduce system value.
Soft probabilistic metrics identify proxy gaming by agents that pass conventional binary safety evaluations.
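The last point can be made concrete with a toy comparison. In the sketch below, a binary check passes any interaction with p at or above 0.5; the threshold, the two score lists, and the function names are invented for illustration and do not come from the paper.

```python
from statistics import fmean


def binary_pass_rate(ps, threshold=0.5):
    """Fraction of interactions that a binary good/bad check would pass."""
    return sum(p >= threshold for p in ps) / len(ps)


def expected_toxicity(ps):
    """Mean of 1 - p over the given interactions."""
    return fmean(1.0 - p for p in ps)


# Hypothetical soft labels: one agent scores comfortably high,
# the other sits just above the binary cutoff on every interaction.
honest = [0.92, 0.88, 0.95, 0.90]
gamer = [0.52, 0.51, 0.53, 0.52]

same_binary_result = binary_pass_rate(honest) == binary_pass_rate(gamer)  # both pass everything
honest_tox = expected_toxicity(honest)  # about 0.0875
gamer_tox = expected_toxicity(gamer)    # about 0.48
```

Both agents look identical to the binary check, while the soft metric separates them by a wide margin.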

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Ongoing welfare monitoring alongside safety checks may be needed in deployed systems to prevent large unintended value destruction from governance rules.
Soft-label approaches could be tested in other multi-agent domains such as economic markets or robotic teams to uncover similar hidden tradeoffs.
Comparing simulation results directly against real LLM agent runs would test whether proxy-based soft labels generalize to dynamic, long-horizon interactions.

Load-bearing premise

The soft label probability computed from proxy evaluations in simulation accurately reflects true long-term risk in live multi-agent deployments with LLM-backed agents.

What would settle it

Deploy the identical governance levers on live LLM-backed agents and check whether welfare still falls over 40% without toxicity reduction or whether toxicity remains flat under full externality internalization.
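The check described here reduces to a simple decision rule. The sketch below is one hedged way to write it; the function name, the 0.01 toxicity tolerance, and the choice to measure the drop relative to baseline welfare are assumptions, not part of the paper.

```python
def matches_simulation(base_welfare, gov_welfare, base_tox, gov_tox,
                       min_drop=0.40, tox_tol=0.01):
    """True if welfare falls by more than min_drop (relative to baseline)
    while toxicity does not improve by more than tox_tol."""
    if base_welfare <= 0:
        raise ValueError("relative drop needs a positive baseline welfare")
    rel_drop = (base_welfare - gov_welfare) / base_welfare
    no_safety_gain = gov_tox >= base_tox - tox_tol
    return rel_drop > min_drop and no_safety_gain


# Simulated endpoints reported in the Figure 5 caption:
# welfare 262.14 -> -67.51, toxicity about 0.315 at both ends.
fig5 = matches_simulation(262.14, -67.51, 0.315, 0.315)
```

Running the same function on welfare and toxicity measured from live agent runs would give a direct yes or no answer to the question posed above.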

Figures

Figures reproduced from arXiv: 2604.19752 by Aizierjiang Aiersilan, Raeli Savitt.

**Figure 2.** Toxicity and welfare across seven scenarios (error bars: …

**Figure 3.** Risk-Welfare Pareto Frontier. A scatter plot of mean welfare against mean toxicity across simulated governance scenarios in SWARM. The shaded regions denote idealized low risk (toxicity) and high welfare outcomes, visualizing how governance interventions typically trade off these objectives in non-adaptive agents.

**Figure 4.** Epoch-by-epoch toxicity trajectories (averaged over 5 seeds). Most scenarios …

**Figure 5.** Effect of externality internalization ρ. (a) Toxicity is largely invariant to ρ (∼0.315), but welfare decreases monotonically from 262.14 (ρ=0) to −67.51 (ρ=1.0). (b) The welfare–toxicity plot shows a vertical drop without a tradeoff, indicating cost redistribution alone does not improve safety.
… λ=1.0 (meaning no decay). This suggests that penalizing historical reputation broadly demotivates long-term coo…

**Figure 6.** Governance lever ablations (mean ± std, n = 5 seeds). Each panel shows toxicity (red, left axis) and welfare (blue, right axis) as a function of one governance parameter.
7 Validation Experiments and Insights: To further establish the empirical validity of SWARM beyond rule-based actors, we conducted a series of companion studies using complex LLM-backed agents and extended simulation frameworks …

**Figure 7.** Welfare trajectories (averaged over seeds). The threshold dancer scenario achieves …

**Figure 8.** Total proposed and accepted interactions across seven scenarios, illustrating how …

Original abstract

Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels $p = P(v{=}+1) \in [0,1]$, enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity $\mathbb{E}[1{-}p \mid \text{accepted}]$ and quality gap $\mathbb{E}[p \mid \text{accepted}] - \mathbb{E}[p \mid \text{rejected}]$. Across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of $+262$ down to $-67$, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show soft metrics detect proxy gaming by self-optimizing agents passing conventional binary evaluations. This basic governance layer applies to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without modification. Results show distributional safety requires continuous risk metrics and governance lever calibration involves quantifiable safety-welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm-ai.org/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SWARM shows clear welfare penalties from governance levers in soft-label multi-agent sims, but the risk proxies stay unanchored to live LLM outcomes.

The punchline is that SWARM uses soft probabilistic labels to run governance experiments in multi-agent systems and reports that strict controls cut welfare by more than 40 percent with no safety benefit, while some interventions leave toxicity unchanged.

What the paper does well is introduce a modular framework with levers like transaction taxes, circuit breakers, reputation decay, and audits. It computes continuous metrics such as expected toxicity on accepted actions and the quality gap between accepted and rejected ones. The experiments cover seven scenarios with five-seed replication, and the source code is available online. This makes the setup reproducible and lets others tweak the free parameters like tax rates and thresholds.

The soft spots are around validation. The p values are generated from proxy evaluations within the simulation, and the key results like the welfare drop from +262 to -67 or the invariant toxicity are all derived internally. There is no reported comparison of these proxies against independent risk measures on actual multi-turn interactions with LLM agents such as Claude or GPT-4o Mini. The abstract claims direct applicability without modification, but the evidence for that transfer is missing from the presented results.

This work is aimed at researchers developing simulation tools for emergent risks in agent collectives. Readers focused on practical governance design will get value from the configurable engine and the quantified tradeoffs. It deserves a serious referee because the implementation is concrete and the questions about continuous risk metrics are timely, even if additional experiments are needed to anchor the proxies. I recommend sending it to peer review with a request for validation against live agent trajectories.

Referee Report

3 major / 2 minor

Summary. The paper introduces the SWARM simulation framework for multi-agent systems that replaces binary safety labels with soft probabilistic labels p = P(v=+1) to enable continuous payoff computation, toxicity measurement, and governance via levers such as transaction taxes, circuit breakers, reputation decay, and audits. It reports that across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety metrics, while aggressive internalization of externalities collapses welfare from a baseline of +262 to -67 with invariant toxicity; companion experiments claim soft metrics detect proxy gaming better than binary evaluations, and the framework applies directly to LLM-backed agents such as Concordia entities, Claude, and GPT-4o Mini.

Significance. If the simulation results and proxy validity hold, the work provides concrete evidence of quantifiable safety-welfare tradeoffs under continuous risk metrics and demonstrates the value of calibrated governance levers over binary approaches. The public release of source code at https://www.swarm-ai.org/ is a clear strength that supports reproducibility and external scrutiny of the reported numerical outcomes.

major comments (3)

[Abstract] Abstract and simulation results: the headline claims (strict governance reduces welfare >40% with no safety gain; externality internalization drops welfare from +262 to -67 while toxicity stays flat) are computed entirely inside the simulation loop using the same proxy-derived p both for governance decisions and for the reported metrics E[1-p | accepted] and quality gap, creating direct circularity that must be addressed with an external validation benchmark.
[Applicability to LLM Agents] Applicability section: the manuscript asserts that the basic governance layer applies to live LLM-backed agents (Concordia, Claude, GPT-4o Mini) without modification, yet supplies no side-by-side comparison of proxy p versus independent post-hoc risk labels on the same live trajectories, leaving the central safety-welfare tradeoff unanchored for real deployments.
[Simulation Results] Methods and results: the weakest assumption—that p = P(v=+1) computed from proxy evaluations accurately reflects true long-term risk under agent adaptation and multi-turn interaction—is load-bearing for all distributional safety conclusions, but the provided text contains no independent data source or post-simulation validation step to test this assumption.
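The external validation requested in these comments could start with a calibration check of proxy p against independently assigned outcome labels. The sketch below uses the Brier score and a simple binned reliability table; the function names, the bin count, and the toy data are assumptions for illustration, and no such labels are reported in the paper.

```python
def brier_score(ps, labels):
    """Mean squared error between proxy p and independent binary labels
    (1 = beneficial, 0 = harmful). Lower is better; 0 is a perfect match."""
    if not ps or len(ps) != len(labels):
        raise ValueError("ps and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(ps, labels)) / len(ps)


def calibration_table(ps, labels, n_bins=5):
    """Rows of (mean predicted p, observed beneficial rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(ps, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            rows.append((mean_p, rate, len(b)))
    return rows


# Hypothetical proxy scores and post-hoc labels for four interactions.
proxy_p = [0.9, 0.8, 0.2, 0.1]
posthoc = [1, 1, 0, 0]
score = brier_score(proxy_p, posthoc)               # about 0.025
table = calibration_table(proxy_p, posthoc, n_bins=2)
```

A well-calibrated proxy would show observed rates close to the mean predicted p in every bin; large gaps would indicate that the soft labels do not track the independent labels.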

minor comments (2)

[Methods] Clarify the exact definition and computation of the soft label p in the methods section, including how proxy evaluations are aggregated across multiple turns.
[Results] Add a table or figure caption that explicitly lists the seven scenarios and the precise parameter settings used for the circuit-breaker threshold and transaction tax rate.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on circularity in our metrics, applicability claims, and validation of the core proxy assumption. We have revised the manuscript with added limitations discussions, toned-down language, and clarifications to address these points while preserving the simulation-based scope of the work.

Referee: [Abstract] Abstract and simulation results: the headline claims (strict governance reduces welfare >40% with no safety gain; externality internalization drops welfare from +262 to -67 while toxicity stays flat) are computed entirely inside the simulation loop using the same proxy-derived p both for governance decisions and for the reported metrics E[1-p | accepted] and quality gap, creating direct circularity that must be addressed with an external validation benchmark.

Authors: We acknowledge the circularity concern. Within the SWARM model, the probabilistic label p serves as the consistent internal representation of risk for both decision-making and evaluation, which is a deliberate design to study governance effects under distributional metrics. This is not an oversight but a modeling choice to isolate lever impacts. In the revised version, we have added a new Limitations subsection explicitly discussing this internal consistency and recommending external benchmarks for future work. The abstract has been updated to clarify that headline results are simulation-derived.
revision: partial

Referee: [Applicability to LLM Agents] Applicability section: the manuscript asserts that the basic governance layer applies to live LLM-backed agents (Concordia, Claude, GPT-4o Mini) without modification, yet supplies no side-by-side comparison of proxy p versus independent post-hoc risk labels on the same live trajectories, leaving the central safety-welfare tradeoff unanchored for real deployments.

Authors: The claim rests on the modular architecture of the governance engine, which accepts any calibrated p input irrespective of origin. We agree that the absence of direct trajectory comparisons leaves the real-world tradeoff unanchored. The revised applicability section now states that the layer 'is designed to be compatible with' such agents and explicitly notes the need for future empirical validation on live systems to confirm the reported tradeoffs.
revision: partial

Referee: [Simulation Results] Methods and results: the weakest assumption—that p = P(v=+1) computed from proxy evaluations accurately reflects true long-term risk under agent adaptation and multi-turn interaction—is load-bearing for all distributional safety conclusions, but the provided text contains no independent data source or post-simulation validation step to test this assumption.

Authors: The proxy definition of p is the foundational modeling assumption enabling soft-label analysis. We have expanded the Methods section with further justification of the proxy calibration and added sensitivity analyses exploring robustness to p estimation error. The Discussion now includes explicit caveats on the lack of independent validation data and outlines directions for post-simulation checks in extensions.
revision: partial
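One way a sensitivity analysis to p estimation error could be set up is sketched below. It is not the authors' procedure: it assumes acceptance is decided on a noisy estimate of p (Gaussian noise, clipped to [0, 1]) while toxicity is scored on the true p, and the function name, threshold, and data are hypothetical.

```python
import random
from statistics import fmean


def toxicity_under_noise(true_ps, sigma, threshold=0.5, seed=0):
    """Expected toxicity among interactions accepted on a noisy estimate of p.

    Acceptance uses p + N(0, sigma) clipped to [0, 1]; toxicity uses the true p.
    Returns NaN when nothing is accepted.
    """
    rng = random.Random(seed)
    harms = []
    for p in true_ps:
        noisy = min(1.0, max(0.0, p + rng.gauss(0.0, sigma)))
        if noisy >= threshold:
            harms.append(1.0 - p)
    return fmean(harms) if harms else float("nan")


# Hypothetical true labels; sigma = 0 recovers the noise-free baseline.
true_ps = [0.9, 0.6, 0.4, 0.1]
baseline = toxicity_under_noise(true_ps, sigma=0.0)  # accepts 0.9 and 0.6, about 0.25
sweep = {s: toxicity_under_noise(true_ps, sigma=s) for s in (0.05, 0.1, 0.2)}
```

Plotting the sweep against sigma would show how quickly realized toxicity departs from the baseline as estimation error grows.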

Circularity Check

0 steps flagged

No significant circularity; results are simulation outputs

The paper introduces the SWARM simulation framework that defines soft labels p = P(v=+1) as an input and computes metrics such as E[1-p | accepted] and quality gap directly from those labels by design. The headline empirical claims (welfare reductions under strict governance, invariant toxicity under externality internalization) are reported outcomes from running the modular simulation across seven scenarios with five-seed replication, not first-principles derivations or predictions that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results. The internal use of p for both governance decisions and metric computation is a deliberate feature of the framework rather than a circular reduction, and the study remains self-contained as a simulation experiment without external validation claims that loop back tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that proxy-based soft labels can be computed reliably and that the listed governance levers act independently on the simulated agents; no new physical entities are postulated.

free parameters (2)

circuit-breaker threshold: Chosen to balance welfare and toxicity; the abstract states an optimal value exists but does not report its numerical setting.
transaction tax rate: Configurable lever whose specific rates are not enumerated in the abstract.

axioms (1)

domain assumption: Soft label p = P(v=+1) computed from proxy evaluations faithfully represents true risk. Invoked when converting binary proxy outcomes into continuous payoffs and toxicity scores.

pith-pipeline@v0.9.0 · 5633 in / 1437 out tokens · 37429 ms · 2026-05-15T09:15:12.868596+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: soft probabilistic labels p = P(v=+1) ∈ [0,1] ... expected toxicity E[1−p|accepted] and quality gap E[p|accepted]−E[p|rejected]

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: governance levers (transaction taxes, circuit breakers, reputation decay, random audits)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] Honest Agents (H): Optimize for task success. Typically generate ∆task ∼ N(0.8, 0.1). Expected rework n_r and verifier rejections n_v are aggressively bounded near 0.

[2] Opportunistic Agents (O): Aim for maximal immediate π_i, ignoring the externality term E_soft and systemic metrics. Produce moderate task progress N(0.4, 0.4), accepting higher variance in potential rework (n_r ∼ Poisson(1)) to minimize upfront effort.

[3] Deceptive Agents (D): Systematically inflate proxy scores to exploit the evaluation network. Will generate high ∆engage and artificially minimize n_r without creating genuine real value (∆task ≈ 0), resulting in severe adverse selection environments.

[4] Adversarial Agents (A): Actively seek to maximize systemic harm ∑ E_soft. They generate strongly negative signatures (∆task < 0, high rejection rates) to crash the system's overall positive surplus.

[5] Adaptive Adversarial (AA): A variant of A possessing partial state observability. If frozen by the circuit breaker in epoch t−1, they will emit ∆task > 0 momentarily in epoch t purely to rehabilitate p above the threshold before returning to exploitation.

[6] Cautious / Cautious Reciprocator (C / CR): Characterized by severe risk aversion. They maintain strict acceptance thresholds parameter τ_accept. They only engage with entities exhibiting an unblemished interaction history. Reciprocators mimic counterparty strategies over continuous turns (Tit-for-Tat).

[7] Threshold Dancers (TD): Explicitly program themselves to target a proxy score slightly above the freeze limits. If the threshold is θ_CB, they mathematically scale random generation variables until p ≈ θ_CB + ϵ.

C Detailed Configuration Parametrization. To ensure exact reproducibility, Table 8 breaks down the detailed numerical constraints and continuous calibrat...
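The agent descriptions above give explicit distributions for only a few quantities, and those can be sampled directly. The sketch below covers just the stated ones: it assumes the second argument of N(·, ·) is a standard deviation (the excerpt does not say whether it is a standard deviation or a variance), sets honest rework to exactly 0 as a stand-in for "bounded near 0", and picks ϵ = 0.01 arbitrarily. All function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def honest_step(rng):
    """Honest agent: task progress ~ N(0.8, 0.1); rework fixed at 0 (assumption)."""
    return {"delta_task": rng.gauss(0.8, 0.1), "rework": 0}


def opportunistic_step(rng):
    """Opportunistic agent: task progress ~ N(0.4, 0.4), rework ~ Poisson(1)."""
    return {"delta_task": rng.gauss(0.4, 0.4), "rework": sample_poisson(1.0, rng)}


def threshold_dancer_p(theta_cb, epsilon=0.01):
    """Threshold dancer: target a proxy score just above the circuit-breaker threshold."""
    return min(1.0, theta_cb + epsilon)


rng = random.Random(0)
honest_draw = honest_step(rng)
opportunistic_draw = opportunistic_step(rng)
dancer_score = threshold_dancer_p(0.3)  # about 0.31 for a hypothetical threshold of 0.3
```

The deceptive, adversarial, and cautious archetypes are left out because the excerpt gives only qualitative descriptions of their behavior.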
