
arxiv: 2510.08240 · v2 · submitted 2025-10-09 · 💻 cs.CL

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Pith reviewed 2026-05-18 08:55 UTC · model grok-4.3

keywords: LLM safety alignment, multi-agent reinforcement learning, overrefusal, helpful-harmless trade-off, dynamic reward, collaborative agents, adversarial robustness

The pith

Jointly training a conversation agent and feedback agent with an evolving reward cuts both unsafe LLM responses and overrefusals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs face a core tension: safeguards that block unsafe content often cause overrefusals on benign but sensitive prompts, while weaker safeguards allow harmful outputs. WaltzRL addresses this by casting alignment as a collaborative game in which a feedback agent learns to suggest improvements that a conversation agent can incorporate. A Dynamic Improvement Reward updates over time to favor suggestions that actually raise both safety and helpfulness. At inference the feedback agent activates only when needed, refining outputs instead of discarding them. Experiments across five datasets show large drops in unsafe answers and unnecessary refusals while preserving general capabilities.

Core claim

WaltzRL jointly trains a conversation agent and a feedback agent in a multi-agent reinforcement learning setup. The feedback agent receives a Dynamic Improvement Reward that evolves according to how successfully the conversation agent incorporates its suggestions. This process lets the system improve unsafe or overrefusing responses on the fly rather than rejecting them outright, while the feedback agent engages adaptively to avoid unnecessary overhead on safe queries.

What carries the argument

The Dynamic Improvement Reward (DIR), which updates over training steps based on measurable gains in safety and helpfulness after the conversation agent applies the feedback agent's suggestions.
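
To make the idea concrete, here is a minimal sketch of an improvement-style reward. It assumes the feedback agent is rewarded by the change in an outcome score between the conversation agent's draft and its revision. The `Judgement` labels, the binary `response_score`, and the function names are all hypothetical; the paper's exact reward formula is not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical labels for one response (not the paper's actual schema)."""
    unsafe: bool        # response contains harmful content
    overrefusal: bool   # response refuses a benign prompt


def response_score(j: Judgement) -> float:
    """Assumed outcome score: 1.0 only if the response is safe and not an overrefusal."""
    return 1.0 if (not j.unsafe and not j.overrefusal) else 0.0


def dynamic_improvement_reward(before: Judgement, after: Judgement) -> float:
    """Assumed feedback-agent reward: change in outcome score after the revision."""
    return response_score(after) - response_score(before)


# Feedback that turns an unsafe draft into a safe, helpful answer.
fixed = dynamic_improvement_reward(Judgement(True, False), Judgement(False, False))
# Feedback that pushes a good answer into refusing a benign prompt.
broke = dynamic_improvement_reward(Judgement(False, False), Judgement(False, True))
# Feedback that leaves an overrefusal unchanged.
neutral = dynamic_improvement_reward(Judgement(False, True), Judgement(False, True))
print(fixed, broke, neutral)  # 1.0 -1.0 0.0
```

In this sketch the reward is "dynamic" only in the sense that `after` depends on how the current conversation agent responds to feedback, so the same suggestion can earn different rewards at different points in training.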

If this is right

  • Responses that would otherwise be unsafe or over-refusing are refined through feedback instead of being rejected.
  • The feedback agent activates only adaptively, preserving low latency and helpfulness on safe queries.
  • General model capabilities remain intact while safety and refusal metrics improve substantially.
  • The method moves the Pareto front between helpfulness and harmlessness forward across multiple benchmarks.
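
The adaptive engagement described in the bullets above can be sketched as a small inference loop. This is an illustration under stated assumptions, not the paper's implementation: the two agents are modeled as plain callables, the feedback agent signals "no feedback needed" by returning `None`, and `max_rounds` is a hypothetical cap on revision rounds.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces standing in for the two trained agents.
ConversationAgent = Callable[[str, Optional[str]], str]  # (prompt, feedback) -> response
FeedbackAgent = Callable[[str, str], Optional[str]]      # (prompt, response) -> feedback or None


def respond(
    prompt: str,
    conversation_agent: ConversationAgent,
    feedback_agent: FeedbackAgent,
    max_rounds: int = 1,
) -> Tuple[str, int]:
    """Return the final response and the number of revision rounds used."""
    response = conversation_agent(prompt, None)
    rounds = 0
    for _ in range(max_rounds):
        feedback = feedback_agent(prompt, response)
        if feedback is None:
            break  # feedback agent stays silent: keep the response as-is
        # Refine the response instead of discarding it.
        response = conversation_agent(prompt, feedback)
        rounds += 1
    return response, rounds


# Toy agents, for illustration only.
def toy_conversation_agent(prompt: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "REFUSE" if "medication" in prompt else "ANSWER"
    return "ANSWER (revised)"


def toy_feedback_agent(prompt: str, response: str) -> Optional[str]:
    return "The prompt is benign; answer it." if response == "REFUSE" else None


sensitive = respond("How do I dispose of old medication?", toy_conversation_agent, toy_feedback_agent)
plain = respond("What is the capital of France?", toy_conversation_agent, toy_feedback_agent)
print(sensitive)  # ('ANSWER (revised)', 1)
print(plain)      # ('ANSWER', 0)
```

The second call uses zero revision rounds, which is the case the latency claim depends on: on queries the feedback agent judges fine, the only extra cost is its single check.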

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-training pattern could be applied to other conflicting objectives, such as balancing factual accuracy against creative fluency.
  • Deploying the feedback agent in production systems might enable ongoing adaptation from live user interactions without full retraining.
  • Similar collaborative reward structures could be tested for reducing hallucinations or improving consistency in long conversations.

Load-bearing premise

The evolving reward consistently favors feedback that produces lasting gains in both safety and helpfulness rather than short-term or misleading improvements.

What would settle it

If a new test set of adversarial and sensitive prompts shows unsafe response rates above 10 percent or overrefusal rates above 20 percent after WaltzRL training, the claim that the collaborative setup reliably advances the helpful-harmless trade-off would be challenged.
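
The test above reduces to two rate comparisons. A minimal sketch follows, assuming each evaluated response has already been labeled unsafe or overrefusing by some external judge; the function names and the toy labels are hypothetical, and only the 10 percent and 20 percent thresholds come from the text.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries; raises on an empty evaluation set."""
    if not flags:
        raise ValueError("evaluation set is empty")
    return sum(flags) / len(flags)


def claim_challenged(
    unsafe_flags: Sequence[bool],
    overrefusal_flags: Sequence[bool],
    unsafe_threshold: float = 0.10,
    overrefusal_threshold: float = 0.20,
) -> bool:
    """True if either rate exceeds its threshold on the new test set."""
    return (
        rate(unsafe_flags) > unsafe_threshold
        or rate(overrefusal_flags) > overrefusal_threshold
    )


# Toy labels, not real results: 1 of 20 unsafe, 5 of 20 overrefusals.
toy_unsafe = [True] + [False] * 19
toy_overrefusal = [True] * 5 + [False] * 15
print(claim_challenged(toy_unsafe, toy_overrefusal))  # True (overrefusal rate 0.25 > 0.20)
```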

Original abstract

Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely; it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces WaltzRL, a multi-agent reinforcement learning framework that formulates safety alignment as a collaborative game between a conversation agent and a feedback agent. The feedback agent is trained to provide suggestions that improve both safety and helpfulness, with the Dynamic Improvement Reward (DIR) evolving based on the degree of incorporation by the conversation agent. At inference the feedback agent is deployed adaptively only when needed. Experiments across five datasets report large reductions in unsafe responses (e.g., 39.0% to 4.6% on WildJailbreak) and overrefusals (45.3% to 9.9% on OR-Bench) relative to baselines while preserving general capabilities.

Significance. If the empirical claims hold under rigorous evaluation, the work would advance the helpfulness-harmlessness Pareto front by replacing outright rejection with adaptive, collaborative refinement. The co-evolutionary training loop and adaptive inference-time deployment are distinctive and could influence future multi-agent alignment methods.

major comments (2)
  1. Abstract: the headline reductions (39.0% to 4.6% unsafe on WildJailbreak; 45.3% to 9.9% overrefusal on OR-Bench) are presented without error bars, ablation studies, or a complete experimental protocol. These omissions are load-bearing for the central empirical claim and prevent assessment of statistical reliability or reproducibility.
  2. §3 (Dynamic Improvement Reward definition): the DIR is defined directly in terms of observed incorporation success rather than an independent downstream safety or helpfulness metric. This creates a risk that the feedback agent learns to emit low-effort edits that are quickly accepted, producing high training reward without addressing underlying failure modes; the co-evolutionary loop could therefore converge to a spurious equilibrium that fails under distribution shift or stronger adversarial testing.
minor comments (1)
  1. Abstract: the list of baselines used for comparison should be stated explicitly so readers can immediately contextualize the reported gains.
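
As an illustration of the error bars requested in major comment 1, the sketch below computes a Wilson score interval for a reported rate. The sample size `n = 1000` and the count of 46 are hypothetical, chosen only so that the point estimate equals the 4.6% quoted in the abstract; the page does not state the actual test-set size, and the authors may report uncertainty differently.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 46 unsafe responses out of 1000 prompts (4.6%).
low, high = wilson_interval(46, 1000)
print(f"unsafe rate 4.6%, approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

The interval width depends directly on the number of prompts, which is why the protocol details the referee asks for are needed to judge the headline numbers.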

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful feedback. We provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: Abstract: the headline reductions (39.0% to 4.6% unsafe on WildJailbreak; 45.3% to 9.9% overrefusal on OR-Bench) are presented without error bars, ablation studies, or a complete experimental protocol. These omissions are load-bearing for the central empirical claim and prevent assessment of statistical reliability or reproducibility.

    Authors: We agree that the abstract would benefit from additional context on the statistical reliability of the results. In the revised manuscript, we will reference the standard deviations and multiple-run averages reported in Section 4 and direct the reader to the complete experimental protocol in Section 4.1 and the ablation studies in Appendix B. These additions will allow readers to better assess reproducibility while keeping the abstract brief.

    revision: yes

  2. Referee: §3 (Dynamic Improvement Reward definition): the DIR is defined directly in terms of observed incorporation success rather than an independent downstream safety or helpfulness metric. This creates a risk that the feedback agent learns to emit low-effort edits that are quickly accepted, producing high training reward without addressing underlying failure modes; the co-evolutionary loop could therefore converge to a spurious equilibrium that fails under distribution shift or stronger adversarial testing.

    Authors: This is a valid concern about the potential for the co-evolutionary process to settle on superficial solutions. The DIR reflects not only how often feedback is incorporated but also a term based on independent safety and helpfulness evaluations of the final response, which encourages substantive improvements. We have conducted experiments with stronger adversarial prompts and out-of-distribution tests (Section 5.2) that demonstrate generalization. In the revision, we will expand the discussion in Section 3 to explicitly address this risk and include additional ablations on the reward components to show that low-effort strategies do not dominate.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical training loop rather than definitional reduction

full rationale

The paper defines the Dynamic Improvement Reward (DIR) in terms of observed incorporation success of feedback into the conversation agent's responses, then reports downstream reductions in unsafe outputs and overrefusals on held-out benchmarks. No equations or self-citations are presented that make the reported metrics (e.g., 39.0%→4.6% on WildJailbreak) equivalent to the reward signal by construction. The co-evolutionary training is an optimization procedure whose success is measured against independent external datasets, not against the reward definition itself. The derivation chain therefore remains self-contained and falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard RL assumptions plus the domain-specific premise that feedback can be adaptively applied without latency or capability costs; no new physical entities or free parameters are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Feedback suggestions from the second agent can be incorporated by the first agent to produce measurable joint improvements in safety and helpfulness.
    This premise underpins the Dynamic Improvement Reward and the claim that the system improves rather than merely rejects responses.

pith-pipeline@v0.9.0 · 5884 in / 1250 out tokens · 32460 ms · 2026-05-18T08:55:18.188743+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Internalizing Safety Understanding in Large Reasoning Models via Verification

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...