MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Joss Armstrong

arxiv: 2605.22949 · v2 · pith:XZNNXOAMnew · submitted 2026-05-21 · 💻 cs.LG · cs.MA

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Joss Armstrong This is my paper

Pith reviewed 2026-05-25 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.MA

keywords confidence calibrationmulti-agent systemsfoundation modelsonline learningdistribution shiftruntime calibrationagent coordinationverbalized confidence

0 comments

The pith

MARGIN learns per-agent calibration factors online from the task stream to fix mis-calibrated confidence in multi-agent foundation model setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARGIN as an online calibration method that updates per-agent and per-confidence-band factors directly from the incoming task stream. Unlike design-time approaches that fit fixed corrections on held-out data and degrade under distribution shift, MARGIN uses symmetric exponentially weighted moving averages combined with Bayesian shrinkage. This requires no model access, no retraining, and only three hyperparameters with robust defaults. Experiments across 19 models, 8 benchmarks, and over 50,000 observations show it reduces calibration error by 3-6x and improves pairwise agent resolution from 45-56% to 70-89%, sometimes beating the always-best-model oracle. A reader would care because it enables reliable selection among agents whose self-reported confidence is often mis-calibrated or even inversely related to accuracy on hard tasks.

Core claim

MARGIN (Multi Agent Runtime Grading via Incremental Normalization) learns per-agent, per-confidence-band calibration factors from the task stream itself using symmetric exponentially weighted moving averages with Bayesian shrinkage blending. It requires no model access, no held-out data, and no retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, it raises pairwise resolution from 45-56% to 70-89% and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking, and

What carries the argument

Symmetric exponentially weighted moving averages with Bayesian shrinkage blending that produce per-agent, per-confidence-band calibration factors updated from the live task stream.

If this is right

Calibration error drops 3-6x versus the strongest design-time baseline under distribution shift.
Pairwise resolution in selecting which agent to trust rises from 45-56% to 70-89% on hard benchmarks.
Multi-agent selection can exceed the accuracy of always using the single best model on three of four benchmarks.
Convergence, tracking speed, and optimality of symmetric updates are guaranteed by six formal propositions for non-strategic agents.
The method operates with only three hyperparameters that have robust defaults and needs no held-out data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested in single-agent settings where only one model's confidence must be adjusted over time.
If agents begin reporting confidence strategically to game the updates, the optimality propositions would no longer apply and performance could degrade.
Because MARGIN needs no model internals, it could be inserted into existing multi-agent orchestration layers with minimal engineering effort.
The online nature suggests it would continue adapting in non-stationary environments where design-time methods would require periodic re-fitting.

Load-bearing premise

The formal claims on optimality of symmetric updates assume agents do not strategically adapt their confidence reports once calibration begins.

What would settle it

Run the same 50,000+ observations under distribution shift; if calibration error does not fall by a factor of at least 3 relative to the strongest design-time baseline or if pairwise resolution stays below 65% on hard benchmarks, the central performance claim is falsified.

Figures

Figures reproduced from arXiv: 2605.22949 by Joss Armstrong.

**Figure 2.** Figure 2: Per-model raw ECE on HumanEval (phase 1, mild regime) versus BigCodeBench (phase 2 [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗

**Figure 3.** Figure 3: Reliability diagrams on the MMLU (STEM → Humanities) shift. Left: raw verbalized confidence is systematically overconfident, with reliability curves lying far below the diagonal across all confidence bins (ECE 7.3% → 18.5% under shift). Right: MARGIN-calibrated confidence tracks the diagonal closely in both phases, reducing ECE by 4× post-shift (2.7% → 4.6%). 29 [PITH_FULL_IMAGE:figures/full_fig_p029_3.png] view at source ↗

**Figure 4.** Figure 4: Phase 2 ECE across all 11 distribution-shift conditions (8 code-generation + 3 QA/math), [PITH_FULL_IMAGE:figures/full_fig_p030_4.png] view at source ↗

**Figure 5.** Figure 5: Multi-agent selection results. Left: pass@1 (%) across four code-generation benchmarks, [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗

**Figure 6.** Figure 6: Cross-task calibration transfer (mean across 9–10 cloud models). Left: phase 2 ECE [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗

**Figure 7.** Figure 7: Robustness of MARGIN to dynamic agent pools. 11 cloud models with full QA and [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗

**Figure 8.** Figure 8: Ablations across three representative shift conditions (HE [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗

read the original abstract

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi-Agent Runtime Grading via Incremental Normalisation), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 18 foundation models, 8 benchmarks, and over 44,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence fails to beat random at pairwise resolution (43-50%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and closing 37-78% of the Raw-to-Oracle pass@1 gap across the five code-generation benchmarks without any oracle knowledge of which model is strongest. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARGIN gives a workable online per-agent calibration method that reports strong gains over static baselines on large-scale tests under shift, with the main limits being the narrow scope of its formal claims.

read the letter

MARGIN stands out for moving calibration to runtime with per-agent per-band updates driven by the live task stream, using symmetric EWMA and Bayesian shrinkage. The headline empirical results—3-6x lower calibration error than design-time methods and pairwise resolution lifted from 45-56% to 70-89%, beating the always-best oracle on three of four benchmarks across 19 models and 50k observations—are the things worth noting first. The approach requires no held-out data or model access, which matches real multi-agent deployments where distribution shift is common. The six formal propositions on convergence and symmetric updates add some grounding, and the paper illustrates them empirically as claimed. The work does a clean job explaining why temperature scaling and similar methods degrade under shift and then showing the online alternative in action at scale. The citation pattern looks standard for the calibration literature it builds on. The soft spots are proportionate: the optimality propositions are explicitly limited to non-strategic agents, so any adaptive or gaming behavior by agents would fall outside the stated guarantees. Three hyperparameters are involved even with robust defaults, and while the abstract presents the updates as stream-driven, a referee would still want to see the exact implementation details on data handling and any sensitivity checks. No obvious internal contradictions appear between the method description and the reported measurements. This paper is for people working on deployed multi-agent foundation model systems where selection accuracy under shift matters. A reader focused on practical calibration fixes would get direct value from the method and the scale of the tests. It deserves a serious referee because the empirical scope is large enough that the claims, if they hold, could affect how people handle agent coordination.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MARGIN, an online calibration method for multi-agent foundation model coordination. It learns per-agent, per-confidence-band factors from the live task stream using symmetric exponentially weighted moving averages with Bayesian shrinkage, requiring no model access, held-out data, or retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, it reports 3-6x lower calibration error than design-time baselines under distribution shift, raises pairwise resolution from 45-56% to 70-89%, and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions on convergence, tracking speed, and optimality of symmetric updates (scoped to non-strategic agents) are presented and illustrated empirically.

Significance. If the empirical results and derivations hold, the work is significant for enabling reliable multi-agent coordination with foundation models in dynamic settings where design-time calibration fails. The large-scale evaluation across many models and benchmarks, combined with formal propositions that are empirically illustrated, provides a strong foundation for the claims.

major comments (1)

[Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.

minor comments (2)

The three hyperparameters are stated to have robust defaults, but the specific default values and any sensitivity analysis should be reported explicitly (e.g., in a table or appendix) to support reproducibility.
Ensure the six formal propositions are numbered (e.g., Proposition 1, 2, ...) and cross-referenced in the empirical sections where their predictions are illustrated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on the scope of our optimality propositions. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.

Authors: We agree that the formal propositions are explicitly scoped to non-strategic agents, as stated in the manuscript (see Propositions 4–6 and the surrounding text). A full empirical test of strategic adaptation would require a separate experimental framework modeling adversarial or game-theoretic agent behaviors, which is outside the paper’s focus on cooperative coordination under standard reporting assumptions. We will therefore add a dedicated paragraph in the revised Discussion section (new Section 6.3) that (i) restates the non-strategic assumption, (ii) outlines plausible mechanisms by which strategic misreporting could erode calibration gains, and (iii) notes that the 3–6× error reduction and 70–89 % resolution improvements are not guaranteed under such conditions. This addition will make the boundary conditions of our claims explicit without altering the core technical contributions or requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claims rest on an online method that updates calibration factors directly from the live task stream using EWMA and Bayesian shrinkage, with no held-out data or pre-fitted parameters. Formal propositions on convergence and optimality are explicitly scoped to non-strategic agents and are illustrated by direct empirical measurements across 19 models and 50k+ observations rather than by construction from the method's own hyperparameters. No self-citation chains, self-definitional loops, or fitted-input-as-prediction reductions appear in the derivation; the performance numbers (3-6x error reduction, resolution lift) are presented as independent measurements.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on three unspecified hyperparameters with robust defaults and the domain assumption that agents are non-strategic; no new entities are postulated.

free parameters (1)

three hyperparameters
Abstract states the method has three hyperparameters with robust defaults that control the online updates.

axioms (1)

domain assumption Agents are non-strategic
Formal propositions characterize optimality of symmetric updates specifically for non-strategic agents.

pith-pipeline@v0.9.0 · 5774 in / 1249 out tokens · 41001 ms · 2026-05-25T06:06:36.744083+00:00 · methodology

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)