arxiv: 2605.22949 · v1 · pith:XZNNXOAMnew · submitted 2026-05-21 · 💻 cs.LG · cs.MA

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Pith reviewed 2026-05-25 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.MA
keywords confidence calibration · multi-agent systems · foundation models · online learning · distribution shift · runtime calibration · agent coordination · verbalized confidence

The pith

MARGIN learns per-agent calibration factors online from the task stream to fix mis-calibrated confidence in multi-agent foundation model setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARGIN as an online calibration method that updates per-agent and per-confidence-band factors directly from the incoming task stream. Unlike design-time approaches that fit fixed corrections on held-out data and degrade under distribution shift, MARGIN uses symmetric exponentially weighted moving averages combined with Bayesian shrinkage. This requires no model access, no retraining, and only three hyperparameters with robust defaults. Experiments across 19 models, 8 benchmarks, and over 50,000 observations show it reduces calibration error by 3-6x and improves pairwise agent resolution from 45-56% to 70-89%, sometimes beating the always-best-model oracle. A reader would care because it enables reliable selection among agents whose self-reported confidence is often mis-calibrated or even inversely related to accuracy on hard tasks.

Core claim

MARGIN (Multi Agent Runtime Grading via Incremental Normalization) learns per-agent, per-confidence-band calibration factors from the task stream itself using symmetric exponentially weighted moving averages with Bayesian shrinkage blending. It requires no model access, no held-out data, and no retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, it raises pairwise resolution from 45-56% to 70-89% and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents.

What carries the argument

Symmetric exponentially weighted moving averages with Bayesian shrinkage blending that produce per-agent, per-confidence-band calibration factors updated from the live task stream.
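
A minimal sketch of how such a calibrator might look, under stated assumptions. This page does not reproduce the paper's exact update rule, band boundaries, or hyperparameter defaults, so the class name `OnlineBandCalibrator`, the parameters `alpha`, `prior_strength`, and `n_bands`, and the particular shrinkage form are hypothetical choices for illustration, not the author's implementation. "Symmetric" is read here as using the same EWMA step size for correct and incorrect outcomes.

```python
from collections import defaultdict


class OnlineBandCalibrator:
    """Illustrative per-agent, per-confidence-band calibrator (hypothetical)."""

    def __init__(self, alpha=0.1, prior_strength=10.0, n_bands=5):
        self.alpha = alpha                    # EWMA step size (assumed name and value)
        self.prior_strength = prior_strength  # shrinkage pseudo-count (assumed)
        self.n_bands = n_bands                # equal-width confidence bands (assumed)
        self.ewma = {}                        # (agent, band) -> EWMA of correctness
        self.count = defaultdict(int)         # (agent, band) -> observations seen

    def _band(self, confidence):
        # Map a confidence in [0, 1] to a band index; 1.0 falls in the top band.
        return min(int(confidence * self.n_bands), self.n_bands - 1)

    def update(self, agent, confidence, correct):
        key = (agent, self._band(confidence))
        outcome = 1.0 if correct else 0.0
        if key not in self.ewma:
            self.ewma[key] = outcome  # initialise from the first observation
        else:
            # Same step size for successes and failures (the "symmetric" reading).
            self.ewma[key] += self.alpha * (outcome - self.ewma[key])
        self.count[key] += 1

    def calibrate(self, agent, confidence):
        key = (agent, self._band(confidence))
        n = self.count[key]
        if n == 0:
            return confidence  # no evidence yet: fall back to the raw report
        # Shrinkage: blend the learned band accuracy with the raw report,
        # trusting the learned value more as observations accumulate.
        weight = n / (n + self.prior_strength)
        return weight * self.ewma[key] + (1.0 - weight) * confidence


# Usage: an agent that keeps reporting 0.95 but is always wrong.
cal = OnlineBandCalibrator()
for _ in range(50):
    cal.update("agent_a", 0.95, correct=False)
adjusted = cal.calibrate("agent_a", 0.95)   # pulled far below the raw 0.95
untouched = cal.calibrate("agent_b", 0.70)  # unseen agent keeps its raw value
```

The update needs only the reported confidence and the observed outcome, which is consistent with the claim that no model access or held-out data is required.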

If this is right

  • Calibration error drops 3-6x versus the strongest design-time baseline under distribution shift.
  • Pairwise resolution in selecting which agent to trust rises from 45-56% to 70-89% on hard benchmarks.
  • Multi-agent selection can exceed the accuracy of always using the single best model on three of four benchmarks.
  • Convergence, tracking speed, and optimality of symmetric updates are guaranteed by six formal propositions for non-strategic agents.
  • The method operates with only three hyperparameters that have robust defaults and needs no held-out data.
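
The pairwise resolution figures above can be made concrete with a small sketch. The paper's exact definition is not given on this page, so the one used here is an assumption: among agent pairs on the same task where exactly one agent is correct, the fraction in which the correct agent reported the higher confidence, with ties earning half credit. The function name `pairwise_resolution` and the input layout are hypothetical.

```python
from itertools import combinations


def pairwise_resolution(tasks):
    """Fraction of discordant agent pairs ranked correctly by confidence.

    `tasks` is a list of dicts mapping agent name -> (confidence, correct).
    Assumed definition: only pairs where exactly one agent is correct count;
    confidence ties earn half credit.
    """
    score, total = 0.0, 0
    for reports in tasks:
        for a, b in combinations(sorted(reports), 2):
            (conf_a, ok_a), (conf_b, ok_b) = reports[a], reports[b]
            if ok_a == ok_b:
                continue  # both right or both wrong: nothing to resolve
            right_conf, wrong_conf = (conf_a, conf_b) if ok_a else (conf_b, conf_a)
            total += 1
            if right_conf > wrong_conf:
                score += 1.0
            elif right_conf == wrong_conf:
                score += 0.5
    return score / total if total else float("nan")


tasks = [
    {"a": (0.9, True), "b": (0.6, False)},   # correct agent is more confident
    {"a": (0.8, False), "b": (0.7, True)},   # correct agent is less confident
    {"a": (0.9, True), "b": (0.9, True)},    # concordant pair, skipped
]
resolution = pairwise_resolution(tasks)
```

Under this reading, a value near 50% means confidence carries no information about which agent to trust, which is why the reported 45-56% range for raw confidence is described as no better than random.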

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested in single-agent settings where only one model's confidence must be adjusted over time.
  • If agents begin reporting confidence strategically to game the updates, the optimality propositions would no longer apply and performance could degrade.
  • Because MARGIN needs no model internals, it could be inserted into existing multi-agent orchestration layers with minimal engineering effort.
  • The online nature suggests it would continue adapting in non-stationary environments where design-time methods would require periodic re-fitting.
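
The last point can be illustrated with a toy stream, assuming nothing about the paper's actual data. An estimate frozen after phase 1 stays wrong once the distribution shifts, while a plain EWMA keeps tracking. The function name `ewma_track`, the step size, and the deterministic stream are all hypothetical.

```python
def ewma_track(outcomes, alpha=0.1, start=0.5):
    """Return the running EWMA estimate after each outcome."""
    estimate = start
    history = []
    for outcome in outcomes:
        estimate += alpha * (outcome - estimate)
        history.append(estimate)
    return history


# Toy shift: the agent is always correct in phase 1 and always wrong in phase 2.
phase1 = [1.0] * 50
phase2 = [0.0] * 50

frozen = sum(phase1) / len(phase1)     # design-time estimate fitted on phase 1 only
online = ewma_track(phase1 + phase2)   # runtime estimate updated on every outcome

frozen_error = abs(frozen - 0.0)       # error against the true phase-2 accuracy
online_error = abs(online[-1] - 0.0)
```

Here the frozen estimate keeps predicting phase-1 accuracy indefinitely, whereas the EWMA error decays geometrically at rate `1 - alpha` per observation after the shift.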

Load-bearing premise

The formal claims on optimality of symmetric updates assume agents do not strategically adapt their confidence reports once calibration begins.

What would settle it

Run the same 50,000+ observations under distribution shift; if calibration error does not fall by a factor of at least 3 relative to the strongest design-time baseline or if pairwise resolution stays below 65% on hard benchmarks, the central performance claim is falsified.
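
As a sketch of how this check could be scripted: the first function is the standard binned expected calibration error (ECE), and the second applies the two thresholds stated above (a factor of 3 and 65%). The helper name `claim_survives`, its parameter names, and the example numbers are hypothetical, and the paper's own binning may differ.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Standard binned ECE: sample-weighted mean |accuracy - confidence| gap."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def claim_survives(ece_baseline, ece_margin, hard_resolution,
                   min_ratio=3.0, min_resolution=0.65):
    """True if both thresholds from the text above are met."""
    # Written without division so a zero calibrated ECE is handled safely.
    return ece_baseline >= min_ratio * ece_margin and hard_resolution >= min_resolution


# Four reports of 0.9 confidence, half of them correct: ECE is |0.5 - 0.9|.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])

# Made-up numbers, only to exercise the check.
passes = claim_survives(ece_baseline=0.12, ece_margin=0.03, hard_resolution=0.72)
fails = claim_survives(ece_baseline=0.12, ece_margin=0.06, hard_resolution=0.72)
```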

Figures

Figures reproduced from arXiv: 2605.22949 by Joss Armstrong.

Figure 1: MARGIN pipeline. Each agent’s raw confidence …
Figure 2: Per-model raw ECE on HumanEval (phase 1, mild regime) versus BigCodeBench (phase 2 …
Figure 3: Reliability diagrams on the MMLU (STEM → Humanities) shift. Left: raw verbalized confidence is systematically overconfident, with reliability curves lying far below the diagonal across all confidence bins (ECE 7.3% → 18.5% under shift). Right: MARGIN-calibrated confidence tracks the diagonal closely in both phases, reducing ECE by 4× post-shift (2.7% → 4.6%).
Figure 4: Phase 2 ECE across all 11 distribution-shift conditions (8 code-generation + 3 QA/math), …
Figure 5: Multi-agent selection results. Left: pass@1 (%) across four code-generation benchmarks, …
Figure 6: Cross-task calibration transfer (mean across 9–10 cloud models). Left: phase 2 ECE …
Figure 7: Robustness of MARGIN to dynamic agent pools. 11 cloud models with full QA and …
Figure 8: Ablations across three representative shift conditions (HE …
Original abstract

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
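
The coordinator's decision described in the abstract can be sketched as an argmax over confidence, either raw or adjusted. This is an illustrative toy, not the paper's selection rule: the function name `select_agent`, the `calibrate` callback signature, and the numbers in `learned` and `reports` are hypothetical.

```python
def select_agent(reports, calibrate):
    """Pick the agent whose (possibly adjusted) confidence is highest.

    `reports` maps agent name -> raw confidence in [0, 1];
    `calibrate(agent, confidence)` returns the adjusted confidence.
    """
    return max(reports, key=lambda agent: calibrate(agent, reports[agent]))


# Hypothetical learned accuracies: agent "a" is overconfident, agent "b" is not.
learned = {"a": 0.40, "b": 0.75}
reports = {"a": 0.95, "b": 0.80}

raw_choice = select_agent(reports, lambda agent, conf: conf)
calibrated_choice = select_agent(reports, lambda agent, conf: learned[agent])
```

With raw confidence the coordinator trusts the overconfident agent; once the adjusted values reflect observed accuracy, the choice flips to the more reliable one.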

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MARGIN, an online calibration method for multi-agent foundation model coordination. It learns per-agent, per-confidence-band factors from the live task stream using symmetric exponentially weighted moving averages with Bayesian shrinkage, requiring no model access, held-out data, or retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, it reports 3-6x lower calibration error than design-time baselines under distribution shift, raises pairwise resolution from 45-56% to 70-89%, and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions on convergence, tracking speed, and optimality of symmetric updates (scoped to non-strategic agents) are presented and illustrated empirically.

Significance. If the empirical results and derivations hold, the work is significant for enabling reliable multi-agent coordination with foundation models in dynamic settings where design-time calibration fails. The large-scale evaluation across many models and benchmarks, combined with formal propositions that are empirically illustrated, provides a strong foundation for the claims.

major comments (1)
  1. [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.
minor comments (2)
  1. The three hyperparameters are stated to have robust defaults, but the specific default values and any sensitivity analysis should be reported explicitly (e.g., in a table or appendix) to support reproducibility.
  2. Ensure the six formal propositions are numbered (e.g., Proposition 1, 2, ...) and cross-referenced in the empirical sections where their predictions are illustrated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on the scope of our optimality propositions. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.

    Authors: We agree that the formal propositions are explicitly scoped to non-strategic agents, as stated in the manuscript (see Propositions 4–6 and the surrounding text). A full empirical test of strategic adaptation would require a separate experimental framework modeling adversarial or game-theoretic agent behaviors, which is outside the paper’s focus on cooperative coordination under standard reporting assumptions. We will therefore add a dedicated paragraph in the revised Discussion section (new Section 6.3) that (i) restates the non-strategic assumption, (ii) outlines plausible mechanisms by which strategic misreporting could erode calibration gains, and (iii) notes that the 3–6× error reduction and 70–89% resolution improvements are not guaranteed under such conditions. This addition will make the boundary conditions of our claims explicit without altering the core technical contributions or requiring new experiments.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper's core claims rest on an online method that updates calibration factors directly from the live task stream using EWMA and Bayesian shrinkage, with no held-out data or pre-fitted parameters. Formal propositions on convergence and optimality are explicitly scoped to non-strategic agents and are illustrated by direct empirical measurements across 19 models and 50k+ observations rather than by construction from the method's own hyperparameters. No self-citation chains, self-definitional loops, or fitted-input-as-prediction reductions appear in the derivation; the performance numbers (3-6x error reduction, resolution lift) are presented as independent measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on three unspecified hyperparameters with robust defaults and the domain assumption that agents are non-strategic; no new entities are postulated.

free parameters (1)
  • three hyperparameters
    Abstract states the method has three hyperparameters with robust defaults that control the online updates.
axioms (1)
  • Domain assumption: agents are non-strategic
    Formal propositions characterize optimality of symmetric updates specifically for non-strategic agents.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

  1. Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. Foundations and Trends in Machine Learning, 16(4):494–591, 2023. arXiv:2107.07511.
  2. Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  3. Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research,
  4. A. Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
  5. Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
  6. Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024.
  7. Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
  8. Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 6577–6595, 2024.
  9. Walter Gerych, Yara Rizk, Vatche Isahagian, Vinod Muthusamy, Evelyn Duesterwald, and Praveen Venkateswaran. Who knows the answer? Finding the best model and prompt for each query using confidence-based search. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, 2024.
  10. Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  11. Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  12. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–1330, 2017.
  13. J. Stuart Hunter. The exponentially weighted moving average. Journal of Quality Technology, 18(4):203–210, 1986.
  14. Audun Jøsang, Roslan Ismail, and Colin Boyd. A survey of trust and reputation systems for online service provision. Decision Support Systems, 43(2):618–644, 2007.
  15. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DaSilva, Eli Elhage, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
  16. Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. EigenTrust: Reputation management in P2P networks. In Proceedings of the 12th International Conference on World Wide Web (WWW), pages 640–651, 2003.
  17. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.
  18. Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, and Michael Wooldridge. Large language models miss the multi-agent mark. In Advances in Neural Information Processing Systems (NeurIPS), Position Track,
  19. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  20. Zhiwei Li et al. ConfTuner: LLM self-calibration via confidence tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
  21. Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025. arXiv:2503.15850.
  22. Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. Your pre-trained LLM is secretly an unsupervised confidence calibrator. In Advances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.16690.
  23. Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.
  24. Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning into quantiles. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015.
  25. Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  26. John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
  27. Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
  28. Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, and Arnu Pretorius. Should we be going MAD? A look at multi-agent debate strategies for LLMs. arXiv preprint arXiv:2311.17371, 2024.
  29. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv:2305.14975.
  30. Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024.
  31. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.
  32. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In COLM 2024, 2024. arXiv:2308.08155.
  33. Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024. arXiv:2306.13063.
  34. Yan Zhou and Yanguang Chen. Adaptive heterogeneous multi-agent debate for enhanced educational and factual reasoning in large language models. Journal of King Saud University – Computer and Information Sciences, 2025.

Appendix A: Proofs of Formal Properties

Proof of Proposition 1 (EWMA as Exponential Discounting). We proceed by induction. The base case t = 1 g...