MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination
Pith reviewed 2026-05-25 06:06 UTC · model grok-4.3
The pith
MARGIN learns per-agent calibration factors online from the task stream to fix mis-calibrated confidence in multi-agent foundation model setups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARGIN (Multi Agent Runtime Grading via Incremental Normalization) learns per-agent, per-confidence-band calibration factors from the task stream itself using symmetric exponentially weighted moving averages with Bayesian shrinkage blending. It requires no model access, no held-out data, and no retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, it raises pairwise resolution from 45-56% to 70-89% and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents.
What carries the argument
Symmetric exponentially weighted moving averages with Bayesian shrinkage blending that produce per-agent, per-confidence-band calibration factors updated from the live task stream.
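To make the mechanism concrete, here is a minimal Python sketch of a per-agent, per-confidence-band EWMA with shrinkage. It is an illustration under stated assumptions, not the paper's implementation: the class name `BandCalibrator`, the parameter names `alpha`, `n_bands`, and `shrink_k`, their default values, and the choice to shrink toward the raw reported confidence are all hypothetical. The page states only that the method has three hyperparameters, without naming them.

```python
class BandCalibrator:
    """Hypothetical per-agent, per-confidence-band calibrator (illustrative only)."""

    def __init__(self, alpha=0.1, n_bands=5, shrink_k=10.0):
        # All three values are assumptions, not the paper's defaults.
        self.alpha = alpha        # EWMA rate, same for correct and incorrect outcomes
        self.n_bands = n_bands    # number of equal-width confidence bands on [0, 1]
        self.shrink_k = shrink_k  # pseudo-count controlling shrinkage strength
        self.ewma = {}            # (agent, band) -> EWMA of observed correctness
        self.count = {}           # (agent, band) -> number of observations

    def band(self, confidence):
        # Map a confidence in [0, 1] to a band index; 1.0 falls in the top band.
        return min(int(confidence * self.n_bands), self.n_bands - 1)

    def update(self, agent, confidence, correct):
        key = (agent, self.band(confidence))
        outcome = 1.0 if correct else 0.0
        prev = self.ewma.get(key)
        if prev is None:
            self.ewma[key] = outcome
        else:
            # "Symmetric" is read here as: one alpha regardless of outcome sign.
            self.ewma[key] = (1.0 - self.alpha) * prev + self.alpha * outcome
        self.count[key] = self.count.get(key, 0) + 1

    def calibrated(self, agent, confidence):
        key = (agent, self.band(confidence))
        n = self.count.get(key, 0)
        if n == 0:
            return confidence  # no evidence yet: fall back to the raw report
        # Shrinkage blend: trust the EWMA more as observations accumulate.
        weight = n / (n + self.shrink_k)
        return weight * self.ewma[key] + (1.0 - weight) * confidence


if __name__ == "__main__":
    cal = BandCalibrator()
    for _ in range(20):
        # An overconfident agent: reports 0.9 but is wrong every time.
        cal.update("agent_a", 0.9, correct=False)
    print(cal.calibrated("agent_a", 0.9))
```

In this sketch the output is a blended probability rather than a multiplicative factor; the paper's "calibration factors" may be defined differently.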
If this is right
- Calibration error drops 3-6x versus the strongest design-time baseline under distribution shift.
- Pairwise resolution in selecting which agent to trust rises from 45-56% to 70-89% on hard benchmarks.
- Multi-agent selection can exceed the accuracy of always using the single best model on three of four benchmarks.
- Convergence, tracking speed, and optimality of symmetric updates are guaranteed by six formal propositions for non-strategic agents.
- The method operates with only three hyperparameters that have robust defaults and needs no held-out data.
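The pairwise resolution figures above can be read through a short Python sketch. The definition used here is an assumption, since this page does not define the metric: over every pair of agents on the same task where exactly one is correct, count how often the correct agent reported the higher confidence, with ties scored as half. Under that reading 50% is chance level, which is consistent with the abstract's description of 45-56% as worse than random. The function name `pairwise_resolution` and the input layout are hypothetical.

```python
from itertools import combinations


def pairwise_resolution(tasks):
    """Fraction of discordant agent pairs where the correct agent is more confident.

    tasks: list of tasks; each task is a list of (confidence, correct) tuples,
    one tuple per agent. This layout is an assumption for illustration.
    """
    wins = 0.0
    total = 0
    for responses in tasks:
        for (c1, ok1), (c2, ok2) in combinations(responses, 2):
            if ok1 == ok2:
                continue  # both right or both wrong: the pair is uninformative
            total += 1
            right_conf, wrong_conf = (c1, c2) if ok1 else (c2, c1)
            if right_conf > wrong_conf:
                wins += 1.0
            elif right_conf == wrong_conf:
                wins += 0.5  # ties count as a coin flip
    return wins / total if total else float("nan")


if __name__ == "__main__":
    example = [
        [(0.9, False), (0.6, True)],  # the confident agent is wrong
        [(0.8, True), (0.7, False)],  # the confident agent is right
    ]
    print(pairwise_resolution(example))
```

The same function can be applied to raw and to calibrated confidences to compare the two on identical tasks.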
Where Pith is reading between the lines
- The approach could be tested in single-agent settings where only one model's confidence must be adjusted over time.
- If agents begin reporting confidence strategically to game the updates, the optimality propositions would no longer apply and performance could degrade.
- Because MARGIN needs no model internals, it could be inserted into existing multi-agent orchestration layers with minimal engineering effort.
- The online nature suggests it would continue adapting in non-stationary environments where design-time methods would require periodic re-fitting.
Load-bearing premise
The formal claims on optimality of symmetric updates assume agents do not strategically adapt their confidence reports once calibration begins.
What would settle it
Run the same 50,000+ observations under distribution shift; if calibration error does not fall by a factor of at least 3 relative to the strongest design-time baseline or if pairwise resolution stays below 65% on hard benchmarks, the central performance claim is falsified.
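The test above can be written down as a small check. The sketch below assumes that "calibration error" means the standard expected calibration error (ECE) with equal-width bins; this page does not name the metric, so that choice is an assumption. The helper name `claim_survives` is hypothetical, and the thresholds (a factor of 3, 65% resolution) are taken directly from the criterion stated above.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Standard binned ECE: weighted mean of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, outcome))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


def claim_survives(ece_baseline, ece_margin, resolution,
                   min_ratio=3.0, min_resolution=0.65):
    """True if neither falsification condition from the text is triggered."""
    error_drop_ok = ece_baseline >= min_ratio * ece_margin
    resolution_ok = resolution >= min_resolution
    return error_drop_ok and resolution_ok


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0]))
    print(claim_survives(ece_baseline=0.30, ece_margin=0.08, resolution=0.72))
```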
Original abstract
Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MARGIN, an online calibration method for multi-agent foundation model coordination. It learns per-agent, per-confidence-band factors from the live task stream using symmetric exponentially weighted moving averages with Bayesian shrinkage, requiring no model access, held-out data, or retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, it reports 3-6x lower calibration error than design-time baselines under distribution shift, raises pairwise resolution from 45-56% to 70-89%, and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions on convergence, tracking speed, and optimality of symmetric updates (scoped to non-strategic agents) are presented and illustrated empirically.
Significance. If the empirical results and derivations hold, the work is significant for enabling reliable multi-agent coordination with foundation models in dynamic settings where design-time calibration fails. The large-scale evaluation across many models and benchmarks, combined with formal propositions that are empirically illustrated, provides a strong foundation for the claims.
major comments (1)
- [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.
minor comments (2)
- The three hyperparameters are stated to have robust defaults, but the specific default values and any sensitivity analysis should be reported explicitly (e.g., in a table or appendix) to support reproducibility.
- Ensure the six formal propositions are numbered (e.g., Proposition 1, 2, ...) and cross-referenced in the empirical sections where their predictions are illustrated.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on the scope of our optimality propositions. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.
  Authors: We agree that the formal propositions are explicitly scoped to non-strategic agents, as stated in the manuscript (see Propositions 4–6 and the surrounding text). A full empirical test of strategic adaptation would require a separate experimental framework modeling adversarial or game-theoretic agent behaviors, which is outside the paper's focus on cooperative coordination under standard reporting assumptions. We will therefore add a dedicated paragraph in the revised Discussion section (new Section 6.3) that (i) restates the non-strategic assumption, (ii) outlines plausible mechanisms by which strategic misreporting could erode calibration gains, and (iii) notes that the 3-6x error reduction and 70-89% resolution improvements are not guaranteed under such conditions. This addition will make the boundary conditions of our claims explicit without altering the core technical contributions or requiring new experiments.
  Revision: yes
Circularity Check
No significant circularity
The paper's core claims rest on an online method that updates calibration factors directly from the live task stream using EWMA and Bayesian shrinkage, with no held-out data or pre-fitted parameters. Formal propositions on convergence and optimality are explicitly scoped to non-strategic agents and are illustrated by direct empirical measurements across 19 models and 50k+ observations rather than by construction from the method's own hyperparameters. No self-citation chains, self-definitional loops, or fitted-input-as-prediction reductions appear in the derivation; the performance numbers (3-6x error reduction, resolution lift) are presented as independent measurements.
Axiom & Free-Parameter Ledger
free parameters (1)
- three hyperparameters
axioms (1)
- domain assumption: Agents are non-strategic