MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Joss Armstrong

arxiv: 2605.22949 · v1 · pith:XZNNXOAMnew · submitted 2026-05-21 · 💻 cs.LG · cs.MA

Pith reviewed 2026-05-25 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.MA

keywords confidence calibration · multi-agent systems · foundation models · online learning · distribution shift · runtime calibration · agent coordination · verbalized confidence

The pith

MARGIN learns per-agent calibration factors online from the task stream to fix mis-calibrated confidence in multi-agent foundation model setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARGIN as an online calibration method that updates per-agent and per-confidence-band factors directly from the incoming task stream. Unlike design-time approaches that fit fixed corrections on held-out data and degrade under distribution shift, MARGIN uses symmetric exponentially weighted moving averages combined with Bayesian shrinkage. This requires no model access, no retraining, and only three hyperparameters with robust defaults. Experiments across 19 models, 8 benchmarks, and over 50,000 observations show it reduces calibration error by 3-6x and improves pairwise agent resolution from 45-56% to 70-89%, sometimes beating the always-best-model oracle. A reader would care because it enables reliable selection among agents whose self-reported confidence is often mis-calibrated or even inversely related to accuracy on hard tasks.

Core claim

MARGIN (Multi Agent Runtime Grading via Incremental Normalization) learns per-agent, per-confidence-band calibration factors from the task stream itself using symmetric exponentially weighted moving averages with Bayesian shrinkage blending. It requires no model access, no held-out data, and no retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, it raises pairwise resolution from 45-56% to 70-89% and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents.

What carries the argument

Symmetric exponentially weighted moving averages with Bayesian shrinkage blending that produce per-agent, per-confidence-band calibration factors updated from the live task stream.
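
To make the mechanism concrete, here is a minimal Python sketch of a per-agent, per-confidence-band estimator that combines an EWMA with count-based shrinkage. It is an illustration under stated assumptions, not the paper's update rule: the class name `OnlineCalibrator`, the parameters `alpha`, `prior_strength`, and `num_bands`, the equal-width banding, and the choice to shrink toward the raw reported confidence are all hypothetical, since this page does not reproduce the exact formulas or the three hyperparameters.

```python
from collections import defaultdict


class OnlineCalibrator:
    """Illustrative per-agent, per-band calibrator (assumed form, not the paper's)."""

    def __init__(self, alpha=0.1, prior_strength=10.0, num_bands=5):
        self.alpha = alpha                    # EWMA step size (assumed name)
        self.prior_strength = prior_strength  # pseudo-count for shrinkage (assumed name)
        self.num_bands = num_bands            # number of equal-width confidence bands
        # (agent, band) -> [ewma_of_correctness, observation_count]
        self.cells = defaultdict(lambda: [0.0, 0])

    def _band(self, confidence):
        # Map a confidence in [0, 1] to a band index; 1.0 falls in the top band.
        return min(int(confidence * self.num_bands), self.num_bands - 1)

    def update(self, agent, confidence, correct):
        """Fold one graded observation into the matching (agent, band) cell."""
        cell = self.cells[(agent, self._band(confidence))]
        outcome = 1.0 if correct else 0.0
        if cell[1] == 0:
            cell[0] = outcome
        else:
            # Symmetric update: the same step size for successes and failures.
            cell[0] += self.alpha * (outcome - cell[0])
        cell[1] += 1

    def calibrate(self, agent, confidence):
        """Blend the learned band accuracy with the raw report."""
        cell = self.cells.get((agent, self._band(confidence)))
        if cell is None or cell[1] == 0:
            return confidence  # no evidence yet: pass the raw value through
        weight = cell[1] / (cell[1] + self.prior_strength)
        return weight * cell[0] + (1.0 - weight) * confidence


calibrator = OnlineCalibrator(alpha=0.2, prior_strength=5.0)
for _ in range(50):
    calibrator.update("agent_a", 0.9, correct=False)
adjusted = calibrator.calibrate("agent_a", 0.9)   # pulled well below 0.9
untouched = calibrator.calibrate("agent_b", 0.9)  # unseen agent keeps its raw value
```

The shrinkage weight grows with the number of observations in a cell, so sparsely observed bands stay close to the raw report while well-observed bands follow the stream.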

If this is right

Calibration error drops 3-6x versus the strongest design-time baseline under distribution shift.
Pairwise resolution in selecting which agent to trust rises from 45-56% to 70-89% on hard benchmarks.
Multi-agent selection can exceed the accuracy of always using the single best model on three of four benchmarks.
Convergence, tracking speed, and optimality of symmetric updates are guaranteed by six formal propositions for non-strategic agents.
The method operates with only three hyperparameters that have robust defaults and needs no held-out data.
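
For readers who want the intuition behind the convergence and tracking claims, the standard EWMA identities are sketched below. This is textbook algebra for a generic EWMA, not a reproduction of the paper's propositions; the symbols $a_t$, $y_t$, $\alpha$, $\mu_1$, and $s$ are notation assumed here for illustration.

```latex
% Assumed notation: y_t in {0,1} is the graded outcome at step t,
% a_t is the running estimate, and alpha in (0,1] is the smoothing rate.
a_t = (1-\alpha)\,a_{t-1} + \alpha\,y_t
\quad\Longrightarrow\quad
a_t = (1-\alpha)^{t}\,a_0 + \alpha \sum_{k=1}^{t} (1-\alpha)^{t-k}\,y_k .

% The weights are non-negative and sum to one:
(1-\alpha)^{t} + \alpha \sum_{k=1}^{t} (1-\alpha)^{t-k} = 1 .

% If outcomes have mean \mu_1 from step s+1 onward, the expected gap decays geometrically:
\mathbb{E}[a_t] - \mu_1 = (1-\alpha)^{t-s}\,\bigl(\mathbb{E}[a_s] - \mu_1\bigr), \qquad t \ge s .
```

A larger $\alpha$ shortens the time needed to track a shift but raises the variance of the estimate, which is the usual trade-off any step-size default has to balance.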

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested in single-agent settings where only one model's confidence must be adjusted over time.
If agents begin reporting confidence strategically to game the updates, the optimality propositions would no longer apply and performance could degrade.
Because MARGIN needs no model internals, it could be inserted into existing multi-agent orchestration layers with minimal engineering effort.
The online nature suggests it would continue adapting in non-stationary environments where design-time methods would require periodic re-fitting.
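
The orchestration point can be illustrated with a small selection helper. This is a hedged sketch: `select_response`, the tuple layout, and the multiplicative `factors` table are hypothetical stand-ins, and the real calibration factors may take a different form.

```python
def select_response(candidates, calibrate):
    """Pick the candidate whose calibrated confidence is highest.

    candidates: list of (agent, response, raw_confidence) tuples.
    calibrate:  callable (agent, raw_confidence) -> adjusted confidence.
    """
    return max(candidates, key=lambda item: calibrate(item[0], item[2]))


# Hypothetical per-agent factors; a real system would learn these online.
factors = {"agent_a": 0.5, "agent_b": 1.0}


def scale_by_factor(agent, raw_confidence):
    # Multiplicative correction, clipped to a valid probability.
    return min(1.0, raw_confidence * factors[agent])


candidates = [
    ("agent_a", "answer from A", 0.95),  # overconfident agent
    ("agent_b", "answer from B", 0.70),
]
raw_choice = max(candidates, key=lambda item: item[2])
calibrated_choice = select_response(candidates, scale_by_factor)
```

Here the raw ranking favors `agent_a`, while the adjusted ranking favors `agent_b`; only the confidence values and graded outcomes are needed, not model internals.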

Load-bearing premise

The formal claims on optimality of symmetric updates assume agents do not strategically adapt their confidence reports once calibration begins.

What would settle it

Run the same 50,000+ observations under distribution shift; if calibration error does not fall by a factor of at least 3 relative to the strongest design-time baseline or if pairwise resolution stays below 65% on hard benchmarks, the central performance claim is falsified.
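
Checking this criterion requires two metrics. The sketch below computes expected calibration error (ECE) with equal-width bins, which is the standard definition, and a pairwise resolution score. The resolution definition is an assumption made for illustration, because this page does not state the paper's exact formula: among agent pairs on the same task where exactly one agent is correct, it is the fraction in which the correct agent reports the higher confidence, with ties counted as half.

```python
from itertools import combinations


def expected_calibration_error(confidences, outcomes, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, outcomes):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, 1.0 if hit else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def pairwise_resolution(tasks):
    """tasks: one list per task of (confidence, correct) tuples, one per agent.

    Assumed definition: share of informative pairs (exactly one agent correct)
    where the correct agent has the higher confidence; ties count as 0.5.
    """
    wins = 0.0
    pairs = 0
    for answers in tasks:
        for (conf_a, ok_a), (conf_b, ok_b) in combinations(answers, 2):
            if ok_a == ok_b:
                continue  # both right or both wrong: pair carries no signal
            pairs += 1
            right, wrong = (conf_a, conf_b) if ok_a else (conf_b, conf_a)
            if right > wrong:
                wins += 1.0
            elif right == wrong:
                wins += 0.5
    return wins / pairs if pairs else float("nan")


ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
resolution = pairwise_resolution([
    [(0.9, True), (0.6, False)],   # correct agent is more confident
    [(0.8, False), (0.7, True)],   # correct agent is less confident
])
```

Under this definition 50% corresponds to chance, which matches the page's description of 45-56% as no better than random.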

Figures

Figures reproduced from arXiv: 2605.22949 by Joss Armstrong.

**Figure 2.** Per-model raw ECE on HumanEval (phase 1, mild regime) versus BigCodeBench (phase 2 …

**Figure 3.** Reliability diagrams on the MMLU (STEM → Humanities) shift. Left: raw verbalized confidence is systematically overconfident, with reliability curves lying far below the diagonal across all confidence bins (ECE 7.3% → 18.5% under shift). Right: MARGIN-calibrated confidence tracks the diagonal closely in both phases, reducing ECE by 4× post-shift (2.7% → 4.6%).

**Figure 4.** Phase 2 ECE across all 11 distribution-shift conditions (8 code-generation + 3 QA/math), …

**Figure 5.** Multi-agent selection results. Left: pass@1 (%) across four code-generation benchmarks, …

**Figure 6.** Cross-task calibration transfer (mean across 9–10 cloud models). Left: phase 2 ECE …

**Figure 7.** Robustness of MARGIN to dynamic agent pools. 11 cloud models with full QA and …

**Figure 8.** Ablations across three representative shift conditions (HE …

Original abstract

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARGIN gives a workable online per-agent calibration method that reports strong gains over static baselines on large-scale tests under shift, with the main limits being the narrow scope of its formal claims.

MARGIN stands out for moving calibration to runtime, with per-agent, per-band updates driven by the live task stream using symmetric EWMA and Bayesian shrinkage. The headline empirical results are worth noting first: 3-6x lower calibration error than design-time methods, and pairwise resolution lifted from 45-56% to 70-89%, beating the always-best oracle on three of four benchmarks across 19 models and more than 50k observations. The approach requires no held-out data or model access, which matches real multi-agent deployments where distribution shift is common. The six formal propositions on convergence and symmetric updates add some grounding, and the paper illustrates them empirically as claimed. The work does a clean job of explaining why temperature scaling and similar methods degrade under shift, and then shows the online alternative in action at scale. The citation pattern looks standard for the calibration literature it builds on.

The soft spots are proportionate. The optimality propositions are explicitly limited to non-strategic agents, so any adaptive or gaming behavior by agents would fall outside the stated guarantees. Three hyperparameters are involved even with robust defaults, and while the abstract presents the updates as stream-driven, a referee would still want to see the exact implementation details on data handling and any sensitivity checks. No obvious internal contradictions appear between the method description and the reported measurements.

This paper is for people working on deployed multi-agent foundation model systems where selection accuracy under shift matters. A reader focused on practical calibration fixes would get direct value from the method and the scale of the tests. It deserves a serious referee because the empirical scope is large enough that the claims, if they hold, could affect how people handle agent coordination.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MARGIN, an online calibration method for multi-agent foundation model coordination. It learns per-agent, per-confidence-band factors from the live task stream using symmetric exponentially weighted moving averages with Bayesian shrinkage, requiring no model access, held-out data, or retraining. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, it reports 3-6x lower calibration error than design-time baselines under distribution shift, raises pairwise resolution from 45-56% to 70-89%, and surpasses the always-best-model oracle on three of four benchmarks. Six formal propositions on convergence, tracking speed, and optimality of symmetric updates (scoped to non-strategic agents) are presented and illustrated empirically.

Significance. If the empirical results and derivations hold, the work is significant for enabling reliable multi-agent coordination with foundation models in dynamic settings where design-time calibration fails. The large-scale evaluation across many models and benchmarks, combined with formal propositions that are empirically illustrated, provides a strong foundation for the claims.

major comments (1)

[Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.

minor comments (2)

The three hyperparameters are stated to have robust defaults, but the specific default values and any sensitivity analysis should be reported explicitly (e.g., in a table or appendix) to support reproducibility.
Ensure the six formal propositions are numbered (e.g., Proposition 1, 2, ...) and cross-referenced in the empirical sections where their predictions are illustrated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comment on the scope of our optimality propositions. We address the single major comment below.

Referee: [Abstract] Abstract (final sentence): The optimality propositions for symmetric updates are explicitly limited to non-strategic agents; the manuscript should include a dedicated discussion or experiment testing whether strategic adaptation of confidence reporting by agents could degrade the observed 3-6x calibration gains or the 70-89% resolution lift.

Authors: We agree that the formal propositions are explicitly scoped to non-strategic agents, as stated in the manuscript (see Propositions 4–6 and the surrounding text). A full empirical test of strategic adaptation would require a separate experimental framework modeling adversarial or game-theoretic agent behaviors, which is outside the paper’s focus on cooperative coordination under standard reporting assumptions. We will therefore add a dedicated paragraph in the revised Discussion section (new Section 6.3) that (i) restates the non-strategic assumption, (ii) outlines plausible mechanisms by which strategic misreporting could erode calibration gains, and (iii) notes that the 3–6× error reduction and 70–89% resolution improvements are not guaranteed under such conditions. This addition will make the boundary conditions of our claims explicit without altering the core technical contributions or requiring new experiments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper's core claims rest on an online method that updates calibration factors directly from the live task stream using EWMA and Bayesian shrinkage, with no held-out data or pre-fitted parameters. Formal propositions on convergence and optimality are explicitly scoped to non-strategic agents and are illustrated by direct empirical measurements across 19 models and 50k+ observations rather than by construction from the method's own hyperparameters. No self-citation chains, self-definitional loops, or fitted-input-as-prediction reductions appear in the derivation; the performance numbers (3-6x error reduction, resolution lift) are presented as independent measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on three unspecified hyperparameters with robust defaults and the domain assumption that agents are non-strategic; no new entities are postulated.

free parameters (1)

three hyperparameters
Abstract states the method has three hyperparameters with robust defaults that control the online updates.

axioms (1)

domain assumption Agents are non-strategic
Formal propositions characterize optimality of symmetric updates specifically for non-strategic agents.


[1] [1]

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification.Foundations and Trends in Machine Learning, 16(4):494–591, 2023. arXiv:2107.07511. 21

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Cambridge Univer- sity Press, 2006

Nicol` o Cesa-Bianchi and G´ abor Lugosi.Prediction, Learning, and Games. Cambridge Univer- sity Press, 2006

work page 2006

[3] [3]

FrugalGPT: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research,

Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research,

work page

[4] [4]

Philip Dawid

A. Philip Dawid. The well-calibrated Bayesian.Journal of the American Statistical Association, 77(379):605–610, 1982

work page 1982

[5] [5]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

work page 2024

[6] [6]

Detecting hallucinations in large language models using semantic entropy.Nature, 630:625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630:625–630, 2024

work page 2024

[7] [7]

Foster and Rakesh V

Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration.Biometrika, 85(2):379–390, 1998

work page 1998

[8] [8]

A survey of confidence estimation and calibration in large language models

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 6577–6595, 2024

work page 2024

[9] [9]

Who knows the answer? finding the best model and prompt for each query using confidence-based search

Walter Gerych, Yara Rizk, Vatche Isahagian, Vinod Muthusamy, Evelyn Duesterwald, and Praveen Venkateswaran. Who knows the answer? finding the best model and prompt for each query using confidence-based search. InProceedings of the 38th AAAI Conference on Artificial Intelligence, 2024

work page 2024

[10] [10]

Adaptive conformal inference under distribution shift

Isaac Gibbs and Emmanuel Cand` es. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[11] [11]

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and esti- mation.Journal of the American Statistical Association, 102(477):359–378, 2007

work page 2007

[12] [12]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–1330, 2017

work page 2017

[13] [13]

Stuart Hunter

J. Stuart Hunter. The exponentially weighted moving average.Journal of Quality Technology, 18(4):203–210, 1986

work page 1986

[14] [14]

A survey of trust and reputation systems for online service provision.Decision Support Systems, 43(2):618–644, 2007

Audun Jøsang, Roslan Ismail, and Colin Boyd. A survey of trust and reputation systems for online service provision.Decision Support Systems, 43(2):618–644, 2007

work page 2007

[15] [15]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DaSilva, Eli Elhage, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Kamvar, Mario T

Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. EigenTrust: Reputation management in P2P networks. InProceedings of the 12th International Conference on World Wide Web (WWW), pages 640–651, 2003. 22

work page 2003

[17] [17]

Semantic uncertainty: Linguistic invari- ances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invari- ances for uncertainty estimation in natural language generation. InProceedings of the 11th International Conference on Learning Representations (ICLR), 2023

work page 2023

[18] [18]

Zhang, Elizabeth Black, Michael Luck, Philip Torr, and Michael Wooldridge

Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, and Michael Wooldridge. Large language models miss the multi- agent mark. InAdvances in Neural Information Processing Systems (NeurIPS), Position Track,

work page

[19] [19]

Simple and scalable pre- dictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable pre- dictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2017

work page 2017

[20] [20]

ConfTuner: LLM self-calibration via confidence tuning

Zhiwei Li et al. ConfTuner: LLM self-calibration via confidence tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025

[21] [21]

Uncertainty quantification and confidence calibration in large language models: A survey

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. InProceed- ings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025. arXiv:2503.15850

work page arXiv 2025

[22] [22]

Your pre-trained LLM is secretly an unsupervised confidence calibrator

Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. Your pre-trained LLM is secretly an unsupervised confidence calibrator. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.16690

work page arXiv 2025

[23] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.

[24] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning into quantiles. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015.

[25] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[26] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.

[27] Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, and Soumya Ghosh. Thermometer: Towards universal calibration for large language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.

[28] Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, and Arnu Pretorius. Should we be going MAD? A look at multi-agent debate strategies for LLMs. arXiv preprint arXiv:2311.17371, 2024.

[29] Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv:2305.14975.

[30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024.

[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.

[32] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In COLM 2024, 2024. arXiv:2308.08155.

[33] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024. arXiv:2306.13063.

[34] Yan Zhou and Yanguang Chen. Adaptive heterogeneous multi-agent debate for enhanced educational and factual reasoning in large language models. Journal of King Saud University – Computer and Information Sciences, 2025.

Appendix A Proofs of Formal Properties

Proof of Proposition 1 (EWMA as Exponential Discounting)

We proceed by induction. The base case t = 1 g...