
arXiv: 2605.18858 · v1 · pith:MV2LDSYZnew · submitted 2026-05-14 · cs.LG · cs.AI · cs.GT · stat.ML

When Individually Calibrated Models Become Collectively Miscalibrated

Pith reviewed 2026-05-20 20:17 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.GT · stat.ML
keywords: calibration · Brier score · Price of Anarchy · multi-agent · aggregation · probabilistic forecasting · machine learning

The pith

Individually calibrated models become collectively miscalibrated under Brier-score aggregation with correlated beliefs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when multiple agents report probability estimates to minimize their individual Brier scores and their beliefs are positively correlated due to shared data, the reports systematically underestimate the positive-class probability. This causes the aggregated prediction to be miscalibrated even if each individual report is calibrated. A reader would care because this challenges the common practice of assuming individual calibration ensures good aggregate performance in systems that combine multiple models. The analysis includes a bound on the resulting Price of Anarchy and demonstrates that VCG aggregation avoids the problem by aligning incentives.

Core claim

Under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy greater than one whenever Cov(b_i, b_j) > 0. In the canonical setting with n=5, pairwise correlation=0.5, base rate=0.3, the empirically measured PoA in false-negative rate reaches 7.25x. VCG-based aggregation aligns incentives and achieves dominant-strategy incentive compatibility.

What carries the argument

The strategic best response to Brier-score aggregation: agents optimize their local scores without coordination, which leads to underestimation when beliefs covary positively.

If this is right

  • Each agent's report underestimates the positive-class probability under positive covariance.
  • The aggregate shows a higher false-negative rate, with a measured PoA of 7.25× in the canonical example.
  • VCG aggregation is dominant-strategy incentive compatible and maintains comparable accuracy on the three real-world datasets.
  • Adaptive weighting improves performance under distribution shift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar miscalibration could occur in other aggregation rules if they do not account for strategic reporting.
  • Monitoring correlations between model predictions could help detect potential collective miscalibration.
  • Extending this to non-probabilistic settings or different loss functions might reveal analogous incentive issues.
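
The correlation-monitoring suggestion in the list above could be sketched as follows. This is an illustrative sketch, not something taken from the paper: the function name `mean_pairwise_correlation` and the toy data (a shared signal plus independent noise per model) are assumptions. It computes the Pearson correlation matrix of the models' predicted probabilities and averages the off-diagonal entries; it needs at least two models.

```python
import numpy as np

def mean_pairwise_correlation(reports):
    """Average off-diagonal Pearson correlation between model reports.

    reports: array of shape (n_samples, n_models) of predicted probabilities.
    Requires n_models >= 2.
    """
    reports = np.asarray(reports, dtype=float)
    corr = np.corrcoef(reports, rowvar=False)      # shape (n_models, n_models)
    n_models = corr.shape[0]
    off_diag = corr[~np.eye(n_models, dtype=bool)]  # drop the diagonal of ones
    return float(off_diag.mean())

# Toy data (hypothetical): every model sees the same shared signal plus its own noise
rng = np.random.default_rng(0)
shared = rng.normal(size=(2000, 1))
noise = rng.normal(size=(2000, 5))
reports = 1.0 / (1.0 + np.exp(-(shared + noise)))  # squash logits to probabilities

rho_hat = mean_pairwise_correlation(reports)
print(f"mean pairwise correlation: {rho_hat:.2f}")
```

A value well above zero would indicate the positively correlated regime in which the paper's result applies; what threshold should trigger concern is a judgment call that the paper does not specify.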

Load-bearing premise

Agents independently optimize their local Brier score reports without coordination and treat the aggregation rule as fixed when choosing their reports.

What would settle it

Comparing the frequency of positive outcomes to the aggregated probability estimate when agents use Brier-optimal reports versus when they report truthfully, in a controlled setting with known positive correlations.
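
A minimal sketch of the comparison described above, assuming equal-width probability bins and a count-weighted absolute gap as the summary statistic. The names `reliability_table` and `calibration_gap` are hypothetical, and the halved reports are only a stand-in for under-reporting, not the paper's Brier-optimal reports.

```python
import numpy as np

def reliability_table(p_agg, y, n_bins=10):
    """Per-bin (mean predicted probability, observed positive fraction, count)."""
    p_agg = np.asarray(p_agg, dtype=float)
    y = np.asarray(y, dtype=float)
    # Equal-width bins on [0, 1]; a prediction of exactly 1.0 goes in the last bin
    idx = np.minimum((p_agg * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip empty bins
            rows.append((p_agg[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows

def calibration_gap(p_agg, y, n_bins=10):
    """Count-weighted mean absolute gap between predicted and observed frequency."""
    rows = reliability_table(p_agg, y, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count * abs(pred - obs) for pred, obs, count in rows) / total

# Toy check (hypothetical data): outcomes are drawn from p_true, so reporting
# p_true is calibrated by construction, while halving it under-reports.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=20000)
y = (rng.uniform(size=p_true.size) < p_true).astype(int)

gap_truthful = calibration_gap(p_true, y)
gap_shaded = calibration_gap(0.5 * p_true, y)
print(f"truthful gap: {gap_truthful:.3f}, under-reported gap: {gap_shaded:.3f}")
```

In an actual test of the paper's claim, `p_true` and `0.5 * p_true` would be replaced by the aggregated probabilities under truthful and Brier-optimal reporting respectively.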

Figures

Figures reproduced from arXiv: 2605.18858 by Zhaohui Wang.

Figure 1: Overview. (a) Individually calibrated agents become collectively miscalibrated under strategic interaction (aggregate bias δ̄ = −0.375). (b) Brier scoring incurs 7.25× PoA in false-negative rate; VCG achieves the lowest PoA among mechanisms studied. (c) VCG outperforms stacking and majority vote at n ≤ 500 (9.4% fewer FNs at n = 100). (d) Pipeline: feature-partitioned agents report probabilities; VCG computes …
Figure 2: Mixed n = 8 ensemble (4 sklearn + 4 LLM prompts) on NSL-KDD. (a) VCG-cal aggregator
Figure 3: Disagreement vs. FN rate. Higher disagreement does not consistently reduce FN.
Figure 4: VCG vs. Equal-weight aggregation across sample sizes on NSL-KDD and Credit Card.
Figure 5: k-LOO approximation tradeoff: relative FN error (%, left axis) and aggregation latency (ms, right axis) vs. number of LOO evaluations k, for n ∈ {5, 10, 20} agents. Discussion. The k-LOO approximation is highly accurate: even k = 1 yields FN error < 0.005 across all n (
Figure 6: FN rate heatmap under adversarial corruption. VCG (bottom row in each panel) maintains
Figure 7: Calibration reliability diagram (NSL-KDD, n = 5 agents). VCG and Bayesian log-odds track the diagonal; simple averaging overestimates at low p̂. We bin predicted probabilities into 10 bins ([0, 0.1), . . . , [0.9, 1.0]) and plot the mean predicted probability against the observed positive fraction. A perfectly calibrated system lies on the diagonal y = x. Key findings: VCG and Bayesian log-odds track the di…
Figure 8: VCG weight adaptation under sudden distribution shift (
Figure 9: PoA vs. observability level k_seen across (n, ρ) configurations. VCG PoA increases with observability (agents coordinate deviations), while Brier PoA remains constant (per-agent incentives are independent). I.5. Generalized Scoring Rules PoA Comparison. We compare the Price of Anarchy across five scoring/aggregation rules (Brier, Log Score, Spherical, Brier+Regularization, and VCG) under best-response dynami…
Figure 10: Price of Anarchy across scoring rules, agent count
Figure 11: Convergence of equilibrium deviation as n grows (Brier scoring). Left axis: PoA (solid) decreases to 0 by n ≥ 50. Right axis: n · δ* (dashed) grows linearly; aggregate miscalibration persists even as per-agent deviations become negligible. I.7. Online Regret Sensitivity. We evaluate the multiplicative-weight online learning algorithm (Theorem 7) across learning rates η and time horizons T, measuring cumula…
Figure 12: Normalized regret R_T / √T under three drift scenarios. Slower learning rates (η*/2, blue) consistently achieve the lowest regret across all scenarios and horizons
Figure 13: Reliability diagrams for each of the three binary datasets. Each panel shows the empirical
Original abstract

Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy greater than one whenever Cov(b_i, b_j) > 0. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false-negative rate reaches 7.25x. In contrast, VCG-based aggregation aligns incentives by rewarding marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. It performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.
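
To make "rewarding marginal contribution" concrete, here is a hedged sketch of a leave-one-out contribution score under simple averaging. It is one simple reading of marginal contribution and is not the paper's VCG payment rule, which this page does not reproduce; the function names and the toy reports are assumptions.

```python
import numpy as np

def brier(p, y):
    """Mean squared difference between probabilities and binary outcomes."""
    return float(np.mean((np.asarray(p, float) - np.asarray(y, float)) ** 2))

def loo_marginal_contributions(reports, y):
    """contribution_i = Brier(average without agent i) - Brier(average of all).

    reports: array of shape (n_samples, n_agents), with n_agents >= 2.
    A positive value means that including agent i lowered the aggregate loss.
    """
    reports = np.asarray(reports, dtype=float)
    full_loss = brier(reports.mean(axis=1), y)
    contributions = []
    for i in range(reports.shape[1]):
        others = np.delete(reports, i, axis=1)  # drop agent i's column
        contributions.append(brier(others.mean(axis=1), y) - full_loss)
    return np.array(contributions)

# Toy example (hypothetical): agents 0 and 1 are informative, agent 2 is anti-informative
y = np.array([1, 0, 1, 0])
reports = np.array([
    [0.9, 0.8, 0.2],
    [0.1, 0.2, 0.8],
    [0.9, 0.8, 0.2],
    [0.1, 0.2, 0.8],
])
contributions = loo_marginal_contributions(reports, y)
print(contributions)
```

Under this score an agent is rewarded for how much its report improves the aggregate rather than for its own standalone Brier score, which is the general intuition behind aligning individual and collective objectives.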

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that individually Brier-calibrated predictors become collectively miscalibrated under Brier-score aggregation when beliefs are positively correlated, because each agent’s myopic best response systematically underestimates the positive-class probability. It proves this underestimation result, shows that the resulting Price of Anarchy exceeds 1 whenever Cov(b_i, b_j) > 0, and reports an empirical PoA of 7.25× in false-negative rate for the canonical parameter set (n=5, pairwise correlation=0.5, base rate=0.3). The work contrasts this with VCG-based aggregation, which is dominant-strategy incentive compatible, and supports the claims with experiments on NSL-KDD, UNSW-NB15, and Credit Card Fraud datasets.

Significance. If the central game-theoretic result holds, the paper identifies a mechanism by which strategic local optimization can induce collective miscalibration even when every individual model is calibrated, with direct implications for ensemble methods and multi-model decision systems. The explicit PoA quantification, the VCG incentive-alignment proposal, and the three-dataset empirical evaluation are concrete strengths that would make the contribution noteworthy in the machine-learning literature.

major comments (2)
  1. [Proof of underestimation (Section 3)] The derivation of the individually optimal report (r_i = n b_i − E[∑_{j≠i} b_j | b_i]) treats the other agents’ reports as fixed at their private beliefs b_j. In a symmetric game the reports must satisfy the fixed-point condition that the assumed r_j equal the equilibrium strategy; substituting the equilibrium strategy back into the conditional expectations changes the bias term and can eliminate or reverse the claimed systematic underestimation. This assumption is load-bearing for both the underestimation theorem and the PoA > 1 claim.
  2. [Canonical setting and PoA measurement (Section 4)] The reported PoA of 7.25× is obtained under the myopic best-response model with a specific parameter triple (n=5, correlation=0.5, base rate=0.3). No sensitivity analysis or equilibrium-consistent re-computation is provided, so it is unclear whether the quantitative result survives the fixed-point correction raised in major comment 1.
minor comments (2)
  1. [Experiments] The PoA figure is presented without error bars or bootstrap intervals; adding these would make the empirical claim more robust.
  2. [Preliminaries] Notation for the aggregation rule (average of reports) and the exact Brier-score objective should be stated explicitly once at the beginning of the formal section.
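
The bootstrap intervals requested in the first minor comment could be computed along these lines. This is a sketch under stated assumptions: it treats the PoA in false-negative rate as the ratio of two false-negative rates on the same labelled cases, uses a percentile bootstrap, and runs on synthetic data. The function names and the 40% and 10% miss rates are hypothetical, not values from the paper.

```python
import numpy as np

def fn_rate(y_true, y_pred):
    """Fraction of actual positives that were predicted negative."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    positives = y_true.sum()
    return float((y_true & ~y_pred).sum() / positives) if positives else float("nan")

def bootstrap_fn_ratio(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for fn_rate(pred_a) / fn_rate(pred_b)."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = y_true.size
    ratios = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        denom = fn_rate(y_true[idx], pred_b[idx])
        if denom > 0:  # skip resamples where the ratio is undefined
            ratios.append(fn_rate(y_true[idx], pred_a[idx]) / denom)
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic data (hypothetical): mechanism A misses about 40% of positives, B about 10%
rng = np.random.default_rng(1)
y = (rng.uniform(size=5000) < 0.3).astype(int)
pred_a = y * ~(rng.uniform(size=y.size) < 0.40)
pred_b = y * ~(rng.uniform(size=y.size) < 0.10)

lo, hi = bootstrap_fn_ratio(y, pred_a, pred_b)
print(f"95% bootstrap interval for the FN-rate ratio: [{lo:.2f}, {hi:.2f}]")
```

Resampling cases jointly, rather than resampling each mechanism's predictions separately, keeps the pairing between the two mechanisms on the same examples.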

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and insightful review. The comments highlight important distinctions between myopic best responses and full Nash equilibrium, which we address below. We outline planned revisions to clarify assumptions and strengthen the quantitative claims.

Point-by-point responses
  1. Referee: [Proof of underestimation (Section 3)] The derivation of the individually optimal report (r_i = n b_i − E[∑_{j≠i} b_j | b_i]) treats the other agents’ reports as fixed at their private beliefs b_j. In a symmetric game the reports must satisfy the fixed-point condition that the assumed r_j equal the equilibrium strategy; substituting the equilibrium strategy back into the conditional expectations changes the bias term and can eliminate or reverse the claimed systematic underestimation. This assumption is load-bearing for both the underestimation theorem and the PoA > 1 claim.

    Authors: We appreciate this observation on the modeling choice. Our analysis is explicitly framed under myopic best-response dynamics, in which each agent optimizes its report while treating others' reports as fixed at their private beliefs. This corresponds to the natural setting of independently trained models on overlapping data, where agents do not coordinate on a joint equilibrium strategy. The underestimation theorem and the resulting PoA > 1 result are derived and stated under this myopic regime, which we believe is the appropriate model for the paper's claims about collective miscalibration. We agree that a symmetric Nash equilibrium would require solving the fixed-point equations. In the revision we will add a clarifying paragraph in Section 3 that explicitly states the myopic assumption, contrasts it with full equilibrium, and notes that the directional bias from positive covariance is expected to persist (though possibly attenuated) under equilibrium play. revision: partial

  2. Referee: [Canonical setting and PoA measurement (Section 4)] The reported PoA of 7.25× is obtained under the myopic best-response model with a specific parameter triple (n=5, correlation=0.5, base rate=0.3). No sensitivity analysis or equilibrium-consistent re-computation is provided, so it is unclear whether the quantitative result survives the fixed-point correction raised in major comment 1.

    Authors: We acknowledge that the 7.25× figure is presented for a single canonical parameter set under the myopic model. In the revised manuscript we will expand Section 4 with a sensitivity analysis over ranges of n, pairwise correlation, and base rate, still under myopic best responses. In addition, we will numerically solve the symmetric fixed-point equations for the equilibrium reports at the canonical parameters and report the resulting PoA value, thereby directly addressing whether the quantitative conclusion is robust to the equilibrium correction. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is explicit game-theoretic model with independent simulation

Full rationale

The paper derives the underestimation and PoA > 1 directly from the closed-form solution to each agent's local Brier minimization treating other reports as fixed at b_j, then computes the resulting false-negative-rate ratio on explicitly chosen parameters (n=5, correlation=0.5, base rate=0.3). This is a forward derivation from stated assumptions rather than any reduction of the target quantity to a fitted input, self-citation chain, or definitional equivalence. The empirical PoA figure is a simulation output under those parameters, not a prediction forced by reusing the same data or equilibrium fixed-point. No load-bearing self-citations, ansatzes, or renamings appear in the derivation chain.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on a game-theoretic model of strategic reporting and on specific illustrative parameter values. No new physical entities are postulated.

free parameters (3)
  • pairwise correlation = 0.5
    Chosen value of 0.5 used to compute the canonical PoA example
  • base rate = 0.3
    Chosen value of 0.3 used to compute the canonical PoA example
  • number of agents = 5
    Chosen value of n=5 used to compute the canonical PoA example
axioms (2)
  • domain assumption: Each agent independently selects the report that minimizes its expected Brier score given the fixed aggregation rule
    Invoked to derive the systematic underestimation of positive-class probability
  • domain assumption: Beliefs are positively correlated across agents
    Required for the PoA to exceed one


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Under the Brier score mechanism with n ≥ 2 agents whose beliefs are correlated and outcome Pr(y = 1 | b_1, ..., b_n) = (1/n) ∑_j b_j, reporting m_i = b_i is not the Brier-optimal strategy. The Brier-optimal report for agent i is m*_i = E[y | b_i] ≠ b_i.
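
The form of the Brier-optimal report in the quoted passage follows from a standard property of squared loss. A short derivation, using only the passage's own symbols and assuming agent i is scored on its own report m against the outcome y:

```latex
\begin{align*}
m_i^{*} &= \arg\min_{m}\; \mathbb{E}\big[(m - y)^2 \mid b_i\big], \\
\frac{\partial}{\partial m}\,\mathbb{E}\big[(m - y)^2 \mid b_i\big]
  &= 2\big(m - \mathbb{E}[y \mid b_i]\big) = 0
  \;\Longrightarrow\; m_i^{*} = \mathbb{E}[y \mid b_i], \\
\mathbb{E}[y \mid b_i]
  &= \mathbb{E}\Big[\tfrac{1}{n}\sum_{j} b_j \,\Big|\, b_i\Big]
  = \tfrac{1}{n}\Big(b_i + \sum_{j \neq i} \mathbb{E}[b_j \mid b_i]\Big).
\end{align*}
```

The last line uses the tower property together with the passage's outcome model. It shows that m*_i equals b_i only when the average of E[b_j | b_i] over the other agents equals b_i itself, so the gap between the two depends on how the other agents' beliefs co-vary with b_i.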

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages

  1. [1]

    A detailed analysis of the

    Tavallaee, Mahbod and Bagheri, Ebrahim and Lu, Wei and Ghorbani, Ali A , booktitle=. A detailed analysis of the. 2009 , doi=

  2. [2]

    Expert Systems with Applications , volume=

    Learned lessons in credit card fraud detection from a practitioner perspective , author=. Expert Systems with Applications , volume=. 2014 , doi=

  3. [3]

    Journal of the American Statistical Association , volume=

    Strictly proper scoring rules, prediction, and estimation , author=. Journal of the American Statistical Association , volume=. 2007 , doi=

  4. [4]

    Advances in Neural Information Processing Systems , volume=

    Truthful data acquisition via peer prediction , author=. Advances in Neural Information Processing Systems , volume=

  5. [5]

    Proceedings of the 18th ACM Conference on Economics and Computation , pages=

    Machine-learning aided peer prediction , author=. Proceedings of the 18th ACM Conference on Economics and Computation , pages=. 2017 , doi=

  6. [6]

    Management Science , volume=

    Eliciting informative feedback: The peer-prediction method , author=. Management Science , volume=. 2005 , doi=

  7. [7]

    Proceedings of the 13th ACM Conference on Electronic Commerce , pages=

    Peer prediction without a common prior , author=. Proceedings of the 13th ACM Conference on Electronic Commerce , pages=. 2012 , doi=

  8. [8]

    Journal of Computer and System Sciences , volume=

    A decision-theoretic generalization of on-line learning and an application to boosting , author=. Journal of Computer and System Sciences , volume=. 1997 , doi=

  9. [9]

    Econometrica , volume=

    Incentives in teams , author=. Econometrica , volume=. 1973 , doi=

  10. [10]

    Monthly Weather Review , volume=

    Verification of forecasts expressed in terms of probability , author=. Monthly Weather Review , volume=. 1950 , doi=

  11. [11]

    The well-calibrated

    Dawid, A Philip , journal=. The well-calibrated. 1982 , doi=

  12. [12]

    Journal of the Royal Statistical Society: Series D (The Statistician) , volume=

    The comparison and evaluation of forecasters , author=. Journal of the Royal Statistical Society: Series D (The Statistician) , volume=. 1983 , doi=

  13. [13]

    Proceedings of the 34th International Conference on Machine Learning , pages=

    On calibration of modern neural networks , author=. Proceedings of the 34th International Conference on Machine Learning , pages=

  14. [14]

    Multiple Classifier Systems , series=

    Ensemble methods in machine learning , author=. Multiple Classifier Systems , series=. 2000 , publisher=

  15. [15]

    2007 , publisher=

    Algorithmic Game Theory , author=. 2007 , publisher=

  16. [16]

    npj Digital Medicine , volume=

    Scalable and accurate deep learning with electronic health records , author=. npj Digital Medicine , volume=. 2018 , doi=

  17. [17]

    Nature , volume=

    A clinically applicable approach to continuous prediction of future acute kidney injury , author=. Nature , volume=. 2019 , doi=

  18. [18]

    Advances in Neural Information Processing Systems , volume=

    Simple and scalable predictive uncertainty estimation using deep ensembles , author=. Advances in Neural Information Processing Systems , volume=

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    Bayesian deep learning and a probabilistic perspective of generalization , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    The Lancet Digital Health , volume=

    The myth of generalisability in clinical research and machine learning in health care , author=. The Lancet Digital Health , volume=. 2020 , doi=

  21. [21]

    Science , volume=

    Dissecting racial bias in an algorithm used to manage the health of populations , author=. Science , volume=. 2019 , doi=

  22. [22]

    American Journal of Cardiology , volume=

    International application of a new probability algorithm for the diagnosis of coronary artery disease , author=. American Journal of Cardiology , volume=. 1989 , doi=

  23. [23]

    Using the

    Smith, Jack W and Everhart, James E and Dickson, W C and Knowler, William C and Johannes, Robert S , booktitle=. Using the

  24. [24]

    Advances in Neural Information Processing Systems , volume=

    Deep sets , author=. Advances in Neural Information Processing Systems , volume=

  25. [25]

    Proceedings of the 4th International Conference on Information Systems Security and Privacy , pages=

    Toward generating a new intrusion detection dataset and intrusion traffic characterization , author=. Proceedings of the 4th International Conference on Information Systems Security and Privacy , pages=. 2018 , doi=

  26. [26]

    2015 , doi=

    Moustafa, Nour and Slay, Jill , booktitle=. 2015 , doi=

  27. [27]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Arik, Sercan. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2021 , doi=

  28. [28]

    Statistical Science , volume=

    Combining probability distributions: A critique and an annotated bibliography , author=. Statistical Science , volume=. 1986 , doi=

  29. [29]

    The Journal of Finance , volume=

    Counterspeculation, auctions, and competitive sealed tenders , author=. The Journal of Finance , volume=. 1961 , doi=

  30. [30]

    Public Choice , volume=

    Multipart pricing of public goods , author=. Public Choice , volume=. 1971 , doi=

  31. [31]

    Annual Symposium on Theoretical Aspects of Computer Science , series=

    Worst-case equilibria , author=. Annual Symposium on Theoretical Aspects of Computer Science , series=. 1999 , publisher=

  32. [32]

    Machine learning with adversaries:

    Blanchard, Peva and El Mhamdi, El Mahdi and Guerraoui, Rachid and Stainer, Julien , booktitle=. Machine learning with adversaries:

  33. [33]

    Proceedings of the 35th International Conference on Machine Learning , pages=

    Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates , author=. Proceedings of the 35th International Conference on Machine Learning , pages=

  34. [34]

    Proceedings of the 20th International Conference on Artificial Intelligence and Statistics , pages=

    Communication-Efficient Learning of Deep Networks from Decentralized Data , author=. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics , pages=

  35. [35]

    Karimireddy, Sai Praneeth and Kale, Satyen and Mohri, Mehryar and Reddi, Sashank and Stich, Sebastian and Suresh, Ananda Theertha , booktitle=

  36. [36]

    2006 , publisher=

    Prediction, Learning, and Games , author=. 2006 , publisher=

  37. [37]

    Contributions to the Theory of Games , editor=

    A value for n-person games , author=. Contributions to the Theory of Games , editor=. 1953 , publisher=

  38. [38]

    Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

    Combining probability forecasts , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=. 2010 , doi=

  39. [39]

    Maximum likelihood estimation of observer error-rates using the

    Dawid, A Philip and Skene, Allan M , journal=. Maximum likelihood estimation of observer error-rates using the. 1979 , doi=

  40. [40]

    Journal of the ACM , volume=

    Intrinsic Robustness of the Price of Anarchy , author=. Journal of the ACM , volume=. 2015 , doi=

  41. [41]

    Can You Trust Your Model's Uncertainty?

    Ovadia, Yaniv and Fertig, Emily and Ren, Jie and Nado, Zachary and Sculley, D and Nowozin, Sebastian and Dillon, Joshua V and Lakshminarayanan, Balaji and Snoek, Jasper , booktitle=. Can You Trust Your Model's Uncertainty?

  42. [42]

    IEEE Signal Processing Magazine , volume=

    Federated Learning: Challenges, Methods, and Future Directions , author=. IEEE Signal Processing Magazine , volume=. 2020 , doi=

  43. [43]

    International Conference on Machine Learning , pages=

    Online Learning under Delayed Feedback , author=. International Conference on Machine Learning , pages=

  44. [44]

    RAND Memorandum RM-2651 , year=

    Values of Large Games, IV: Evaluating the Electoral College by Montecarlo Techniques , author=. RAND Memorandum RM-2651 , year=

  45. [45]

    TabNet: Attentive Interpretable Tabular Learning.,

    Sercan \"O Arik and Tomas Pfister. TabNet : Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679--6687, 2021. doi:10.1609/aaai.v35i8.16826

  46. [46]

    Machine learning with adversaries: Byzantine tolerant gradient descent

    Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017

  47. [47]

    Verification of forecasts expressed in terms of probability

    Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78 0 (1): 0 1--3, 1950. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

  48. [48]

    Prediction, Learning, and Games

    Nicol \`o Cesa-Bianchi and G \'a bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. doi:10.1017/CBO9780511546921

  49. [49]

    Truthful data acquisition via peer prediction

    Yiling Chen, Yiheng Shen, and Shuran Zheng. Truthful data acquisition via peer prediction. In Advances in Neural Information Processing Systems, volume 33, pages 18879--18889, 2020

  50. [50]

    Multipart pricing of public goods

    Edward H Clarke. Multipart pricing of public goods. Public Choice, 11 0 (1): 0 17--33, 1971. doi:10.1007/BF01726210

  51. [51]

    Learned lessons in credit card fraud detection from a practitioner perspective

    Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge Waterschoot, and Gianluca Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41 0 (10): 0 4915--4928, 2014. doi:10.1016/j.eswa.2014.02.026

  52. [52]

    The well-calibrated B ayesian

    A Philip Dawid. The well-calibrated B ayesian. Journal of the American Statistical Association, 77 0 (379): 0 605--610, 1982. doi:10.1080/01621459.1982.10477856

  53. [53]

    Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

    A Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28 0 (1): 0 20--28, 1979. doi:10.2307/2346806

  54. [54]

    DeGroot and Stephen E

    Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32 0 (1-2): 0 12--22, 1983. doi:10.2307/2987588

  55. [55]

    International application of a new probability algorithm for the diagnosis of coronary artery disease

    Robert Detrano, Ales Jan s a, Walter Steinbrunn, Matthias Pfisterer, Johann-Jakob Schmid, Sarbjit Sandhu, Kern H Guppy, Stella Lee, and Victor Froelicher. International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64 0 (5): 0 304--310, 1989. doi:10.1016/0002-9149(89)90524-9

  56. [56]

    Ensemble methods in machine learning

    Thomas G Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, Lecture Notes in Computer Science, pages 1--15. Springer, 2000. doi:10.1007/3-540-45014-9_1

  57. [57]

    Freund, R

    Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 0 (1): 0 119--139, 1997. doi:10.1006/jcss.1997.1504

  58. [58]

    The myth of generalisability in clinical research and machine learning in health care

    Joseph Futoma, Morgan Siber, and Jonathan A Quinn. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health, 2 0 (9): 0 e489--e492, 2020. doi:10.1016/S2589-7500(20)30186-2

  59. [59]

    Hastie and R

    Christian Genest and James V Zidek. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1 0 (1): 0 114--135, 1986. doi:10.1214/ss/1177013825

  60. [60]

    Strictly

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102 0 (477): 0 359--378, 2007. doi:10.1198/016214506000001437

  61. [61]

    Incentives in teams

    Theodore Groves. Incentives in teams. Econometrica, 41 0 (4): 0 617--631, 1973. doi:10.2307/1914085

  62. [62]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321--1330, 2017

  63. [63]

    Online learning under delayed feedback

    Pooria Joulani, Andras Gyorgy, and Csaba Szepesvari. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453--1461, 2013

  64. [64]

    SCAFFOLD : Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD : Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, pages 5132--5143, 2020

  65. [65]

    Worst-case equilibria

    Elias Koutsoupias and Christos Papadimitriou. Worst-case equilibria. In Annual Symposium on Theoretical Aspects of Computer Science, volume 1563 of Lecture Notes in Computer Science, pages 404--413. Springer, 1999. doi:10.1007/3-540-49116-3_38

  66. [66]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017

  67. [67]

    Federated learning: Challenges, methods, and future directions

    Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50--60, 2020. doi:10.1109/MSP.2020.2975749

  68. [68]

    Machine-learning aided peer prediction

    Yang Liu and Yiling Chen. Machine-learning aided peer prediction. In Proceedings of the 18th ACM Conference on Economics and Computation, pages 63--80, 2017. doi:10.1145/3033274.3085126

  69. [69]

    Irwin Mann and Lloyd S. Shapley. Values of large games, IV: Evaluating the electoral college by Montecarlo techniques. RAND Memorandum RM-2651, 1960

  70. [70]

    Communication-efficient learning of deep networks from decentralized data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273--1282, 2017

  71. [71]

    Eliciting informative feedback: The peer-prediction method

    Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359--1373, 2005. doi:10.1287/mnsc.1050.0379

  72. [72]

    UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)

    Nour Moustafa and Jill Slay. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Military Communications and Information Systems Conference, pages 1--6, 2015. doi:10.1109/MilCIS.2015.7348942

  73. [73]

    Algorithmic Game Theory

    Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007. doi:10.1017/CBO9780511800481

  74. [74]

    Dissecting racial bias in an algorithm used to manage the health of populations

    Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447--453, 2019. doi:10.1126/science.aax2342

  75. [75]

    Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift

    Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, volume 32, 2019

  76. [76]

    Scalable and accurate deep learning with electronic health records

    Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, 2018. doi:10.1038/s41746-018-0029-1

  77. [77]

    Combining probability forecasts

    Roopesh Ranjan and Tilmann Gneiting. Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):71--91, 2010. doi:10.1111/j.1467-9868.2009.00726.x

  78. [78]

    Intrinsic robustness of the price of anarchy

    Tim Roughgarden. Intrinsic robustness of the price of anarchy. Journal of the ACM, 62(5):1--42, 2015. doi:10.1145/2806883

  79. [79]

    A value for n-person games

    Lloyd S Shapley. A value for n-person games. In Harold W Kuhn and Albert W Tucker, editors, Contributions to the Theory of Games, volume 2, pages 307--317. Princeton University Press, 1953. doi:10.1515/9781400881970-018

  80. [80]

    Toward generating a new intrusion detection dataset and intrusion traffic characterization

    Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy, pages 108--116, 2018. doi:10.5220/0006639801080116
