arxiv: 2606.08084 · v1 · pith:SSFIASYNnew · submitted 2026-06-06 · 🧮 math.ST · stat.AP · stat.ML · stat.TH

Assessing model calibration with boosting trees

Pith reviewed 2026-06-27 19:18 UTC · model grok-4.3

classification 🧮 math.ST · stat.AP · stat.ML · stat.TH
keywords calibration · auto-calibration · boosting trees · regression modeling · insurance pricing · statistical testing · conditional expectation

The pith

Boosting trees can test necessary conditions for calibration and auto-calibration in regression models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that boosting trees can check necessary conditions for a regression model to be calibrated, so that predicted means match true conditional means for almost all feature sets. It also covers the weaker auto-calibration property, under which observations sharing the same predicted mean have an expectation that equals that prediction. This matters for applications such as insurance pricing, where auto-calibration prevents cross-subsidization across price groups even when full calibration remains out of reach with finite noisy data. The approach is backed by a numerical study on a large insurance dataset in which the tests demonstrate high power to detect violations.
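
The two properties described above can be written compactly. The notation is assumed here rather than quoted from the paper: $Y$ is the response, $X$ the feature vector, and $\hat\mu$ the fitted regression function.

```latex
\begin{align*}
\text{calibration:} \quad & \hat\mu(X) = \mathbb{E}[Y \mid X] \quad \text{a.s.},\\
\text{auto-calibration:} \quad & \hat\mu(X) = \mathbb{E}[Y \mid \hat\mu(X)] \quad \text{a.s.}
\end{align*}
```

Calibration implies auto-calibration by the tower property, since $\hat\mu(X)$ is a function of $X$. The converse fails: the constant function $\hat\mu \equiv \mathbb{E}[Y]$ is auto-calibrated but ignores the features entirely.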

Core claim

Boosting trees can be used to test necessary conditions for calibration and auto-calibration, respectively. The practical relevance of our approach is supported by a numerical example, in which the proposed tests prove to be very powerful on a large insurance dataset.

What carries the argument

Boosting trees applied to test necessary conditions for calibration and auto-calibration of a regression function.
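
One standard way to turn both properties into testable necessary conditions is orthogonality of the residual to functions of the conditioning variable. Whether the paper uses exactly this form is an assumption here; the notation ($Y$ response, $X$ features, $\hat\mu$ regression function, $h$ and $g$ arbitrary bounded measurable functions, $Y$ integrable) is likewise assumed.

```latex
\begin{align*}
\hat\mu(X) = \mathbb{E}[Y \mid X] \ \text{a.s.}
  &\implies \mathbb{E}\big[(Y - \hat\mu(X))\, h(X)\big] = 0,\\
\hat\mu(X) = \mathbb{E}[Y \mid \hat\mu(X)] \ \text{a.s.}
  &\implies \mathbb{E}\big[(Y - \hat\mu(X))\, g(\hat\mu(X))\big] = 0.
\end{align*}
```

Both implications follow from the tower property. Under this reading, a boosting tree fitted to the residuals on one part of the data is a natural data-driven candidate for $h$ or $g$, and the corresponding identity can then be checked on the remaining data.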

If this is right

  • A model that passes the tests shows no detectable violation of the necessary conditions for predicted means to match true conditional means; passing does not by itself establish calibration.
  • The auto-calibration test can flag cross-subsidization between price cohorts in insurance applications; a pass supports, but does not prove, its absence.
  • The method remains applicable even when perfect calibration cannot be achieved because of finite samples and noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boosting-tree tests could be applied to compare calibration properties across different regression fitting procedures on the same dataset.
  • Repeated application of the tests during model development might identify feature transformations that improve satisfaction of the necessary conditions.
  • If the tests reject on a given model, retraining with added constraints that enforce the tested identities could be explored as a corrective step.

Load-bearing premise

The boosting-tree procedure itself does not introduce bias or dependence that would invalidate the test when applied to finite noisy samples.

What would settle it

Generate data from a regression model known to violate calibration or auto-calibration and apply the boosting-tree tests. If the tests fail to reject the null hypothesis of no violation at realistic sample sizes, the claimed power does not hold.
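
The experiment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's test: the data-generating process (one uniform feature, true mean 1 + x, Gaussian noise), the sample-split statistic, and the helper names `fit_stump`, `boost`, and `split_test_statistic` are all hypothetical. Boosted stumps are fitted to the residuals on one half of the data, and the other half is used to test whether the residuals are orthogonal to the learned function.

```python
import math
import random

def fit_stump(xs, rs):
    """Least-squares single split on a 1-D feature: (threshold, left_mean, right_mean)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total, left = sum(rs), 0.0
    best_gain, best = -1.0, (math.inf, total / n, total / n)
    for k in range(1, n):
        left += rs[order[k - 1]]
        if xs[order[k]] == xs[order[k - 1]]:
            continue  # cannot split between equal feature values
        # Minimising squared error is equivalent to maximising this gain.
        gain = left * left / k + (total - left) ** 2 / (n - k)
        if gain > best_gain:
            thr = 0.5 * (xs[order[k - 1]] + xs[order[k]])
            best_gain, best = gain, (thr, left / k, (total - left) / (n - k))
    return best

def boost(xs, rs, rounds=30, rate=0.1):
    """Boosted stumps fitted to the residuals rs; returns a prediction function."""
    stumps, fitted = [], [0.0] * len(xs)
    for _ in range(rounds):
        thr, lo, hi = fit_stump(xs, [r - f for r, f in zip(rs, fitted)])
        stumps.append((thr, lo, hi))
        fitted = [f + rate * (lo if x <= thr else hi) for x, f in zip(xs, fitted)]
    return lambda x: sum(rate * (lo if x <= thr else hi) for thr, lo, hi in stumps)

def split_test_statistic(xs, ys, mu_hat, rng):
    """Learn h on one half, then studentise the mean of (y - mu_hat(x)) * h(x) on the other."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    h = boost([xs[i] for i in train], [ys[i] - mu_hat(xs[i]) for i in train])
    z = [(ys[i] - mu_hat(xs[i])) * h(xs[i]) for i in test]
    m = len(z)
    mean = sum(z) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / (m - 1))
    return math.sqrt(m) * mean / sd if sd > 0 else 0.0

rng = random.Random(0)
n = 4000
xs = [rng.random() for _ in range(n)]
ys = [1.0 + x + rng.gauss(0.0, 0.5) for x in xs]  # assumed truth: E[Y | X = x] = 1 + x

t_cal = split_test_statistic(xs, ys, lambda x: 1.0 + x, rng)  # calibrated model
t_mis = split_test_statistic(xs, ys, lambda x: 1.5, rng)      # constant model, not calibrated
print(f"calibrated model:    T = {t_cal:.2f}")
print(f"miscalibrated model: T = {t_mis:.2f}")
```

Because the function h is learned on data that are independent of the evaluation half, the statistic is approximately standard normal when the model is calibrated, and large positive values count as evidence against calibration. The constant model is deliberately chosen: it equals the overall mean, so it is auto-calibrated in population but not calibrated, which is the kind of violation only a test against the full feature vector can detect.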

Figures

Figures reproduced from arXiv: 2606.08084 by Selim Gatti.

Figure 1: (a) The large rectangle represents the full dataset ...
Figure 2: Histogram of the true annual frequency of claims for each policy. The ...
Figure 3: Plots of the mean estimates against the true conditional means on the test set ...
Figure 4: Plots of the mean estimates against the true conditional means on the test set ...
Figure 5: Plots of binned averages of responses (actuals) against binned averages of mean ...
Figure 6: Plots of binned averages of responses (actuals) against binned averages of mean ...
Figure 7: The empirical density of the test statistics ...
Figure 8: The empirical density of the test statistics ...
Figure 9: Number of rejections of the null-hypothesis of calibration and auto-calibration, re...
Figure 10: Calibration plots of two different models. Only the case where the learning set ...
Original abstract

The main goal in regression modelling consists in approximating the conditional mean of a response given a set of features. A regression function is said to be calibrated if the resulting mean estimates match the true conditional means for almost every set of features. Aiming for calibration seems not achievable in practice as one typically deals with finite samples of noisy observations. A weaker notion of calibration is auto-calibration, and it means that the expectation of responses being given the same mean estimate matches this estimate. This notion is important, e.g., in insurance pricing as it ensures no cross-subsidization between different price cohorts. In this paper, we show that boosting trees can be used to test necessary conditions for calibration and auto-calibration, respectively. The practical relevance of our approach is supported by a numerical example, in which the proposed tests prove to be very powerful on a large insurance dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that boosting trees can be used to test necessary conditions for calibration (matching true conditional means) and auto-calibration (matching conditional expectations given the predicted mean) in regression models. Auto-calibration is highlighted for its relevance in insurance pricing to avoid cross-subsidization. The approach is supported by a numerical example on a large insurance dataset in which the proposed tests are reported to be very powerful.

Significance. If the boosting-tree tests are shown to be valid without introducing bias or dependence on finite samples, the method would supply a practical, tree-based diagnostic for necessary conditions of calibration that is directly applicable to insurance and similar pricing contexts. The numerical example on real data is a strength if accompanied by proper controls and baselines, but the overall significance hinges on resolving the validity questions for the test construction itself.

major comments (2)
  1. [Abstract / numerical example] Abstract and numerical-example section: the claim that the tests 'prove to be very powerful' rests on a single insurance dataset example, yet no power calculations, error bars, baseline comparisons, or controls for the boosting procedure are described. This prevents verification that the observed power is not an artifact of the example design.
  2. [Method description (auto-calibration test)] Method for auto-calibration test: the boosting-tree procedure used to test the necessary condition for auto-calibration must be shown not to introduce bias or dependence on finite noisy samples. The manuscript should supply either a theoretical argument for validity under the null or simulation evidence that the test size is controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract / numerical example] Abstract and numerical-example section: the claim that the tests 'prove to be very powerful' rests on a single insurance dataset example, yet no power calculations, error bars, baseline comparisons, or controls for the boosting procedure are described. This prevents verification that the observed power is not an artifact of the example design.

    Authors: We agree that the numerical example is limited to a single large insurance dataset and does not include power calculations, error bars, baseline comparisons, or explicit controls for the boosting procedure. While the dataset size supports the observed results, this is a valid concern. In the revision we will add a simulation study with known calibrated and miscalibrated models to quantify power, include bootstrap-based error bars on the insurance results, and provide comparisons against binning and isotonic regression baselines. We will also revise the abstract language from 'prove to be very powerful' to 'demonstrate strong performance'. revision: yes

  2. Referee: [Method description (auto-calibration test)] Method for auto-calibration test: the boosting-tree procedure used to test the necessary condition for auto-calibration must be shown not to introduce bias or dependence on finite noisy samples. The manuscript should supply either a theoretical argument for validity under the null or simulation evidence that the test size is controlled.

    Authors: We acknowledge that the manuscript currently provides neither a theoretical argument nor simulation evidence that the boosting-tree test controls size under the null of auto-calibration. Finite-sample bias in tree construction is a legitimate issue. We will add a simulation study in the revised version that generates data under the null for a range of sample sizes and noise levels, applies the test, and reports empirical rejection rates to confirm size control at the nominal level. revision: yes
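
A size check of the kind promised in this response could look roughly as follows. This is a hedged sketch, not the authors' planned study: it swaps the boosting tree for a single regression stump to stay short, simulates residuals directly as pure noise (the null of a calibrated model), and uses the hypothetical helper names `fit_stump`, `null_statistic`, and `empirical_size`. The sample size, replicate count, and one-sided 5% critical value 1.645 are illustrative choices.

```python
import math
import random

def fit_stump(xs, rs):
    """Best least-squares single split on a 1-D feature; returns a predictor."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total, left = sum(rs), 0.0
    best_gain, best = -1.0, (math.inf, total / n, total / n)
    for k in range(1, n):
        left += rs[order[k - 1]]
        if xs[order[k]] == xs[order[k - 1]]:
            continue  # cannot split between equal feature values
        gain = left * left / k + (total - left) ** 2 / (n - k)
        if gain > best_gain:
            thr = 0.5 * (xs[order[k - 1]] + xs[order[k]])
            best_gain, best = gain, (thr, left / k, (total - left) / (n - k))
    thr, lo, hi = best
    return lambda x: lo if x <= thr else hi

def null_statistic(n, rng):
    """One replicate under the null: residuals are noise unrelated to the feature."""
    xs = [rng.random() for _ in range(n)]
    res = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stands in for Y - mu_hat(X)
    half = n // 2
    h = fit_stump(xs[:half], res[:half])           # learn h on the first half only
    z = [res[i] * h(xs[i]) for i in range(half, n)]
    m = len(z)
    mean = sum(z) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / (m - 1))
    return math.sqrt(m) * mean / sd if sd > 0 else 0.0

def empirical_size(reps=300, n=400, critical=1.645, seed=1):
    """Share of null replicates in which the one-sided test rejects."""
    rng = random.Random(seed)
    rejections = sum(null_statistic(n, rng) > critical for _ in range(reps))
    return rejections / reps

rate = empirical_size()
print(f"empirical rejection rate at nominal 5%: {rate:.3f}")
```

An empirical rate close to the nominal level across sample sizes and noise levels would support size control; a rate well above it would indicate that the fitting step leaks into the evaluation data.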

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a method using boosting trees to test necessary conditions for calibration and auto-calibration, supported by a numerical example on insurance data. The abstract and available description present this as an independent diagnostic approach without any equations, derivations, or claims that reduce by construction to fitted parameters, self-definitions, or self-citation chains. No load-bearing steps matching the enumerated circularity patterns are identifiable from the provided content, and the central claim does not rely on renaming known results or importing uniqueness via prior author work. The derivation chain is self-contained as a proposed testing procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract. The approach appears to rely on standard properties of boosting and conditional expectation, but these are not itemized.
