Assessing model calibration with boosting trees

Selim Gatti

arxiv: 2606.08084 · v1 · pith:SSFIASYNnew · submitted 2026-06-06 · 🧮 math.ST · stat.AP · stat.ML · stat.TH

Pith reviewed 2026-06-27 19:18 UTC · model grok-4.3

keywords calibration · auto-calibration · boosting trees · regression modeling · insurance pricing · statistical testing · conditional expectation

The pith

Boosting trees can test necessary conditions for calibration and auto-calibration in regression models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that boosting trees can check necessary conditions for a regression model to be calibrated, so that predicted means match true conditional means for almost all feature sets. It also covers the weaker auto-calibration property, under which observations sharing the same predicted mean have an expectation that equals that prediction. This matters for applications such as insurance pricing, where auto-calibration prevents cross-subsidization across price groups even when full calibration remains out of reach with finite noisy data. The approach is backed by a numerical study on a large insurance dataset in which the tests demonstrate high power to detect violations.

Core claim

Boosting trees can be used to test necessary conditions for calibration and auto-calibration, respectively. The practical relevance of our approach is supported by a numerical example, in which the proposed tests prove to be very powerful on a large insurance dataset.

What carries the argument

Boosting trees applied to test necessary conditions for calibration and auto-calibration of a regression function.
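
For orientation, the two properties and the kind of necessary condition they imply can be written out as below. The notation (response $Y$, features $X$, regression function $\hat\mu$, test functions $g$ and $h$) is an assumption made here; the abstract fixes no notation, and the exact conditions tested in the paper are not visible from it.

```latex
\begin{align*}
\text{calibration:} \quad & \hat\mu(X) = \mathbb{E}[Y \mid X] \quad \text{a.s.}, \\
\text{auto-calibration:} \quad & \hat\mu(X) = \mathbb{E}[Y \mid \hat\mu(X)] \quad \text{a.s.}
\end{align*}
% By the tower property, for bounded measurable g and h:
\begin{align*}
\text{calibration} \;\Longrightarrow\; & \mathbb{E}\big[(Y - \hat\mu(X))\, g(X)\big] = 0, \\
\text{auto-calibration} \;\Longrightarrow\; & \mathbb{E}\big[(Y - \hat\mu(X))\, h(\hat\mu(X))\big] = 0 .
\end{align*}
```

Calibration implies auto-calibration, since $\hat\mu(X)$ is a function of $X$. A moment condition of this type is only necessary: a single choice of $g$ or $h$ can satisfy it even when the model is miscalibrated, which is why a flexible learner for the test function is a natural candidate.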

If this is right

Failing to reject is consistent with the model meeting the tested necessary conditions for predicted means to match true conditional means; it does not by itself establish calibration.
A rejection of the auto-calibration test signals that the model is not auto-calibrated, which in insurance applications flags cross-subsidization between price cohorts.
The method remains applicable even when perfect calibration cannot be achieved because of finite samples and noise.
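
The cross-subsidization point in the list above can be made concrete with a small sketch. It is not the paper's procedure: it is a plain binned comparison of actuals against predictions on synthetic data, and every name and number in it (`binned_gaps`, the linear true mean, the shrunk model) is hypothetical. The shrunk model is unbiased on the whole portfolio but not auto-calibrated.

```python
import random
import statistics


def binned_gaps(preds, ys, n_bins=10):
    """Average actual minus average prediction within equal-count bins of the prediction."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    size = len(order) // n_bins
    gaps = []
    for b in range(n_bins):
        stop = (b + 1) * size if b < n_bins - 1 else len(order)
        idx = order[b * size:stop]
        actual = statistics.fmean([ys[i] for i in idx])
        predicted = statistics.fmean([preds[i] for i in idx])
        gaps.append(actual - predicted)
    return gaps


# Synthetic portfolio (assumption): one feature, linear true mean, Gaussian noise.
rng = random.Random(0)
n = 20000
xs = [rng.random() for _ in range(n)]
true_mean = [1.0 + 2.0 * x for x in xs]
ys = [m + rng.gauss(0.0, 1.0) for m in true_mean]

preds_a = true_mean                                   # auto-calibrated by construction
preds_b = [2.0 + 0.5 * (m - 2.0) for m in true_mean]  # shrunk toward the portfolio mean

gaps_a = binned_gaps(preds_a, ys)
gaps_b = binned_gaps(preds_b, ys)
print("model A gaps:", [round(g, 2) for g in gaps_a])
print("model B gaps:", [round(g, 2) for g in gaps_b])
```

For the shrunk model the gap is negative in the cheapest cohort (predictions exceed actuals) and positive in the most expensive one, so the low-price cohort subsidizes the high-price cohort even though the overall price level is right.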

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boosting-tree tests could be applied to compare calibration properties across different regression fitting procedures on the same dataset.
Repeated application of the tests during model development might identify feature transformations that improve satisfaction of the necessary conditions.
If the tests reject on a given model, retraining with added constraints that enforce the tested identities could be explored as a corrective step.

Load-bearing premise

The boosting-tree procedure itself does not introduce bias or dependence that would invalidate the test when applied to finite noisy samples.

What would settle it

Generate data from a regression model known to violate calibration or auto-calibration, apply the boosting-tree tests, and check whether they fail to reject the null hypothesis of no violation.
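
A minimal sketch of such an experiment follows. It is an illustration under stated assumptions, not the paper's test: the abstract does not specify the test statistic, so the sketch uses a sample-split moment test in which stump-based gradient boosting (written from scratch, standard library only) learns a test function `g` from residuals on one half of the data, and a one-sided t-statistic for E[(Y - mu_hat(X)) g(X)] = 0 is computed on the other half. The data-generating process, `fit_stump`, `boost`, `calibration_t_test`, and all tuning values are hypothetical.

```python
import math
import random
import statistics


def fit_stump(xs, rs, min_leaf=20):
    """Least-squares regression stump on one feature: (threshold, left value, right value)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total, left_sum, best = sum(rs), 0.0, None
    for k in range(1, n):
        left_sum += rs[order[k - 1]]
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if k < min_leaf or n - k < min_leaf or lo == hi:
            continue
        # Maximising this gain is equivalent to minimising the squared error of the split.
        gain = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
        if best is None or gain > best[0]:
            best = (gain, 0.5 * (lo + hi), left_sum / k, (total - left_sum) / (n - k))
    return best[1:]


def boost(xs, rs, n_rounds=50, lr=0.1):
    """Gradient boosting with stumps under squared loss; returns the fitted function."""
    stumps, resid = [], list(rs)
    for _ in range(n_rounds):
        thr, left, right = fit_stump(xs, resid)
        stumps.append((thr, lr * left, lr * right))
        resid = [r - (lr * left if x <= thr else lr * right) for x, r in zip(xs, resid)]
    return lambda x: sum(left if x <= thr else right for thr, left, right in stumps)


def calibration_t_test(mu_hat, true_mean, n=2000, sigma=0.5, seed=1):
    """Simulate data, learn g on the first half, test the moment condition on the second."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [true_mean(x) + rng.gauss(0.0, sigma) for x in xs]
    half = n // 2
    g = boost(xs[:half], [y - mu_hat(x) for x, y in zip(xs[:half], ys[:half])])
    z = [(y - mu_hat(x)) * g(x) for x, y in zip(xs[half:], ys[half:])]
    t = statistics.fmean(z) / (statistics.stdev(z) / math.sqrt(len(z)))
    p = 0.5 * math.erfc(t / math.sqrt(2.0))  # one-sided normal approximation
    return t, p


true_mean = lambda x: 1.0 + x
t_bad, p_bad = calibration_t_test(lambda x: 1.5, true_mean)  # ignores the feature
t_ok, p_ok = calibration_t_test(true_mean, true_mean)        # calibrated by construction
print(f"miscalibrated model: t = {t_bad:.2f}, p = {p_bad:.3g}")
print(f"calibrated model:    t = {t_ok:.2f}, p = {p_ok:.3g}")
```

Fitting `g` and evaluating the statistic on disjoint halves is a design choice made here so that, under the null, the held-out products are mean-zero given the fitted `g`. Whether the paper relies on sample splitting or on a different argument cannot be told from the abstract.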

Figures

Figures reproduced from arXiv: 2606.08084 by Selim Gatti.

**Figure 2.** Histogram of the true annual frequency of claims for each policy. The …

**Figure 3.** Plots of the mean estimates against the true conditional means on the test set

**Figure 4.** Plots of the mean estimates against the true conditional means on the test set

**Figure 5.** Plots of binned averages of responses (actuals) against binned averages of mean …

**Figure 6.** Plots of binned averages of responses (actuals) against binned averages of mean …

**Figure 7.** The empirical density of the test statistics

**Figure 8.** The empirical density of the test statistics

**Figure 9.** Number of rejections of the null-hypothesis of calibration and auto-calibration, re…

**Figure 10.** Calibration plots of two different models. Only the case where the learning set …

Original abstract

The main goal in regression modelling consists in approximating the conditional mean of a response given a set of features. A regression function is said to be calibrated if the resulting mean estimates match the true conditional means for almost every set of features. Aiming for calibration seems not achievable in practice as one typically deals with finite samples of noisy observations. A weaker notion of calibration is auto-calibration, and it means that the expectation of responses being given the same mean estimate matches this estimate. This notion is important, e.g., in insurance pricing as it ensures no cross-subsidization between different price cohorts. In this paper, we show that boosting trees can be used to test necessary conditions for calibration and auto-calibration, respectively. The practical relevance of our approach is supported by a numerical example, in which the proposed tests prove to be very powerful on a large insurance dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A boosting-tree test for calibration conditions looks like a useful applied extension, but full method details are needed to assess its validity.

The main takeaway is that the author proposes using boosting trees to test necessary conditions for calibration and auto-calibration in regression. The paper illustrates this on a large insurance dataset and reports that the tests are very powerful.

This connects machine learning tools to a statistical need in a direct way. The focus on auto-calibration is well-chosen because it addresses a real issue in pricing fairness without requiring perfect calibration, which is often impossible with finite data.

The paper does well in presenting a usable diagnostic and supporting it with practical evidence from an applied domain. The numerical example gives some sense of relevance.

What is actually new appears to be this particular application of boosting trees for the testing task. It is not a radical shift but a targeted extension that could be helpful.

Soft spots include the absence of visible theoretical support for why the boosting procedure serves as a valid test. The abstract does not detail how the test statistic is formed or any proof of its properties under the null hypothesis. The power claim on the insurance data would be more convincing with comparisons to other methods or sensitivity checks.

There is no sign of circularity or self-referential issues from the description.

This paper targets readers in statistical modeling and actuarial science who need tools for checking model calibration. It would be of interest to those working on regression in high-stakes applications like insurance.

Given the concrete example and the relevance of the topic, it deserves a serious referee even if the current description is limited.

I would recommend engaging with the work through peer review to clarify the method and strengthen the evidence.

Referee Report

2 major / 0 minor

Summary. The paper claims that boosting trees can be used to test necessary conditions for calibration (matching true conditional means) and auto-calibration (matching conditional expectations given the predicted mean) in regression models. Auto-calibration is highlighted for its relevance in insurance pricing to avoid cross-subsidization. The approach is supported by a numerical example on a large insurance dataset in which the proposed tests are reported to be very powerful.

Significance. If the boosting-tree tests are shown to be valid without introducing bias or dependence on finite samples, the method would supply a practical, tree-based diagnostic for necessary conditions of calibration that is directly applicable to insurance and similar pricing contexts. The numerical example on real data is a strength if accompanied by proper controls and baselines, but the overall significance hinges on resolving the validity questions for the test construction itself.

major comments (2)

[Abstract / numerical example] Abstract and numerical-example section: the claim that the tests 'prove to be very powerful' rests on a single insurance dataset example, yet no power calculations, error bars, baseline comparisons, or controls for the boosting procedure are described. This prevents verification that the observed power is not an artifact of the example design.
[Method description (auto-calibration test)] Method for auto-calibration test: the boosting-tree procedure used to test the necessary condition for auto-calibration must be shown not to introduce bias or dependence on finite noisy samples. The manuscript should supply either a theoretical argument for validity under the null or simulation evidence that the test size is controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract / numerical example] Abstract and numerical-example section: the claim that the tests 'prove to be very powerful' rests on a single insurance dataset example, yet no power calculations, error bars, baseline comparisons, or controls for the boosting procedure are described. This prevents verification that the observed power is not an artifact of the example design.

Authors: We agree that the numerical example is limited to a single large insurance dataset and does not include power calculations, error bars, baseline comparisons, or explicit controls for the boosting procedure. While the dataset size supports the observed results, this is a valid concern. In the revision we will add a simulation study with known calibrated and miscalibrated models to quantify power, include bootstrap-based error bars on the insurance results, and provide comparisons against binning and isotonic regression baselines. We will also revise the abstract language from 'prove to be very powerful' to 'demonstrate strong performance'. revision: yes
Referee: [Method description (auto-calibration test)] Method for auto-calibration test: the boosting-tree procedure used to test the necessary condition for auto-calibration must be shown not to introduce bias or dependence on finite noisy samples. The manuscript should supply either a theoretical argument for validity under the null or simulation evidence that the test size is controlled.

Authors: We acknowledge that the manuscript currently provides neither a theoretical argument nor simulation evidence that the boosting-tree test controls size under the null of auto-calibration. Finite-sample bias in tree construction is a legitimate issue. We will add a simulation study in the revised version that generates data under the null for a range of sample sizes and noise levels, applies the test, and reports empirical rejection rates to confirm size control at the nominal level. revision: yes
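
The size-control study described in this response can be sketched as a small Monte Carlo loop. The sketch is hypothetical throughout: a two-sided z-test of zero mean residual stands in for the boosting-tree test, which the abstract does not specify, and `mean_residual_test`, `empirical_size`, and the sample sizes are illustrative names and values only.

```python
import math
import random
import statistics


def mean_residual_test(residuals, alpha=0.05):
    """Stand-in test: two-sided z-test of H0: E[Y - mu_hat(X)] = 0. Returns True on rejection."""
    n = len(residuals)
    t = statistics.fmean(residuals) / (statistics.stdev(residuals) / math.sqrt(n))
    p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided normal approximation
    return p < alpha


def empirical_size(n, sigma=1.0, n_rep=1000, alpha=0.05, seed=0):
    """Share of rejections over repeated samples drawn under the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        # Under the null the model is calibrated, so residuals are pure noise.
        residuals = [rng.gauss(0.0, sigma) for _ in range(n)]
        rejections += mean_residual_test(residuals, alpha)
    return rejections / n_rep


sizes = {n: empirical_size(n) for n in (50, 200)}
for n, rate in sizes.items():
    print(f"n = {n}: empirical rejection rate = {rate:.3f} (nominal 0.05)")
```

Swapping the stand-in for the actual boosting-tree test would leave the loop unchanged. The quantity to compare against the nominal level is the rejection rate for each sample size and noise level.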

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a method using boosting trees to test necessary conditions for calibration and auto-calibration, supported by a numerical example on insurance data. The abstract and available description present this as an independent diagnostic approach without any equations, derivations, or claims that reduce by construction to fitted parameters, self-definitions, or self-citation chains. No load-bearing steps matching the enumerated circularity patterns are identifiable from the provided content, and the central claim does not rely on renaming known results or importing uniqueness via prior author work. The derivation chain is self-contained as a proposed testing procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract. The approach appears to rely on standard properties of boosting and conditional expectation, but these are not itemized.

