Combining a Large Pool of Forecasts of Value-at-Risk and Expected Shortfall

Chao Wang; James W. Taylor

arxiv: 2508.16919 · v2 · pith:3GB2ZONBnew · submitted 2025-08-23 · 💱 q-fin.RM

Combining a Large Pool of Forecasts of Value-at-Risk and Expected Shortfall

James W. Taylor , Chao Wang This is my paper

Pith reviewed 2026-05-18 21:42 UTC · model grok-4.3

classification 💱 q-fin.RM

keywords value-at-riskexpected shortfallforecast combinationrisk forecastingperformance weightingtrimmed meanfinancial risk management

0 comments

The pith

Combining forecasts from a small diverse set with performance-based weighting improves accuracy for value-at-risk and expected shortfall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests multiple ways to combine forecasts of value-at-risk and expected shortfall when many candidate methods are available. It evaluates simple approaches such as the mean, median, and mode, plus regularized weighting to limit overfitting, and interval-based techniques including trimmed means and a mixtures method. Results from an empirical study with 90 methods show that trimmed means, mixtures, and performance weighting perform well, yet accuracy improves further when the pool is reduced to six diverse methods and performance-based weighting is applied. A reader would care because more accurate risk forecasts support better capital allocation and regulatory compliance in financial institutions.

Core claim

When a large pool of candidate forecasts is available, combining value-at-risk and expected shortfall predictions through trimmed mean combinations, a mixtures approach based on inferred probability distributions, and performance-based weighting yields strong results. Selecting just six methods chosen for diversity and then applying performance-based weighting produces the highest forecasting accuracy overall.

What carries the argument

Performance-based weighting applied to a hand-selected pool of six diverse forecasting methods, together with trimmed means and the mixtures method for joint VaR and ES interval forecasts.

If this is right

A pool of just six diverse methods produces greater forecasting accuracy than the full set of 90 methods.
Performance-based weighting delivers the best overall performance among the tested combination approaches.
Trimmed mean combinations and the mixtures method also deliver particularly strong results.
Regularisation reduces overfitting when many weights must be estimated from the large pool.
Treating VaR and ES jointly as interval forecasts allows adapted combination methods to be applied effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Financial institutions could lower model maintenance costs by retaining only a small number of diverse forecasting approaches rather than maintaining large pools.
The results point to diversity across methods as a key driver of combination success, suggesting that adding similar forecasts may add little value.
Regulators might consider requiring or incentivizing the use of performance-weighted combinations to strengthen risk reporting.

Load-bearing premise

The 90 candidate methods contain enough genuine diversity that a hand-selected subset of six can represent the broader pool without selection bias favoring the reported combination methods.

What would settle it

Re-running the full empirical comparison on a fresh collection of 90 forecasting methods or on data from a later market period would show whether the accuracy advantage of the six-method performance-weighted combination persists.

read the original abstract

We consider the combination of value-at-risk (VaR) and expected shortfall (ES) forecasts when a large pool of candidate forecasts is available. Given the limited literature in this area, we implement a variety of new combining methods. In terms of simplistic methods, in addition to the mean, we consider the median and mode. As a complement to the previously proposed performance-based weighted combinations, we use regularisation to reduce overfitting in the presence of many weights. Treating VaR and ES forecasts jointly as interval forecasts allows the application of adapted interval forecast combination methods, including trimmed means and a mixtures approach based on inferred probability distributions. In an empirical study involving 90 forecasting methods, trimmed mean combinations, the mixtures method, and performance-based weighting delivered particularly strong results. However, greater forecasting accuracy resulted for a pool of just six methods, chosen to ensure diversity, with performance-based weighting producing the best overall performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds regularization and interval mixtures to VaR/ES combination and reports stronger results from performance weighting on a hand-picked six-method subset, but that subset choice needs explicit robustness checks.

read the letter

The main points are that the authors adapt regularization to handle many weights in performance-based combinations and treat VaR/ES pairs as intervals to borrow trimmed-mean and mixture methods from interval forecasting. These are presented as extensions beyond earlier weighting schemes, and the empirical comparison across 90 methods shows the mixtures, trimmed means, and performance weighting performing well, with the best numbers coming from the six-method pool.

Referee Report

2 major / 2 minor

Summary. The paper studies the combination of VaR and ES forecasts from a large pool of 90 candidate methods. It evaluates standard averaging methods (mean, median, mode), regularized performance-based weighting to mitigate overfitting, and interval-forecast adaptations such as trimmed means and a mixtures approach based on inferred distributions. The central empirical claim is that trimmed-mean, mixtures, and performance-based weighting methods perform strongly, with the best results obtained from a hand-selected diverse subset of only six methods under performance-based weighting.

Significance. If the empirical results are robust, the work adds to the sparse literature on combining VaR/ES forecasts by showing practical gains from regularization, trimmed means, and mixtures, and by illustrating that smaller, diverse pools can outperform larger ones. The explicit comparison of multiple combination strategies on a common large candidate set provides useful guidance for risk-management applications.

major comments (2)

[Empirical study] Empirical study section (and abstract): the superiority of performance-based weighting on the six-method pool over the full 90-method pool is load-bearing for the headline result, yet the selection of the six methods is described only as 'chosen to ensure diversity' with no pre-specified protocol, no comparison to randomly sampled or algorithmically diverse subsets of size six, and no robustness check that removes top individual performers before selection. This leaves open the possibility that the reported advantage is partly an artifact of post-selection on the evaluation sample.
[Results] Results section: the abstract and results report strong performance for trimmed means, mixtures, and performance-based weighting, but provide no information on the precise data periods, the loss functions used for ranking and evaluation, statistical significance tests for differences across methods, or out-of-sample robustness checks across sub-periods or market regimes.

minor comments (2)

[Methodology] Clarify the exact regularization penalty and the cross-validation procedure used to choose the regularization parameter in the performance-based weighting method.
[Empirical study] Add a table or figure that reports the individual performance of the six selected methods versus the full pool to allow readers to assess the diversity claim directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify and strengthen the presentation of our empirical results. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Empirical study] Empirical study section (and abstract): the superiority of performance-based weighting on the six-method pool over the full 90-method pool is load-bearing for the headline result, yet the selection of the six methods is described only as 'chosen to ensure diversity' with no pre-specified protocol, no comparison to randomly sampled or algorithmically diverse subsets of size six, and no robustness check that removes top individual performers before selection. This leaves open the possibility that the reported advantage is partly an artifact of post-selection on the evaluation sample.

Authors: The six methods were chosen prior to the full out-of-sample evaluation to represent distinct methodological families (parametric, nonparametric, and semi-parametric approaches with varying distributional assumptions). We agree, however, that the manuscript provides insufficient detail on the selection criteria and does not include the robustness checks suggested. In the revision we will add an explicit subsection describing the pre-specified diversity criteria, report results for randomly drawn subsets of size six, and include a check that removes the strongest individual performers before re-selecting a diverse six-method pool. These additions will directly address the concern about post-selection bias. revision: yes
Referee: [Results] Results section: the abstract and results report strong performance for trimmed means, mixtures, and performance-based weighting, but provide no information on the precise data periods, the loss functions used for ranking and evaluation, statistical significance tests for differences across methods, or out-of-sample robustness checks across sub-periods or market regimes.

Authors: We accept that greater transparency is needed. The current draft summarizes the overall sample but does not list exact start and end dates for the evaluation window, does not name the specific loss functions used for both ranking and final scoring, omits formal significance tests, and does not break results by sub-periods or volatility regimes. In the revised manuscript we will insert these details: precise sample dates, the loss functions employed, Diebold-Mariano tests for pairwise comparisons, and additional tables/figures showing performance in distinct market regimes. These changes will make the empirical claims fully reproducible and testable. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical forecast combination study

full rationale

The paper reports results from an empirical comparison of combination methods for VaR and ES forecasts across a pool of 90 methods and a hand-selected subset of six. All performance claims are obtained by applying the methods to held-out data and measuring accuracy metrics directly; no derivation, equation, or first-principles result is presented that reduces to its own inputs by construction, nor does any central claim rest on a self-citation chain or fitted parameter renamed as a prediction. The analysis is therefore self-contained against external benchmarks and receives a circularity score of zero.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The paper is an empirical forecast-combination study; its central claims rest on the assumption that the chosen data set is representative and that the 90 methods are sufficiently diverse. No new mathematical axioms or invented entities are introduced. Free parameters such as regularisation strength and performance weights are fitted to data but not quantified in the abstract.

free parameters (2)

regularisation parameter
Used to reduce overfitting when combining many forecasts; value not stated in abstract.
performance weights
Fitted to past forecast accuracy; central to the best reported method.

pith-pipeline@v0.9.0 · 5683 in / 1204 out tokens · 29449 ms · 2026-05-18T21:42:20.078758+00:00 · methodology

Combining a Large Pool of Forecasts of Value-at-Risk and Expected Shortfall

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)