WTMAD-4: A Fair Weighting Scheme for GMTKN55
Pith reviewed 2026-05-22 13:12 UTC · model grok-4.3
The pith
A new WTMAD-4 weighting scheme for the GMTKN55 benchmark set corrects earlier WTMAD definitions that under-weight some component benchmarks by orders of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that existing WTMAD definitions under-weight certain GMTKN55 benchmarks by orders of magnitude. It introduces WTMAD-4, whose weights are set from the typical errors of ten minimally empirical dispersion-corrected density-functional approximations, to produce fair treatment across all benchmarks. Reassessment of 115 DFAs with WTMAD-4 then shows that a functional previously optimised by minimising WTMAD-2 underperforms on the benchmarks that had been marginalised by the older metric.
What carries the argument
WTMAD-4, a weighted mean absolute deviation whose component weights are chosen inversely to the typical errors observed for a fixed set of ten minimally empirical dispersion-corrected density functionals.
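The weighting idea can be illustrated with a minimal sketch. This is not the paper's exact formula: it assumes that the "typical error" of a benchmark is the mean of the reference DFAs' mean absolute deviations (MADs) and that the composite is a plain average of weight times MAD. The function names, benchmark names, and numbers are hypothetical.

```python
from statistics import mean


def error_based_weights(reference_mads):
    """Map each benchmark to 1 / (typical error of the reference DFAs).

    Assumption: 'typical error' is taken as the mean MAD over the reference set.
    """
    typical = {name: mean(values) for name, values in reference_mads.items()}
    return {name: 1.0 / err for name, err in typical.items()}


def weighted_mad(mads, weights):
    """Average of weight * MAD over all benchmarks (assumed normalisation)."""
    return sum(weights[name] * mads[name] for name in mads) / len(mads)


# Hypothetical MADs (kcal/mol) of three reference DFAs on two benchmarks.
reference_mads = {
    "bench_A": [4.0, 5.0, 6.0],  # typical error about 5.0
    "bench_B": [0.4, 0.5, 0.6],  # typical error about 0.5
}
weights = error_based_weights(reference_mads)

# A DFA with exactly typical errors: each benchmark contributes equally.
typical_dfa = {"bench_A": 5.0, "bench_B": 0.5}
score = weighted_mad(typical_dfa, weights)
print(weights, score)
```

Under these assumptions a functional with typical errors scores close to 1, and each benchmark contributes the same share to the total regardless of its absolute energy scale.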
If this is right
- DFAs that were optimised by minimising older WTMAD variants will show weaker results on the benchmarks that had received negligible weight.
- Overall performance rankings of density functionals shift when the fairer weighting is used instead of WTMAD-2 or WTMAD-3.
- The new metric gives comparable influence to small-molecule thermochemistry, reaction barriers, and non-covalent interaction tests.
- Literature functionals tuned to the old weighting scheme require re-evaluation for balanced accuracy across the full GMTKN55 collection.
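The ranking shift described in the list above can be sketched with two hypothetical functionals and two hypothetical weight sets. None of the numbers come from the paper; `old_weights` merely mimics a scheme that gives one benchmark negligible weight, and the composite is an assumed plain average of weight times MAD.

```python
def composite(mads, weights):
    """Average of weight * MAD over the weighted benchmarks (assumed form)."""
    return sum(weights[b] * mads[b] for b in weights) / len(weights)


def ranking(dfa_mads, weights):
    """Return DFA names sorted from best (lowest score) to worst."""
    return sorted(dfa_mads, key=lambda name: composite(dfa_mads[name], weights))


# Hypothetical weights: the old scheme nearly ignores bench_B.
old_weights = {"bench_A": 1.0, "bench_B": 0.01}
new_weights = {"bench_A": 0.2, "bench_B": 2.0}

# Hypothetical MADs: dfa_tuned was optimised where the old scheme put its weight.
dfa_mads = {
    "dfa_tuned": {"bench_A": 3.0, "bench_B": 2.0},
    "dfa_balanced": {"bench_A": 4.0, "bench_B": 0.5},
}

old_order = ranking(dfa_mads, old_weights)
new_order = ranking(dfa_mads, new_weights)
print(old_order, new_order)
```

In this toy case the tuned functional leads under the old weights and falls behind once the neglected benchmark carries comparable influence.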
Where Pith is reading between the lines
- The same error-based weighting idea could be tested on other large benchmark collections to check whether similar imbalances exist.
- Method developers could adopt WTMAD-4 as a default composite score when designing or comparing new density functionals.
- Composite metrics in computational chemistry may need periodic recalibration whenever the typical error profile of standard methods changes.
Load-bearing premise
The typical errors observed for the specific set of ten minimally empirical dispersion-corrected DFAs provide an appropriate and representative basis for defining weights that achieve fair treatment of all GMTKN55 component benchmarks.
What would settle it
Either of two outcomes would undercut the claim: a substantially different set of reference DFAs yields weights that change the relative importance of the GMTKN55 benchmarks by more than a factor of two, or re-ranking the 115 functionals with WTMAD-4 leaves the literature example unchanged in its relative performance.
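The first of these checks could be run along the following lines. This is a sketch under stated assumptions: relative importance is taken as each benchmark's share of the summed inverse typical errors, and the two reference sets and their typical errors are hypothetical.

```python
def relative_shares(typical_errors):
    """Normalise inverse typical errors so the shares sum to 1."""
    inverse = {name: 1.0 / err for name, err in typical_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def share_change_factors(typical_a, typical_b):
    """Factor (>= 1) by which each benchmark's share changes between two sets."""
    shares_a = relative_shares(typical_a)
    shares_b = relative_shares(typical_b)
    factors = {}
    for name in shares_a:
        ratio = shares_b[name] / shares_a[name]
        factors[name] = max(ratio, 1.0 / ratio)
    return factors


# Hypothetical typical errors (kcal/mol) from two different reference sets.
set_a = {"bench_A": 5.0, "bench_B": 0.5, "bench_C": 1.0}
set_b = {"bench_A": 5.0, "bench_B": 2.0, "bench_C": 1.0}

factors = share_change_factors(set_a, set_b)
flagged = sorted(name for name, f in factors.items() if f > 2.0)
print(factors, flagged)
```

Because the shares are normalised, a change in one benchmark's typical error also moves the shares of the others, which is why every benchmark is reported rather than only the one whose error changed.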
Original abstract
The GMTKN55 data set is a collection of standard benchmarks used in molecular quantum chemistry that spans small- and large-molecule thermochemistry, reaction barriers, and non-covalent interactions. Herein, we identify a flaw in the weighted mean absolute deviation (WTMAD) definitions commonly used to quantify performance of various electronic-structure methods for the GMTKN55 set, which under-weight some of its component benchmarks by orders of magnitude. A new WTMAD-4 metric is proposed, based on typical errors observed for a set of ten minimally empirical dispersion-corrected density-functional approximations (DFAs), ensuring fair treatment across all benchmarks. The performance of 115 DFAs is then reassessed using WTMAD-4 and we highlight a literature example where a DFA parametrised by minimising WTMAD-2 underperforms for benchmarks marginalised by that metric.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a flaw in existing WTMAD definitions for the GMTKN55 benchmark set, which under-weight certain component benchmarks by orders of magnitude. It proposes a new WTMAD-4 metric whose benchmark weights are set inversely proportional to the typical errors observed across a fixed set of ten minimally empirical dispersion-corrected DFAs. The performance of 115 DFAs is then reassessed with WTMAD-4, and a literature example is highlighted in which a DFA parametrized by minimizing WTMAD-2 underperforms on benchmarks that were marginalized by the earlier weighting.
Significance. If the central construction holds, WTMAD-4 would provide a more balanced ranking of density functionals on GMTKN55 by ensuring that benchmarks with intrinsically larger errors (e.g., certain reaction barriers or non-covalent interactions) receive appropriate weight. The reassessment of 115 methods supplies a substantial body of comparative data, and the explicit demonstration of a practical consequence for a previously published parametrization strengthens the case for adoption. The approach is grounded in observed error statistics rather than arbitrary scaling factors.
major comments (2)
- [§3] §3 (Definition of WTMAD-4): the weights are constructed from the mean absolute errors of a hand-selected reference set of exactly ten minimally empirical dispersion-corrected DFAs. No sensitivity test is reported that quantifies how the resulting weights, or the final ordering of the 115 evaluated DFAs, change when the reference set is replaced by another plausible collection (e.g., including range-separated hybrids). Because the claim of “fair treatment across all benchmarks” rests on the representativeness of these ten error profiles, the absence of such a test is load-bearing.
- [Results] Results section (reassessment of 115 DFAs): while aggregate WTMAD-4 values are tabulated, the manuscript provides no uncertainty estimates or bootstrap-style error bars on the weights themselves. Small differences in reported WTMAD-4 scores between closely ranked functionals could therefore be statistically insignificant, weakening the interpretation of the ranking shifts relative to WTMAD-2.
minor comments (2)
- [Abstract] The abstract refers to “a literature example” without naming the functional or the original reference; adding the specific DFA and citation would improve immediate clarity.
- [§2] Notation for the ten reference DFAs is introduced without a compact table listing their exact names, dispersion corrections, and basis sets; a short summary table would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments identify valid opportunities to strengthen the robustness of WTMAD-4. We address each major comment below and will incorporate the suggested analyses in the revised manuscript.
Point-by-point responses
- Referee: [§3] §3 (Definition of WTMAD-4): the weights are constructed from the mean absolute errors of a hand-selected reference set of exactly ten minimally empirical dispersion-corrected DFAs. No sensitivity test is reported that quantifies how the resulting weights, or the final ordering of the 115 evaluated DFAs, change when the reference set is replaced by another plausible collection (e.g., including range-separated hybrids). Because the claim of “fair treatment across all benchmarks” rests on the representativeness of these ten error profiles, the absence of such a test is load-bearing.
  Authors: We selected the ten minimally empirical dispersion-corrected DFAs because they span a range of functional forms while avoiding heavy empirical parametrization that could bias the error profiles toward particular benchmarks. This choice was intended to provide a balanced, representative sampling of typical errors. Nevertheless, we agree that an explicit sensitivity test would reinforce the claim of fair treatment. In the revised manuscript we will add a new subsection (or appendix) that repeats the weight derivation using an alternative reference set that includes range-separated hybrids and report the resulting changes (or lack thereof) in benchmark weights and the ordering of the 115 DFAs.
  Revision: yes
- Referee: [Results] Results section (reassessment of 115 DFAs): while aggregate WTMAD-4 values are tabulated, the manuscript provides no uncertainty estimates or bootstrap-style error bars on the weights themselves. Small differences in reported WTMAD-4 scores between closely ranked functionals could therefore be statistically insignificant, weakening the interpretation of the ranking shifts relative to WTMAD-2.
  Authors: We concur that uncertainty quantification would improve the interpretation of small ranking differences. We will perform a bootstrap resampling analysis on the underlying MAE values used to derive the weights, propagate the resulting variability to the WTMAD-4 scores, and include error bars (or confidence intervals) in the revised tables and figures. This will allow readers to assess whether observed differences between closely ranked functionals are statistically meaningful.
  Revision: yes
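The bootstrap analysis promised in the second response might look roughly like the sketch below. It is an illustration, not the authors' procedure: it assumes the weight of a benchmark is the inverse of the mean MAE over the reference DFAs, resamples those DFAs with replacement, and reports a percentile interval. The ten MAE values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_weight_interval(reference_maes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for weight = 1 / mean(reference MAEs).

    Assumption: the reference DFAs are resampled with replacement, and the
    weight is recomputed for every resample.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    weights = []
    for _ in range(n_resamples):
        sample = rng.choices(reference_maes, k=len(reference_maes))
        weights.append(1.0 / mean(sample))
    weights.sort()
    lower = weights[int((alpha / 2) * n_resamples)]
    upper = weights[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical MAEs (kcal/mol) of ten reference DFAs on one benchmark.
reference_maes = [0.8, 1.0, 1.2, 0.9, 1.1, 1.3, 0.7, 1.0, 1.1, 0.9]
point_estimate = 1.0 / mean(reference_maes)
lower, upper = bootstrap_weight_interval(reference_maes)
print(point_estimate, lower, upper)
```

Propagating such intervals to the composite scores would then show whether two closely ranked functionals are separated by more than the uncertainty in the weights.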
Circularity Check
No significant circularity in WTMAD-4 derivation
Full rationale
The paper defines WTMAD-4 weights from typical errors observed on a fixed external reference set of ten minimally empirical dispersion-corrected DFAs. This reference set is chosen independently of the 115 DFAs later reassessed, and the provided abstract and description contain no equations or steps in which the new metric or benchmark weights reduce by construction to a fit or prediction performed on the target evaluation set itself. The construction is a definitional reweighting scheme grounded in external benchmarks rather than a self-referential loop, satisfying the criteria for a self-contained derivation against external data.
Axiom & Free-Parameter Ledger
free parameters (1)
- Choice of ten minimally empirical DFAs
axioms (1)
- domain assumption: GMTKN55 component benchmarks should receive weights inversely related to typical DFA errors to achieve fair treatment.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "WTMAD-4 weights are chosen such that each benchmark contributes roughly evenly (between 1% and 3%) to the overall WTMAD-4... based on the magnitudes of expected errors rather than on the absolute energy scales."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.