Training ML Models with Predictable Failures
Pith reviewed 2026-05-20 20:22 UTC · model grok-4.3
The pith
A forecastability loss fine-tunes ML models so that failure rates predicted from small evaluation sets remain accurate even when rare high-failure modes appear at deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We give a finite-k decomposition of the forecast error of the top-k extrapolation estimator of Jones et al. (2025) and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode.
What carries the argument
The forecastability loss, a fine-tuning objective that trains models to emit failure scores whose extrapolation from the top k values is robust to missing rare high-failure modes.
If this is right
- The estimator tends to over-predict failures except when rare modes are missed, producing a safety-favorable bias in most cases.
- The forecastability loss reduces held-out forecast error without degrading primary-task capability.
- Safety performance after fine-tuning remains comparable to that of supervised baselines.
- The finite-k decomposition isolates the contribution of missed rare modes to forecast error.
Where Pith is reading between the lines
- The same loss could be tested with other k-based or tail-extrapolation estimators beyond the one analyzed here.
- If the loss generalizes, practitioners could rely on smaller evaluation sets for safety assessment in additional domains.
- The approach highlights a trade-off between making failures predictable and maintaining task performance that may appear in other safety fine-tuning methods.
Load-bearing premise
The primary source of under-prediction arises specifically when the evaluation set misses a rare high-failure mode that is present in the deployment distribution.
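A toy calculation can illustrate this premise. The sketch below is not taken from the paper: it assumes a main failure mode with exponential-tailed scores, a hypothetical rare mode (probability `eps` per query, scores shifted upward by `shift`), and an idealized evaluation set that contains only main-mode scores placed at exact quantiles, so any forecast gap comes from the missed mode rather than from sampling noise. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def tail_fit(scores, k):
    """Least-squares fit of log S(tau) ~ slope * tau + intercept on the k largest scores."""
    scores = np.asarray(scores, dtype=float)
    top = np.sort(scores)[::-1][:k]                      # k largest, descending
    # Assumed plotting positions: the i-th largest score has survival ~ i / (m + 1).
    log_surv = np.log(np.arange(1, k + 1) / (len(scores) + 1))
    slope, intercept = np.polyfit(top, log_surv, 1)
    return slope, intercept

def true_exceedance(tau, eps, shift):
    """Per-query P(score > tau) under the hypothetical two-mode deployment mixture."""
    main = np.exp(-tau)                                  # main mode: Exp(1) tail
    rare = min(1.0, np.exp(-(tau - shift)))              # rare mode: shift + Exp(1)
    return (1.0 - eps) * main + eps * rare

# Idealized evaluation set: main-mode scores at exact quantiles, rare mode absent.
m, k = 1000, 50
eval_scores = -np.log(np.arange(1, m + 1) / (m + 1))
slope, intercept = tail_fit(eval_scores, k)

# Hypothetical deployment threshold and rare-mode parameters.
tau, eps, shift = 12.0, 1e-4, 10.0
forecast = float(np.exp(slope * tau + intercept))        # extrapolated from eval set
actual = true_exceedance(tau, eps, shift)                # includes the rare mode
ratio = forecast / actual                                # < 1 means under-prediction

print(f"forecast={forecast:.3e}  actual={actual:.3e}  ratio={ratio:.2f}")
```

With `eps = 1e-4` and 1000 evaluation samples, an evaluation set would usually contain no rare-mode samples at all, which is the regime the premise describes.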
What would settle it
If applying the forecastability loss during fine-tuning fails to reduce held-out forecast error in the password-game or gridworld experiments while preserving primary-task performance, the claim that the loss corrects the identified under-prediction mode would be falsified.
Original abstract
Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
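To make the extrapolation step concrete, here is a minimal sketch. It is not the estimator of Jones et al. (2025) itself: it assumes only that the log-survival function of failure scores is roughly linear in the tail, log S(τ) ≈ aτ + b, fits a and b by ordinary least squares on the k largest evaluation scores, and turns the fit into a deployment-scale forecast. The function names, the plotting positions i/(m+1), and the independence of deployment queries are illustrative assumptions.

```python
import numpy as np

def fit_log_survival_tail(scores, k):
    """Fit log S(tau) ~ a * tau + b by least squares on the k largest scores."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]   # descending order
    m = len(s)
    top = s[:k]
    # Assumed plotting positions: the i-th largest score has survival ~ i / (m + 1).
    log_surv = np.log(np.arange(1, k + 1) / (m + 1))
    a, b = np.polyfit(top, log_surv, 1)                  # slope, intercept
    return a, b

def forecast_exceedance(a, b, tau):
    """Extrapolated per-query probability that a failure score exceeds tau."""
    return float(min(1.0, np.exp(a * tau + b)))

def forecast_any_failure(a, b, tau, n_deploy):
    """Probability of at least one exceedance in n_deploy independent queries."""
    p = forecast_exceedance(a, b, tau)
    return 1.0 - (1.0 - p) ** n_deploy

# Demo on synthetic scores (an exponential tail, chosen only for illustration).
rng = np.random.default_rng(0)
demo_scores = rng.exponential(size=2000)
a_demo, b_demo = fit_log_survival_tail(demo_scores, k=100)
print("fitted slope and intercept:", a_demo, b_demo)
print("forecast P(any score > 12) over 1e6 queries:",
      forecast_any_failure(a_demo, b_demo, 12.0, 10**6))
```

The choice of k trades variance (small k) against contamination from the non-tail bulk (large k); the sketch leaves that choice to the caller.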
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the extrapolation estimator from Jones et al. (2025) for predicting ML model failure rates at deployment scale using the largest k failure scores from a small evaluation set. It provides a finite-k decomposition of the estimator's forecast error, showing a built-in bias toward over-prediction in typical cases (safety-favorable). This bias is offset when the evaluation set misses a rare high-failure mode present in the deployment set, resulting in under-prediction. The authors introduce a forecastability loss as a fine-tuning objective to address this specific failure mode. In two proof-of-concept experiments (a language-model password game and an RL gridworld), fine-tuning with this loss substantially reduces held-out forecast error while preserving primary-task capability and achieving safety levels similar to supervised baselines.
Significance. If the finite-k decomposition is valid and the forecastability loss reliably mitigates the identified under-prediction without degrading other properties, the work could meaningfully advance pre-deployment safety assessment for ML systems by making failure-rate forecasts more robust to distribution shifts in rare events. The explicit identification of a safety-favorable bias direction and its offset condition, combined with reproducible experiments in controlled domains, strengthens the contribution; however, the practical impact depends on whether the decomposition generalizes beyond the assumed tail behavior.
major comments (1)
- [finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.
minor comments (2)
- [Experiments] The abstract and experiments section refer to 'held-out forecast error' and 'safety similar to that of supervised baselines,' but the precise definitions of these metrics (e.g., how forecast error is computed and what safety metric is used) should be stated explicitly with equations or pseudocode for reproducibility.
- [Method] The paper introduces the 'forecastability loss' as a new objective; a short comparison table showing its form relative to standard losses (e.g., cross-entropy or the original extrapolation objective) would clarify the modification.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's focus on the assumptions underlying the finite-k decomposition and address the major comment point by point below.
Referee: [finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.
Authors: We agree that the finite-k decomposition is derived under the explicit modeling assumption that top-k failure scores follow order statistics from a fixed single-mode tail distribution preserved between evaluation and deployment sets, as stated in the theoretical section. This assumption enables the closed-form identification of the over-prediction bias in the typical case and the precise offset condition when a rare high-failure mode is missed. We acknowledge that the bias-offset claim does not follow directly under additional variation such as multiple modes or correlated shifts, which would require extensions to the decomposition. The manuscript's experiments are conducted in controlled domains where the single-mode tail approximately holds, serving as proof-of-concept. In the revision we add a dedicated paragraph in the discussion section clarifying the scope of the assumption, noting that the safety-favorable bias direction provides insight even if the exact offset does not generalize, and outlining future work on multi-modal extensions. The forecastability loss is motivated by the identified failure mode and empirically reduces forecast error in the reported settings.
Revision: partial
Circularity Check
No circularity: finite-k decomposition and new loss are independent of fitted inputs
The paper derives a finite-k decomposition of forecast error for the Jones et al. (2025) extrapolation estimator, identifies a typical over-prediction bias and its offset under missed rare modes, then introduces the forecastability loss as a targeted fine-tuning objective. This chain relies on explicit mathematical decomposition of order statistics and experimental validation rather than self-definition, renaming of known results, or load-bearing self-citations. The central claims do not reduce to parameters defined by the paper's own fits or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Failure scores in evaluation sets can be extrapolated to deployment scale using the largest k values.
invented entities (1)
- forecastability loss: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: We give a finite-k decomposition of this estimator's forecast error... curvature term Cθ ∝ −q''θ(yM), occupancy term Gθ
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Gumbel-tail method assumes... log S(τ) ≈ aτ + b
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In Proceedings of the 12th International Conference on Learning Representations, 2024
- [2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics, 19(2):1641–1662, 2025. doi: 10.1214/24-AOAS1998
- [3] Nicolas Atienza, Christophe Labreuche, Johanne Cohen, and Michèle Sebag. Provably safeguarding a classifier from OOD and adversarial samples: An extreme value theory approach, 2025
- [4] Yuan-Lu Bai, Zhi-Yuan Huang, Henry Lam, and Ding Zhao. Black-box rare-event simulation for safety testing of AI agents: An overview. Journal of the Operations Research Society of China, 13:750–774, 2025. doi: 10.1007/s40305-025-00585-0
- [5] Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, and Geoffrey Irving. An alignment safety case sketch based on debate, 2025
- [6] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024
- [7] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, 2015
- [8] Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced AI systems, 2024
- [9] Anthony Corso, Kyu-Young Kim, Shubh Gupta, Grace Gao, and Mykel J. Kochenderfer. A deep reinforcement learning approach to rare event estimation, 2022. URL https://arxiv.org/abs/2211.12470
- [10] Parisa Davar, Frédéric Godin, and Jose Garrido. Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients. The Journal of Finance and Data Science, 11:100165. doi: 10.1016/j.jfds.2025.100165. URL https://www.sciencedirect.com/science/article/pii/S2405918825000170
- [12] Shripad Vilasrao Deshmukh, Will Schwarzer, and Scott Niekum. Evaluation-aware reinforcement learning, 2025
- [13] Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, and Juan P. Garrahan. Rare event analysis of large language models, 2026. URL https://arxiv.org/abs/2602.06791
- [14] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022
- [15] Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, and Xinbing Wang. Extreme value policy optimization for safe reinforcement learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18772–18793. PMLR, 2025. URL https://proceedings.m...
- [16] Irving I. Gringorten. A plotting rule for extreme probability paper. Journal of Geophysical Research, 68(3):813–814, 1963. doi: 10.1029/JZ068i003p00813
- [17] Josiah P. Hanna, Philip S. Thomas, Peter Stone, and Scott Niekum. Data-efficient policy evaluation through behavior policy search. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1394–1403. PMLR, 06–11 Aug 2017. URL https://proce...
- [18]
- [19] John B. Henry, III and Ping-Hung Hsieh. Extreme value analysis for partitioned insurance losses. Variance, 3(2):214–238, 2009
- [20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
- [21] Zhifeng Jiang, Zhihua Jin, and Guoliang He. PromptKeeper: Safeguarding system prompts for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2712–2728, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.147. URL https://aclanthology...
- [22] Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, and Mrinank Sharma. Forecasting rare language model behaviors, 2025
- [23] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long tasks, 2025
- [24] LDNOOBW Contributors. List of dirty, naughty, obscene, and otherwise bad words (English). https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words, 2023
- [25] Tianyu Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In Proceedings of the 9th International Conference on Learning Representations, 2021
- [26] Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6781–
- [27] METR. Autonomy evaluation resources. METR blog, March 2024. URL https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
- [28] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A closer look at system prompt robustness. URL https://arxiv.org/abs/2502.12197
- [30] Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, John C. Duchi, and Russ Tedrake. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Advances in Neural Information Processing Systems, volume 31, 2018
- [31] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448. Association for Computational Linguistics, 2022
- [32] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, et al. Evaluating frontier models for dangerous capabilities, 2024
- [33] Qwen Team. Qwen3 technical report, 2025. Model card: https://huggingface.co/Qwen/Qwen3-0.6B
- [34] David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O’Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, et al. HCAST: Human-calibrated autonomy software tasks, 2025
- [35] Jordan Richards and Raphaël Huser. Extreme quantile regression with deep learning, 2024
- [36] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the 8th International Conference on Learning Representations, 2020
- [37] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do Anything Now”: Characterizing and evaluating In-The-Wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2024. doi: 10.1145/3658644.3670388
- [38] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks, 2023
- [39] Karthik Somayaji N. S., Yu Wang, Malachi Schram, Ján Drgoňa, Mahantesh M. Halappanavar, Frank Liu, and Peng Li. Extreme risk mitigation in reinforcement learning using extreme value theory. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=098mb06uhA
- [40] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2993–2999, 2015
- [41] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game. In Proceedings of the 12th International Conference on Learning Representations, 2024
- [42] Jonathan Uesato, Ananya Kumar, Csaba Szepesvári, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy Dvijotham, Nicolas Heess, and Pushmeet Kohli. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. In Proceedings of the 7th International Conference on Learning Representations, 2019
- [43] Stefan Webb, Tom Rainforth, Yee Whye Teh, and M. Pawan Kumar. A statistical approach to assessing neural network robustness. In Proceedings of the 7th International Conference on Learning Representations, 2019
- [44] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In Proceedings of the 6th International Conference on Learning Representations, 2018
- [45] Jakob Benjamin Wessel, Maybritt Schillinger, Frank Kwasniok, and Sam Allen. Enforcing tail calibration when training probabilistic forecast models, 2025
- [46] Gabriel Wu and Jacob Hilton. Estimating the probabilities of rare outputs in language models. In Proceedings of the 13th International Conference on Learning Representations, 2025
- [47] Mark Xu. Estimating tail risk in neural networks. Alignment Research Center blog, September 2024. URL https://alignment.org/blog/estimating-tail-risk/. Blog post describing joint research with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu
- [48] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2405.01470
- [49] Antanas Žilinskas, Robert N. Shorten, and Jakub Mareček. EVEREST: An evidential, tail-aware transformer for rare-event time-series forecasting. In Proceedings of the 14th International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2601.19022
- [50] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023
A Experimental details
This appendix collects the specific configuration choices for the two canonical runs that produced Figures 3 and 4. ...
- [51] ... and hence why forecastability training has less potential improvement to offer. Future work should systematically examine how forecastability scales with model size.
G A finite-k decomposition for the inverse-OLS Gumbel-tail extrapolator
Headline result. We prove a finite-k decomposition of the forecast error of the inverse-OLS Gumbel-tail extrapolator (...