arxiv: 2605.15134 · v2 · pith:TZBVH2UCnew · submitted 2026-05-14 · 💻 cs.LG

Training ML Models with Predictable Failures

Pith reviewed 2026-05-20 20:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords failure rate estimation · ML safety · fine-tuning objectives · forecast error · rare failure modes · extrapolation · predictable failures

The pith

A forecastability loss fine-tunes ML models so that failure rates predicted from small evaluation sets remain accurate even when rare high-failure modes appear at deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to estimate the rate at which an ML model will fail when deployed at scale, using only a small evaluation set that may not contain the failures that matter most. It analyzes an existing extrapolation method based on the largest k failure scores and supplies a finite-k decomposition of the forecast error, showing a built-in tendency to over-predict failures in ordinary cases. This over-prediction is reversed to under-prediction precisely when the evaluation set omits a rare but severe failure mode that the larger deployment distribution contains. The authors introduce a new fine-tuning objective, the forecastability loss, that trains the model to produce failure scores whose extrapolation is less vulnerable to this omission. In two proof-of-concept settings, a language-model password game and an RL gridworld, the loss lowers held-out forecast error while leaving primary-task performance and overall safety intact.
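The top-k extrapolation idea can be illustrated with a minimal Python sketch. This is not the estimator of Jones et al. (2025) or the paper's exact procedure: it assumes a simple plotting position i/(m+1) for the i-th largest score, an ordinary least-squares line through log-survival versus score on the top k points, and inversion of that line at survival level 1/n_deploy. The function name `forecast_worst_score` and its parameters are hypothetical.

```python
import numpy as np


def forecast_worst_score(eval_scores, k, n_deploy):
    """Extrapolate the score expected at survival level 1/n_deploy
    from the k largest scores of an evaluation set (illustrative only)."""
    s = np.sort(np.asarray(eval_scores, dtype=float))[::-1]  # descending
    m = s.size
    top = s[:k]
    # Assumed plotting position: the i-th largest of m scores has
    # empirical survival probability i / (m + 1).
    log_surv = np.log(np.arange(1, k + 1) / (m + 1.0))
    # OLS fit of log-survival against score over the top-k points.
    slope, intercept = np.polyfit(top, log_surv, 1)
    # Invert the fitted line at the deployment-scale survival level.
    return (np.log(1.0 / n_deploy) - intercept) / slope


# Idealized check: exact Exp(1) quantiles have log-survival = -score,
# so the fitted line is exact and the forecast equals log(n_deploy).
m = 1000
exp_quantiles = -np.log(np.arange(1, m + 1) / (m + 1.0))
forecast = forecast_worst_score(exp_quantiles, k=10, n_deploy=10 * m)
print(forecast, np.log(10 * m))
```

On real scores the top-k points are noisy and the tail need not be linear under this transform, which is the regime the paper's finite-k decomposition is meant to describe.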

Core claim

We give a finite-k decomposition of the estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode.

What carries the argument

The forecastability loss, a fine-tuning objective that trains models to emit failure scores whose extrapolation from the top k values is robust to missing rare high-failure modes.

If this is right

  • The estimator tends to over-predict failures except when rare modes are missed, producing a safety-favorable bias in most cases.
  • The forecastability loss reduces held-out forecast error without degrading primary-task capability.
  • Safety performance after fine-tuning remains comparable to that of supervised baselines.
  • The finite-k decomposition isolates the contribution of missed rare modes to forecast error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss could be tested with other k-based or tail-extrapolation estimators beyond the one analyzed here.
  • If the loss generalizes, practitioners could rely on smaller evaluation sets for safety assessment in additional domains.
  • The approach highlights a trade-off between making failures predictable and maintaining task performance that may appear in other safety fine-tuning methods.

Load-bearing premise

The primary source of under-prediction arises specifically when the evaluation set misses a rare high-failure mode that is present in the deployment distribution.
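A short arithmetic sketch shows how such an omission can arise at plausible sizes. Assuming i.i.d. sampling and a failure mode with per-sample probability `eps` (all values below are hypothetical and not taken from the paper), an evaluation set of m samples contains no instance of the mode with probability (1 − eps)^m, while a deployment set of n samples contains at least one with probability 1 − (1 − eps)^n.

```python
import math


def mode_coverage(eps, m_eval, n_deploy):
    """Return (P[eval set misses the mode], P[deploy set contains it])
    under i.i.d. sampling with per-sample mode probability eps."""
    p_eval_misses = math.exp(m_eval * math.log1p(-eps))
    p_deploy_hits = 1.0 - math.exp(n_deploy * math.log1p(-eps))
    return p_eval_misses, p_deploy_hits


# Hypothetical sizes: a 1-in-10,000 mode, 1,000 evaluation samples,
# 100,000 deployment samples.
p_miss, p_hit = mode_coverage(eps=1e-4, m_eval=1_000, n_deploy=100_000)
print(f"eval misses mode: {p_miss:.3f}, deploy contains mode: {p_hit:.5f}")
```

With these illustrative numbers the evaluation set misses the mode roughly nine times in ten, while the deployment set almost surely contains it.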

What would settle it

If applying the forecastability loss during fine-tuning fails to reduce held-out forecast error in the password-game or gridworld experiments while preserving primary-task performance, the claim that the loss corrects the identified under-prediction mode would be falsified.

Figures

Figures reproduced from arXiv: 2605.15134 by Scott Niekum, Will Schwarzer.

Figure 1. Decomposition of forecast error on canonical distributions. (a) Error decomposition across six distributions at k = 10, R = 10; diamonds are empirical means, and Mixture is an Exp(1) bulk plus a rare shifted-Exp(1) component. The rank bar (blue) is constant, the curvature bar (orange) tracks the sign of −q″_θ(y) (positive for lighter-than-exponential tails, negative for heavier), the occupancy bar (red)…
Figure 2. Tail-shape change under fine-tuning, illustrative. Empirical log-survival of one held-out password’s deploy-set scores (n = 1998) plotted against the transformed score ψ = −log(−log p) of Section 3.1, before (left) and after (right) forecastability training. The dashed line is the OLS extrapolation fit to the fit-set top-10 scores; the open circle marks the line’s predicted worst-rank deploy score at log…
Figure 3. Three-axis comparison on the language-model password game. Error bars are seed-level standard errors over n = 10 seeds per condition. The three panels report, respectively, absolute WildChat single-token KL divergence (lower is better; the pretrained baseline has KL = 0 and is not drawn); decades of leak-probability reduction at the worst-rank held-out prompt over the pretrained baseline (higher is better; …
Figure 4. Three-axis comparison on the gridworld setting. Bars show mean fold improvement over the pretrained baseline (dashed line at 1×); error bars are seed-level standard errors over n = 30 seeds per condition; higher is better in every panel. Capability preservation uses the policy’s mean return on the pre-training task pool; safety uses worst-case regret on the held-out deployment set; forecast precision uses …
Figure 5. Two-axis comparison on the 8B password game. Same conventions as …
Figure 6. Rank-term coefficient, confirmed by simulation. Even on a perfectly linear tail-quantile curve (Exp(1), so q(y) = y, q′(y) = 1, q″ = 0, isolating term (a) of the decomposition), the inverse-OLS estimator with k = 10 has the predicted non-vanishing finite-k rank bias. We run the actual estimator on m = 10,000 evaluation samples per trial and average over one million independent trials per R. Left: empir…
Figure 7. WildChat hazard diagnostics with reference distributions. Top four rows: WildChat scores (length and mean NLL raw; harmful log-prob and Detoxify after Gumbel-prob). Bottom three rows: n = 200,000 samples from Exp(1), Pareto(α = 3), Uniform(0, 1) (Gumbel, Fréchet, reverse-Weibull). Rising hazard on length, harmful, and Detoxify matches the Uniform reference; mean NLL rises slowly.
Figure 8. Empirical decomposition at (M, N, k, R) = (5,000, 50,000, 10, 10). The blue rank bar is identical at 0.794 q′(y) across all columns by construction; curvature (orange) and residual (gray) vary by score. Empirical mean errors (black diamonds) match the algebraic sum of the colored bars to within Monte Carlo error. Total NLL is the only score on which the empirical mean is negative; the four body scores ha…
Figure 9. k-sensitivity at fixed R = 10. Empirical mean error in q′(y) units against k, on assistant length (blue) and per-token mean NLL (green), alongside the theoretical ξ(k, R = 10) ridge from a synthetic Exp(1) Monte Carlo (black). Length matches the theoretical ridge through the predicted sign-flip near k = 100; per-token mean NLL diverges upward as the curvature term grows with k.
Original abstract

Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes the extrapolation estimator of Jones et al. (2025), which predicts ML model failure rates at deployment scale from the largest k failure scores in a small evaluation set. It provides a finite-k decomposition of the estimator's forecast error, showing a built-in bias toward over-prediction in typical cases (the safety-favorable direction). This bias is offset when the evaluation set misses a rare high-failure mode present in the deployment set, resulting in under-prediction. The authors introduce a forecastability loss as a fine-tuning objective to address this specific failure mode. In two proof-of-concept experiments (a language-model password game and an RL gridworld), fine-tuning with this loss substantially reduces held-out forecast error while preserving primary-task capability and achieving safety levels similar to supervised baselines.

Significance. If the finite-k decomposition is valid and the forecastability loss reliably mitigates the identified under-prediction without degrading other properties, the work could meaningfully advance pre-deployment safety assessment for ML systems by making failure-rate forecasts more robust to distribution shifts in rare events. The explicit identification of a safety-favorable bias direction and its offset condition, combined with reproducible experiments in controlled domains, strengthens the contribution; however, the practical impact depends on whether the decomposition generalizes beyond the assumed tail behavior.

major comments (1)
  1. [finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.
minor comments (2)
  1. [Experiments] The abstract and experiments section refer to 'held-out forecast error' and 'safety similar to that of supervised baselines,' but the precise definitions of these metrics (e.g., how forecast error is computed and what safety metric is used) should be stated explicitly with equations or pseudocode for reproducibility.
  2. [Method] The paper introduces the 'forecastability loss' as a new objective; a short comparison table showing its form relative to standard losses (e.g., cross-entropy or the original extrapolation objective) would clarify the modification.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's focus on the assumptions underlying the finite-k decomposition and address the major comment point by point below.

  1. Referee: [finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.

    Authors: We agree that the finite-k decomposition is derived under the explicit modeling assumption that top-k failure scores follow order statistics from a fixed single-mode tail distribution preserved between evaluation and deployment sets, as stated in the theoretical section. This assumption enables the closed-form identification of the over-prediction bias in the typical case and the precise offset condition when a rare high-failure mode is missed. We acknowledge that the bias-offset claim does not follow directly under additional variation such as multiple modes or correlated shifts, which would require extensions to the decomposition. The manuscript's experiments are conducted in controlled domains where the single-mode tail approximately holds, serving as proof-of-concept. In the revision we add a dedicated paragraph in the discussion section clarifying the scope of the assumption, noting that the safety-favorable bias direction provides insight even if the exact offset does not generalize, and outlining future work on multi-modal extensions. The forecastability loss is motivated by the identified failure mode and empirically reduces forecast error in the reported settings.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: finite-k decomposition and new loss are independent of fitted inputs

The paper derives a finite-k decomposition of forecast error for the Jones et al. (2025) extrapolation estimator, identifies a typical over-prediction bias and its offset under missed rare modes, then introduces the forecastability loss as a targeted fine-tuning objective. This chain relies on explicit mathematical decomposition of order statistics and experimental validation rather than self-definition, renaming of known results, or load-bearing self-citations. The central claims do not reduce to parameters defined by the paper's own fits or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract; the estimator from prior work is taken as given and the new loss is introduced without visible free parameters or additional axioms in the summary.

axioms (1)
  • domain assumption: Failure scores in evaluation sets can be extrapolated to deployment scale using the largest k values.
    Implicit in the use and analysis of the Jones et al. (2025) estimator referenced in the abstract.
invented entities (1)
  • forecastability loss (no independent evidence)
    purpose: Fine-tuning objective to reduce under-prediction when rare high-failure modes are missed in evaluation.
    New objective proposed to address the identified failure mode of the extrapolation estimator.

pith-pipeline@v0.9.0 · 5691 in / 1277 out tokens · 82158 ms · 2026-05-20T20:22:56.924135+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 1 internal anchor

  1. [1] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In Proceedings of the 12th International Conference on Learning Representations, 2024.

  2. [2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics, 19(2):1641–1662, 2025. doi: 10.1214/24-AOAS1998

  3. [3] Nicolas Atienza, Christophe Labreuche, Johanne Cohen, and Michèle Sebag. Provably safeguarding a classifier from OOD and adversarial samples: An extreme value theory approach, 2025.

  4. [4] Yuan-Lu Bai, Zhi-Yuan Huang, Henry Lam, and Ding Zhao. Black-box rare-event simulation for safety testing of AI agents: An overview. Journal of the Operations Research Society of China, 13:750–774, 2025. doi: 10.1007/s40305-025-00585-0

  5. [5] Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, and Geoffrey Irving. An alignment safety case sketch based on debate, 2025.

  6. [6] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024.

  7. [7] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, 2015.

  8. [8] Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced AI systems, 2024.

  9. [9] Anthony Corso, Kyu-Young Kim, Shubh Gupta, Grace Gao, and Mykel J. Kochenderfer. A deep reinforcement learning approach to rare event estimation, 2022. URL https://arxiv.org/abs/2211.12470

  10. [10] Parisa Davar, Frédéric Godin, and Jose Garrido. Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients. The Journal of Finance and Data Science, 11:100165,

  11. [11] doi: 10.1016/j.jfds.2025.100165. URL https://www.sciencedirect.com/science/article/pii/S2405918825000170

  12. [12] Shripad Vilasrao Deshmukh, Will Schwarzer, and Scott Niekum. Evaluation-aware reinforcement learning, 2025.

  13. [13] Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, and Juan P. Garrahan. Rare event analysis of large language models, 2026. URL https://arxiv.org/abs/2602.06791

  14. [14] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.

  15. [15] Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, and Xinbing Wang. Extreme value policy optimization for safe reinforcement learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18772–18793. PMLR, 2025. URL https://proceedings.m...

  16. [16] Irving I. Gringorten. A plotting rule for extreme probability paper. Journal of Geophysical Research, 68(3):813–814, 1963. doi: 10.1029/JZ068i003p00813

  17. [17] Josiah P. Hanna, Philip S. Thomas, Peter Stone, and Scott Niekum. Data-efficient policy evaluation through behavior policy search. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1394–1403. PMLR, 06–11 Aug 2017. URL https://proce...

  18. [18] Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020.

  19. [19] John B. Henry, III and Ping-Hung Hsieh. Extreme value analysis for partitioned insurance losses. Variance, 3(2):214–238, 2009.

  20. [20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  21. [21] Zhifeng Jiang, Zhihua Jin, and Guoliang He. PromptKeeper: Safeguarding system prompts for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2712–2728, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.147. URL https://aclanthology...

  22. [22] Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, and Mrinank Sharma. Forecasting rare language model behaviors, 2025.

  23. [23] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long tasks, 2025.

  24. [24] LDNOOBW Contributors. List of dirty, naughty, obscene, and otherwise bad words (English). https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words, 2023.

  25. [25] Tianyu Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  26. [26] Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6781–

  27. [27] METR. Autonomy evaluation resources. METR blog, March 2024. URL https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/

  28. [28] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A closer look at system prompt robustness,

  29. [29] URL https://arxiv.org/abs/2502.12197

  30. [30] Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, John C. Duchi, and Russ Tedrake. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Advances in Neural Information Processing Systems, volume 31, 2018.

  31. [31] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448. Association for Computational Linguistics, 2022.

  32. [32] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, et al. Evaluating frontier models for dangerous capabilities, 2024.

  33. [33] Qwen Team. Qwen3 technical report, 2025. Model card: https://huggingface.co/Qwen/Qwen3-0.6B

  34. [34] David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O’Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, et al. HCAST: Human-calibrated autonomy software tasks, 2025.

  35. [35] Jordan Richards and Raphaël Huser. Extreme quantile regression with deep learning, 2024.

  36. [36] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the 8th International Conference on Learning Representations, 2020.

  37. [37] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do Anything Now”: Characterizing and evaluating In-The-Wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2024. doi: 10.1145/3658644.3670388

  38. [38] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks, 2023.

  39. [39] Karthik Somayaji N. S., Yu Wang, Malachi Schram, Ján Drgoňa, Mahantesh M. Halappanavar, Frank Liu, and Peng Li. Extreme risk mitigation in reinforcement learning using extreme value theory. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=098mb06uhA

  40. [40] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2993–2999, 2015.

  41. [41] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game. In Proceedings of the 12th International Conference on Learning Representations, 2024.

  42. [42] Jonathan Uesato, Ananya Kumar, Csaba Szepesvári, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy Dvijotham, Nicolas Heess, and Pushmeet Kohli. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. In Proceedings of the 7th International Conference on Learning Representations, 2019.

  43. [43] Stefan Webb, Tom Rainforth, Yee Whye Teh, and M. Pawan Kumar. A statistical approach to assessing neural network robustness. In Proceedings of the 7th International Conference on Learning Representations, 2019.

  44. [44] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In Proceedings of the 6th International Conference on Learning Representations, 2018.

  45. [45] Jakob Benjamin Wessel, Maybritt Schillinger, Frank Kwasniok, and Sam Allen. Enforcing tail calibration when training probabilistic forecast models, 2025.

  46. [46] Gabriel Wu and Jacob Hilton. Estimating the probabilities of rare outputs in language models. In Proceedings of the 13th International Conference on Learning Representations, 2025.

  47. [47] Mark Xu. Estimating tail risk in neural networks. Alignment Research Center blog, September 2024. URL https://alignment.org/blog/estimating-tail-risk/. Blog post describing joint research with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu.

  48. [48] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2405.01470

  49. [49] Antanas Žilinskas, Robert N. Shorten, and Jakub Mareček. EVEREST: An evidential, tail-aware transformer for rare-event time-series forecasting. In Proceedings of the 14th International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2601.19022

  50. [50] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. A Experimental details. This appendix collects the specific configuration choices for the two canonical runs that produced Figures 3 and 4. ...

  51. [51] – and hence why forecastability training has less potential improvement to offer. Future work should systematically examine how forecastability scales with model size. G A finite-k decomposition for the inverse-OLS Gumbel-tail extrapolator. Headline result. We prove a finite-k decomposition of the forecast error of the inverse-OLS Gumbel-tail extrapolator (...