Training ML Models with Predictable Failures

Scott Niekum; Will Schwarzer

arxiv: 2605.15134 · v2 · pith:TZBVH2UCnew · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-05-20 20:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords failure rate estimation · ML safety · fine-tuning objectives · forecast error · rare failure modes · extrapolation · predictable failures

The pith

A forecastability loss fine-tunes ML models so that failure rates predicted from small evaluation sets remain accurate even when rare high-failure modes appear at deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to estimate the rate at which an ML model will fail when deployed at scale, using only a small evaluation set that may not contain the failures that matter most. It analyzes an existing extrapolation method based on the largest k failure scores and supplies a finite-k decomposition of the forecast error, showing a built-in tendency to over-predict failures in ordinary cases. This over-prediction is reversed to under-prediction precisely when the evaluation set omits a rare but severe failure mode that the larger deployment distribution contains. The authors introduce a new fine-tuning objective, the forecastability loss, that trains the model to produce failure scores whose extrapolation is less vulnerable to this omission. In two proof-of-concept settings, a language-model password game and an RL gridworld, the loss lowers held-out forecast error while leaving primary-task performance and overall safety intact.
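The extrapolation step described above can be sketched in a few lines. This is a minimal illustration, assuming a plain OLS fit of the top-k scores against negative log empirical survival with plotting position rank/m; the function name `forecast_worst_score` and that plotting position are assumptions made here, and the estimator the paper analyzes (Jones et al., 2025) may differ in detail.

```python
import numpy as np

def forecast_worst_score(eval_scores, k, n_deploy):
    """Extrapolate the top-k evaluation scores to deployment scale.

    Hypothetical simplification: fit score = intercept + slope * y by OLS,
    where y = -log(rank / m) is the negative log empirical survival of each
    top-k score, then read the line at y = log(n_deploy), i.e. the level
    exceeded about once in n_deploy samples.
    """
    scores = np.sort(np.asarray(eval_scores, dtype=float))[::-1]  # descending
    m = scores.size
    top = scores[:k]
    ranks = np.arange(1, k + 1)          # rank 1 is the largest score
    y = -np.log(ranks / m)               # negative log empirical survival
    slope, intercept = np.polyfit(y, top, 1)
    return intercept + slope * np.log(n_deploy)

# Usage: 1,000 evaluation scores, forecast for a 10x larger deployment set.
rng = np.random.default_rng(0)
evaluation = rng.exponential(1.0, size=1_000)
forecast = forecast_worst_score(evaluation, k=10, n_deploy=10_000)
```

Only the k largest scores enter the fit, so the forecast depends entirely on how well the observed tail represents the deployment tail.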

Core claim

We give a finite-k decomposition of the estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode.

What carries the argument

The forecastability loss, a fine-tuning objective that trains models to emit failure scores whose extrapolation from the top k values is robust to missing rare high-failure modes.

If this is right

The estimator tends to over-predict failures except when rare modes are missed, producing a safety-favorable bias in most cases.
The forecastability loss reduces held-out forecast error without degrading primary-task capability.
Safety performance after fine-tuning remains comparable to that of supervised baselines.
The finite-k decomposition isolates the contribution of missed rare modes to forecast error.
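The over-prediction tendency in the first point can be probed with a small Monte Carlo on an Exp(1) tail, where the true quantile at survival 1/n is log n. The sketch below reuses a simplified direct-OLS estimator (an assumption; the paper analyzes an inverse-OLS variant), and the trial counts and names are illustrative. In this simplified setup the mean signed error should come out positive, which is the over-prediction direction.

```python
import numpy as np

def forecast_worst_score(eval_scores, k, n_deploy):
    # Simplified direct-OLS tail extrapolation (hypothetical stand-in
    # for the estimator analyzed in the paper).
    scores = np.sort(np.asarray(eval_scores, dtype=float))[::-1]
    m = scores.size
    y = -np.log(np.arange(1, k + 1) / m)
    slope, intercept = np.polyfit(y, scores[:k], 1)
    return intercept + slope * np.log(n_deploy)

def mean_forecast_error(m=1000, k=10, ratio=10, trials=2000, seed=0):
    """Mean signed error (forecast minus true quantile) on Exp(1) data."""
    rng = np.random.default_rng(seed)
    n_deploy = m * ratio
    target = np.log(n_deploy)  # Exp(1) quantile at survival 1 / n_deploy
    errors = np.empty(trials)
    for t in range(trials):
        scores = rng.exponential(1.0, size=m)
        errors[t] = forecast_worst_score(scores, k, n_deploy) - target
    return errors.mean()

bias = mean_forecast_error()
print(f"mean signed forecast error: {bias:.3f}")
```

A positive value here reflects only the single-mode case; it says nothing about the missed-mode scenario, which needs a mixture distribution.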

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same loss could be tested with other k-based or tail-extrapolation estimators beyond the one analyzed here.
If the loss generalizes, practitioners could rely on smaller evaluation sets for safety assessment in additional domains.
The approach highlights a trade-off between making failures predictable and maintaining task performance that may appear in other safety fine-tuning methods.

Load-bearing premise

The primary source of under-prediction arises specifically when the evaluation set misses a rare high-failure mode that is present in the deployment distribution.
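How easily an evaluation set can miss such a mode follows from simple arithmetic. The sketch below assumes i.i.d. sampling and uses hypothetical numbers (a mode with mass 1e-4, 1,000 evaluation samples, 100,000 deployment samples) that do not come from the paper.

```python
def miss_probability(p_rare, m):
    """Probability that m i.i.d. draws contain no sample from a mode of mass p_rare."""
    return (1.0 - p_rare) ** m

# Hypothetical sizes, chosen only for illustration.
p_rare = 1e-4
m_eval, n_deploy = 1_000, 100_000

p_miss_eval = miss_probability(p_rare, m_eval)            # evaluation never sees the mode
p_hit_deploy = 1.0 - miss_probability(p_rare, n_deploy)   # deployment sees it at least once

print(f"P(eval misses mode)  = {p_miss_eval:.3f}")
print(f"P(deploy hits mode)  = {p_hit_deploy:.5f}")
```

With these numbers the evaluation set misses the mode about nine times in ten while deployment almost surely encounters it, which is the regime the premise describes.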

What would settle it

If applying the forecastability loss during fine-tuning fails to reduce held-out forecast error in the password-game or gridworld experiments while preserving primary-task performance, the claim that the loss corrects the identified under-prediction mode would be falsified.

Figures

Figures reproduced from arXiv: 2605.15134 by Scott Niekum, Will Schwarzer.

**Figure 1.** Decomposition of forecast error on canonical distributions. (a) Error decomposition across six distributions at k = 10, R = 10; diamonds are empirical means, and Mixture is an Exp(1) bulk plus a rare shifted-Exp(1) component. The rank bar (blue) is constant, the curvature bar (orange) tracks the sign of $-q''_\theta(y)$ (positive for lighter-than-exponential tails, negative for heavier), the occupancy bar (red)…

**Figure 2.** Tail-shape change under fine-tuning, illustrative. Empirical log-survival of one held-out password's deploy-set scores (n = 1998) plotted against the transformed score $\psi = -\log(-\log p)$ of Section 3.1, before (left) and after (right) forecastability training. The dashed line is the OLS extrapolation fit to the fit-set top-10 scores; the open circle marks the line's predicted worst-rank deploy score at log…

**Figure 3.** Three-axis comparison on the language-model password game. Error bars are seed-level standard errors over n = 10 seeds per condition. The three panels report, respectively, absolute WildChat single-token KL divergence (lower is better; the pretrained baseline has KL = 0 and is not drawn); decades of leak-probability reduction at the worst-rank held-out prompt over the pretrained baseline (higher is better; …

**Figure 4.** Three-axis comparison on the gridworld setting. Bars show mean fold improvement over the pretrained baseline (dashed line at 1×); error bars are seed-level standard errors over n = 30 seeds per condition; higher is better in every panel. Capability preservation uses the policy's mean return on the pre-training task pool; safety uses worst-case regret on the held-out deployment set; forecast precision uses …

**Figure 5.** Two-axis comparison on the 8B password game. Same conventions as …

**Figure 6.** Rank-term coefficient, confirmed by simulation. Even on a perfectly linear tail-quantile curve (Exp(1), so $q(y) = y$, $q'(y) = 1$, $q'' = 0$, isolating term (a) of the decomposition), the inverse-OLS estimator with k = 10 has the predicted non-vanishing finite-k rank bias. We run the actual estimator on m = 10,000 evaluation samples per trial and average over one million independent trials per R. Left: empir…

**Figure 7.** WildChat hazard diagnostics with reference distributions. Top four rows: WildChat scores (length and mean NLL raw; harmful log-prob and Detoxify after Gumbel-prob). Bottom three rows: n = 200,000 samples from Exp(1), Pareto(α = 3), Uniform(0, 1) (Gumbel, Fréchet, reverse-Weibull). Rising hazard on length, harmful, and Detoxify matches the Uniform reference; mean NLL rises slowly.

**Figure 8.** Empirical decomposition at (M, N, k, R) = (5,000, 50,000, 10, 10). The blue rank bar is identical at $0.794\,q'(y)$ across all columns by construction; curvature (orange) and residual (gray) vary by score. Empirical mean errors (black diamonds) match the algebraic sum of the colored bars to within Monte Carlo error. Total NLL is the only score on which the empirical mean is negative; the four body scores ha…

**Figure 9.** k-sensitivity at fixed R = 10. Empirical mean error in $q'(y)$ units against k, on assistant length (blue) and per-token mean NLL (green), alongside the theoretical $\xi(k, R = 10)$ ridge from a synthetic Exp(1) Monte Carlo (black). Length matches the theoretical ridge through the predicted sign-flip near k = 100; per-token mean NLL diverges upward as the curvature term grows with k.

Original abstract

Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper decomposes forecast error in Jones et al.'s finite-k extrapolation and adds a forecastability loss to fix under-prediction from missed rare failure modes, with early experiments showing reduced error in two toy domains.

The core contribution is a finite-k decomposition of the Jones et al. (2025) estimator's error that identifies a built-in over-prediction bias in typical cases, which flips to under-prediction precisely when the evaluation set misses a rare high-failure mode present at deployment. They then introduce a forecastability loss to fine-tune against that specific offset. This is new relative to the cited estimator and directly targets a safety-relevant failure mode in extrapolation.

The two proof-of-concept experiments (language-model password game and RL gridworld) show the loss lowers held-out forecast error while preserving primary-task performance and reaching safety levels comparable to supervised baselines. That is concrete and worth noting. The decomposition itself appears to follow from standard order-statistic arguments under the single-mode tail assumption stated in the abstract. The experiments are small but internally consistent and report the relevant metrics without obvious cherry-picking.

The main soft spot is that the central claim depends on failure scores behaving as order statistics from a fixed distribution whose tail properties carry over across sets. If deployment introduces even modest additional variation, such as a second correlated failure mode or a shift that alters the conditional distribution of the k-th score, the claimed bias and its offset no longer follow directly. The abstract gives no indication that the derivation bounds or removes those interactions, and the stress-test note correctly flags this. The experiments are too narrow to test robustness against such cases.

This work is aimed at researchers focused on pre-deployment failure-rate estimation and rare-event extrapolation in ML safety. A reader already working with Jones-style estimators or similar tail-extrapolation methods will get the most out of the decomposition and the proposed loss.

It is coherent on its own terms and has enough formal grounding plus reproducible experiments to merit referee time rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The paper extends the extrapolation estimator from Jones et al. (2025) for predicting ML model failure rates at deployment scale using the largest k failure scores from a small evaluation set. It provides a finite-k decomposition of the estimator's forecast error, showing a built-in bias toward over-prediction in typical cases (safety-favorable). This bias is offset when the evaluation set misses a rare high-failure mode present in the deployment set, resulting in under-prediction. The authors introduce a forecastability loss as a fine-tuning objective to address this specific failure mode. In two proof-of-concept experiments (a language-model password game and an RL gridworld), fine-tuning with this loss substantially reduces held-out forecast error while preserving primary-task capability and achieving safety levels similar to supervised baselines.

Significance. If the finite-k decomposition is valid and the forecastability loss reliably mitigates the identified under-prediction without degrading other properties, the work could meaningfully advance pre-deployment safety assessment for ML systems by making failure-rate forecasts more robust to distribution shifts in rare events. The explicit identification of a safety-favorable bias direction and its offset condition, combined with reproducible experiments in controlled domains, strengthens the contribution; however, the practical impact depends on whether the decomposition generalizes beyond the assumed tail behavior.

major comments (1)

[finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.

minor comments (2)

[Experiments] The abstract and experiments section refer to 'held-out forecast error' and 'safety similar to that of supervised baselines,' but the precise definitions of these metrics (e.g., how forecast error is computed and what safety metric is used) should be stated explicitly with equations or pseudocode for reproducibility.
[Method] The paper introduces the 'forecastability loss' as a new objective; a short comparison table showing its form relative to standard losses (e.g., cross-entropy or the original extrapolation objective) would clarify the modification.
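To illustrate the kind of explicit definition the first minor comment asks for, here is one hypothetical convention for a signed forecast error. The source does not state the paper's definition; the function name, the sign convention (positive means over-prediction), and the use of the realized worst deployment score as the target are all assumptions made for this sketch.

```python
import numpy as np

def signed_forecast_error(predicted_worst, deploy_scores):
    """Forecast minus realized worst deployment score.

    Hypothetical convention: positive values mean the forecast over-predicted
    the worst failure score, negative values mean it under-predicted.
    """
    realized_worst = float(np.max(deploy_scores))
    return predicted_worst - realized_worst

# Usage with made-up numbers: forecast of 3.0 against a realized worst score of 2.5.
error = signed_forecast_error(3.0, [1.0, 2.5, 2.0])
```

Under this convention, the under-prediction the paper is concerned with would show up as a negative value.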

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's focus on the assumptions underlying the finite-k decomposition and address the major comment point by point below.

Referee: [finite-k decomposition derivation] The finite-k decomposition (described in the main theoretical section following the abstract) assumes that top-k failure scores behave as order statistics from a fixed single-mode tail distribution whose properties are preserved across evaluation and deployment sets. This assumption is load-bearing for the central claim that the over-prediction bias is offset specifically when the evaluation set misses one rare high-failure mode. If deployment introduces even modest additional variation (e.g., a second failure mode whose scores correlate with the first or a shift altering the conditional distribution of the k-th order statistic), the claimed bias offset no longer follows directly, as noted in the skeptic analysis of unmodeled interactions.

Authors: We agree that the finite-k decomposition is derived under the explicit modeling assumption that top-k failure scores follow order statistics from a fixed single-mode tail distribution preserved between evaluation and deployment sets, as stated in the theoretical section. This assumption enables the closed-form identification of the over-prediction bias in the typical case and the precise offset condition when a rare high-failure mode is missed. We acknowledge that the bias-offset claim does not follow directly under additional variation such as multiple modes or correlated shifts, which would require extensions to the decomposition. The manuscript's experiments are conducted in controlled domains where the single-mode tail approximately holds, serving as proof-of-concept. In the revision we add a dedicated paragraph in the discussion section clarifying the scope of the assumption, noting that the safety-favorable bias direction provides insight even if the exact offset does not generalize, and outlining future work on multi-modal extensions. The forecastability loss is motivated by the identified failure mode and empirically reduces forecast error in the reported settings.

revision: partial

Circularity Check

0 steps flagged

No circularity: finite-k decomposition and new loss are independent of fitted inputs

The paper derives a finite-k decomposition of forecast error for the Jones et al. (2025) extrapolation estimator, identifies a typical over-prediction bias and its offset under missed rare modes, then introduces the forecastability loss as a targeted fine-tuning objective. This chain relies on explicit mathematical decomposition of order statistics and experimental validation rather than self-definition, renaming of known results, or load-bearing self-citations. The central claims do not reduce to parameters defined by the paper's own fits or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract; the estimator from prior work is taken as given and the new loss is introduced without visible free parameters or additional axioms in the summary.

axioms (1)

domain assumption: Failure scores in evaluation sets can be extrapolated to deployment scale using the largest k values.
Implicit in the use and analysis of the Jones et al. (2025) estimator referenced in the abstract.
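Written out, the assumption amounts to reading a deployment-scale threshold off a fitted line. The short derivation below uses the log-survival form $\log S(\tau) \approx a\tau + b$ quoted elsewhere on this page; the requirement $a < 0$ and the symbol $\tau_n$ for the score exceeded about once in $n$ deployment samples are notation assumed here.

```latex
\log S(\tau) \approx a\tau + b, \qquad a < 0
\quad\Longrightarrow\quad
S(\tau_n) = \frac{1}{n}
\;\iff\;
a\tau_n + b \approx -\log n
\;\iff\;
\tau_n \approx \frac{-\log n - b}{a}.
```

For an Exp(1) tail, $a = -1$ and $b = 0$, so $\tau_n = \log n$; at $n = 10^5$ this gives $\tau_n \approx 11.51$. Any error in the fitted $a$ is multiplied by $\log n$, which is why the quality of the top-k fit matters so much at deployment scale.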

invented entities (1)

forecastability loss (no independent evidence)
purpose: Fine-tuning objective to reduce under-prediction when rare high-failure modes are missed in evaluation.
New objective proposed to address the identified failure mode of the extrapolation estimator.

pith-pipeline@v0.9.0 · 5691 in / 1277 out tokens · 82158 ms · 2026-05-20T20:22:56.924135+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We give a finite-k decomposition of this estimator's forecast error... curvature term $C_\theta \propto -q''_\theta(y_M)$, occupancy term $G_\theta$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: Gumbel-tail method assumes... $\log S(\tau) \approx a\tau + b$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 1 internal anchor

[1] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In Proceedings of the 12th International Conference on Learning Representations, 2024

[2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics, 19(2):1641–1662, 2025. doi: 10.1214/24-AOAS1998

[3] Nicolas Atienza, Christophe Labreuche, Johanne Cohen, and Michèle Sebag. Provably safeguarding a classifier from OOD and adversarial samples: An extreme value theory approach, 2025

[4] Yuan-Lu Bai, Zhi-Yuan Huang, Henry Lam, and Ding Zhao. Black-box rare-event simulation for safety testing of AI agents: An overview. Journal of the Operations Research Society of China, 13:750–774, 2025. doi: 10.1007/s40305-025-00585-0

[5] Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, and Geoffrey Irving. An alignment safety case sketch based on debate, 2025

[6] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024

[7] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, 2015

[8] Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced AI systems, 2024

[9] Anthony Corso, Kyu-Young Kim, Shubh Gupta, Grace Gao, and Mykel J. Kochenderfer. A deep reinforcement learning approach to rare event estimation, 2022. URL https://arxiv.org/abs/2211.12470
[10] Parisa Davar, Frédéric Godin, and Jose Garrido. Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients. The Journal of Finance and Data Science, 11:100165, 2025. doi: 10.1016/j.jfds.2025.100165. URL https://www.sciencedirect.com/science/article/pii/S2405918825000170
[12] Shripad Vilasrao Deshmukh, Will Schwarzer, and Scott Niekum. Evaluation-aware reinforcement learning, 2025

[13] Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, and Juan P. Garrahan. Rare event analysis of large language models, 2026. URL https://arxiv.org/abs/2602.06791

[14] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022

[15] Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, and Xinbing Wang. Extreme value policy optimization for safe reinforcement learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18772–18793. PMLR, 2025. URL https://proceedings.m...

[16] Irving I. Gringorten. A plotting rule for extreme probability paper. Journal of Geophysical Research, 68(3):813–814, 1963. doi: 10.1029/JZ068i003p00813

[17] Josiah P. Hanna, Philip S. Thomas, Peter Stone, and Scott Niekum. Data-efficient policy evaluation through behavior policy search. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1394–1403. PMLR, 06–11 Aug 2017. URL https://proce...

[18] Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020

[19] John B. Henry, III and Ping-Hung Hsieh. Extreme value analysis for partitioned insurance losses. Variance, 3(2):214–238, 2009

[20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
[21] Zhifeng Jiang, Zhihua Jin, and Guoliang He. PromptKeeper: Safeguarding system prompts for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2712–2728, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.147. URL https://aclanthology...

[22] Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, and Mrinank Sharma. Forecasting rare language model behaviors, 2025

[23] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long tasks, 2025

[24] LDNOOBW Contributors. List of dirty, naughty, obscene, and otherwise bad words (English). https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words, 2023

[25] Tianyu Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In Proceedings of the 9th International Conference on Learning Representations, 2021

[26] Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6781–

[27] METR. Autonomy evaluation resources. METR blog, March 2024. URL https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
[28] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A closer look at system prompt robustness, 2025. URL https://arxiv.org/abs/2502.12197
[30] Matthew O'Kelly, Aman Sinha, Hongseok Namkoong, John C. Duchi, and Russ Tedrake. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Advances in Neural Information Processing Systems, volume 31, 2018

[31] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448. Association for Computational Linguistics, 2022

[32] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, et al. Evaluating frontier models for dangerous capabilities, 2024

[33] Qwen Team. Qwen3 technical report, 2025. Model card: https://huggingface.co/Qwen/Qwen3-0.6B

[34] David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O'Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, et al. HCAST: Human-calibrated autonomy software tasks, 2025

[35] Jordan Richards and Raphaël Huser. Extreme quantile regression with deep learning, 2024

[36] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the 8th International Conference on Learning Representations, 2020

[37] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do Anything Now": Characterizing and evaluating In-The-Wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2024. doi: 10.1145/3658644.3670388

[38] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks, 2023
[39] Karthik Somayaji N. S., Yu Wang, Malachi Schram, Ján Drgoňa, Mahantesh M. Halappanavar, Frank Liu, and Peng Li. Extreme risk mitigation in reinforcement learning using extreme value theory. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=098mb06uhA

[40] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2993–2999, 2015

[41] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game. In Proceedings of the 12th International Conference on Learning Representations, 2024

[42] Jonathan Uesato, Ananya Kumar, Csaba Szepesvári, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy Dvijotham, Nicolas Heess, and Pushmeet Kohli. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. In Proceedings of the 7th International Conference on Learning Representations, 2019

[43] Stefan Webb, Tom Rainforth, Yee Whye Teh, and M. Pawan Kumar. A statistical approach to assessing neural network robustness. In Proceedings of the 7th International Conference on Learning Representations, 2019

[44] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In Proceedings of the 6th International Conference on Learning Representations, 2018

[45] Jakob Benjamin Wessel, Maybritt Schillinger, Frank Kwasniok, and Sam Allen. Enforcing tail calibration when training probabilistic forecast models, 2025

[46] Gabriel Wu and Jacob Hilton. Estimating the probabilities of rare outputs in language models. In Proceedings of the 13th International Conference on Learning Representations, 2025

[47] Mark Xu. Estimating tail risk in neural networks. Alignment Research Center blog, September 2024. URL https://alignment.org/blog/estimating-tail-risk/. Blog post describing joint research with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu

[48] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2405.01470

[49] Antanas Žilinskas, Robert N. Shorten, and Jakub Mareček. EVEREST: An evidential, tail-aware transformer for rare-event time-series forecasting. In Proceedings of the 14th International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2601.19022
[50]

Zico Kolter, and Matt Fredrikson

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. 11 Training ML Models with Predictable Failures A Experimental details This appendix collects the specific configuration choices for the two canonical runs that produced Figures 3 and 4. ...

work page 2023
[51]

Future work should systematically examine how forecastability scales with model size

– and hence why forecastability training has less potential improvement to offer. Future work should systematically examine how forecastability scales with model size. G A finite-kdecomposition for the inverse-OLS Gumbel-tail extrapolator Headline result.We prove a finite- k decomposition of the forecast error of the inverse-OLS Gumbel-tail extrapolator (...

work page

[1] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In Proceedings of the 12th International Conference on Learning Representations, 2024.

[2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics, 19(2):1641–1662, 2025. doi: 10.1214/24-AOAS1998

[3] Nicolas Atienza, Christophe Labreuche, Johanne Cohen, and Michèle Sebag. Provably safeguarding a classifier from OOD and adversarial samples: An extreme value theory approach, 2025.

[4] Yuan-Lu Bai, Zhi-Yuan Huang, Henry Lam, and Ding Zhao. Black-box rare-event simulation for safety testing of AI agents: An overview. Journal of the Operations Research Society of China, 13:750–774, 2025. doi: 10.1007/s40305-025-00585-0

[5] Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, and Geoffrey Irving. An alignment safety case sketch based on debate, 2025.

[6] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024.

[7] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems, volume 28, 2015.

[8] Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced AI systems, 2024.

[9] Anthony Corso, Kyu-Young Kim, Shubh Gupta, Grace Gao, and Mykel J. Kochenderfer. A deep reinforcement learning approach to rare event estimation, 2022. URL https://arxiv.org/abs/2211.12470

[10] Parisa Davar, Frédéric Godin, and Jose Garrido. Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients. The Journal of Finance and Data Science, 11:100165, 2025. doi: 10.1016/j.jfds.2025.100165. URL https://www.sciencedirect.com/science/article/pii/S2405918825000170

[12] Shripad Vilasrao Deshmukh, Will Schwarzer, and Scott Niekum. Evaluation-aware reinforcement learning, 2025.

[13] Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, and Juan P. Garrahan. Rare event analysis of large language models, 2026. URL https://arxiv.org/abs/2602.06791

[14] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.

[15] Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, and Xinbing Wang. Extreme value policy optimization for safe reinforcement learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 18772–18793. PMLR, 2025. URL https://proceedings.m...

[16] Irving I. Gringorten. A plotting rule for extreme probability paper. Journal of Geophysical Research, 68(3):813–814, 1963. doi: 10.1029/JZ068i003p00813

[17] Josiah P. Hanna, Philip S. Thomas, Peter Stone, and Scott Niekum. Data-efficient policy evaluation through behavior policy search. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1394–1403. PMLR, 06–11 Aug 2017. URL https://proce...

[18] Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020.

[19] John B. Henry, III and Ping-Hung Hsieh. Extreme value analysis for partitioned insurance losses. Variance, 3(2):214–238, 2009.

[20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

[21] Zhifeng Jiang, Zhihua Jin, and Guoliang He. PromptKeeper: Safeguarding system prompts for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2712–2728, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.147. URL https://aclanthology...

[22] Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, and Mrinank Sharma. Forecasting rare language model behaviors, 2025.

[23] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long tasks, 2025.

[24] LDNOOBW Contributors. List of dirty, naughty, obscene, and otherwise bad words (English). https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words, 2023.

[25] Tianyu Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In Proceedings of the 9th International Conference on Learning Representations, 2021.

[26] Evan Zheran Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6781–

[27] METR. Autonomy evaluation resources. METR blog, March 2024. URL https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/

[28] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A closer look at system prompt robustness, 2025. URL https://arxiv.org/abs/2502.12197

[30] Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, John C. Duchi, and Russ Tedrake. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Advances in Neural Information Processing Systems, volume 31, 2018.

[31] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448. Association for Computational Linguistics, 2022.

[32] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, et al. Evaluating frontier models for dangerous capabilities, 2024.

[33] Qwen Team. Qwen3 technical report, 2025. Model card: https://huggingface.co/Qwen/Qwen3-0.6B

[34] David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O’Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, et al. HCAST: Human-calibrated autonomy software tasks, 2025.

[35] Jordan Richards and Raphaël Huser. Extreme quantile regression with deep learning, 2024.

[36] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the 8th International Conference on Learning Representations, 2020.

[37] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do Anything Now”: Characterizing and evaluating In-The-Wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2024. doi: 10.1145/3658644.3670388

[38] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks, 2023.

[39] Karthik Somayaji N. S., Yu Wang, Malachi Schram, Ján Drgoňa, Mahantesh M. Halappanavar, Frank Liu, and Peng Li. Extreme risk mitigation in reinforcement learning using extreme value theory. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=098mb06uhA

[40] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2993–2999, 2015.

[41] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game. In Proceedings of the 12th International Conference on Learning Representations, 2024.

[42] Jonathan Uesato, Ananya Kumar, Csaba Szepesvári, Tom Erez, Avraham Ruderman, Keith Anderson, Krishnamurthy Dvijotham, Nicolas Heess, and Pushmeet Kohli. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. In Proceedings of the 7th International Conference on Learning Representations, 2019.

[43] Stefan Webb, Tom Rainforth, Yee Whye Teh, and M. Pawan Kumar. A statistical approach to assessing neural network robustness. In Proceedings of the 7th International Conference on Learning Representations, 2019.

[44] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In Proceedings of the 6th International Conference on Learning Representations, 2018.

[45] Jakob Benjamin Wessel, Maybritt Schillinger, Frank Kwasniok, and Sam Allen. Enforcing tail calibration when training probabilistic forecast models, 2025.

[46] Gabriel Wu and Jacob Hilton. Estimating the probabilities of rare outputs in language models. In Proceedings of the 13th International Conference on Learning Representations, 2025.

[47] Mark Xu. Estimating tail risk in neural networks. Alignment Research Center blog, September 2024. URL https://alignment.org/blog/estimating-tail-risk/. Blog post describing joint research with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu.

[48] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2405.01470

[49] Antanas Žilinskas, Robert N. Shorten, and Jakub Mareček. EVEREST: An evidential, tail-aware transformer for rare-event time-series forecasting. In Proceedings of the 14th International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2601.19022

[50] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023.

A Experimental details

This appendix collects the specific configuration choices for the two canonical runs that produced Figures 3 and 4. ...

– and hence why forecastability training has less potential improvement to offer. Future work should systematically examine how forecastability scales with model size.

G A finite-k decomposition for the inverse-OLS Gumbel-tail extrapolator

Headline result. We prove a finite-k decomposition of the forecast error of the inverse-OLS Gumbel-tail extrapolator (...