SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings

Edmund Lau; Jie Meng; Taylor T Johnson; Xiaowei Huang; Yi Dong; Yingjie Wang

arxiv: 2606.29623 · v1 · pith:22WK77B2new · submitted 2026-06-28 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 07:00 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords rare event estimation · subset simulation · AI safety · embeddings · supermartingale · adversarial fraction · misclassification probability · jailbreak detection

The pith

SCARCE estimates rare AI failure probabilities by replacing handcrafted performance functions with learned embeddings and geometric rulers in subset simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SCARCE to compute extremely low probabilities of events such as model misclassifications or adversarial successes, where direct sampling is infeasible. It substitutes the classical requirement for a manually engineered scalar performance function with data-derived latent representations scored by geometric rulers that measure distance to failure regions. Adaptive thresholds then build nested intermediate events, and the whole procedure is bounded by a non-negative supermartingale that supplies a valid high-probability upper envelope even if sampling stops early. Experiments on MNIST show 400-500 times lower mean absolute error than grid-searched traditional subset simulation, and the same pipeline applied to LLM hidden states yields low relative error for adversarial fractions at or above 0.001. A reader would care because the approach removes the domain-expertise barrier that has limited rare-event analysis in new AI systems.
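
The claim that direct sampling is infeasible can be made concrete with standard binomial arithmetic: a naive Monte Carlo estimate of a probability p from N samples has relative standard error sqrt((1 - p) / (N p)). The sketch below solves that for N. The function name and the example probabilities are illustrative assumptions, not values taken from the paper.

```python
import math

def mc_samples_needed(p: float, rel_se: float) -> int:
    """Samples needed for a naive Monte Carlo estimate of p to reach relative standard error rel_se."""
    # Var(p_hat) = p (1 - p) / N, so relative SE = sqrt((1 - p) / (N p)); solve for N.
    return math.ceil((1.0 - p) / (p * rel_se ** 2))

# Illustrative targets only (hypothetical, not from the paper).
for p in (1e-3, 1e-6, 1e-9):
    n = mc_samples_needed(p, 0.1)
    print(f"p = {p:g}: about {n:,} samples for 10% relative standard error")
```

The budget grows like 1/p, which is the motivation for splitting the rare event into a chain of more frequent intermediate events.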

Core claim

SCARCE replaces the scalar performance function whose sublevel sets define nested events in classical subset simulation with learned latent representations scored by geometric rulers such as PCA. Adaptive thresholding on these scores constructs the nested events directly from data. The construction is formalised as a non-negative supermartingale, producing a high-probability upper envelope that remains valid under early stopping. On MNIST misclassification the method attains 400-500 times lower mean absolute error than grid-searched traditional subset simulation and eliminates systematic over-counting. On Llama-Guard-3-8B hidden states a PCA ruler reaches 2.6 percent mean relative error for adversarial fractions $\eta \geq 10^{-3}$.
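
For orientation, the decomposition that subset simulation relies on can be written as follows. This is the standard textbook identity; the symbols $A_i$, $m$, $s$ and $\tau_i$ are notation assumed here, and reading the SCARCE events as superlevel sets of a ruler score is one natural interpretation of the summary above, not a formula quoted from the paper.

```latex
% Nested events shrinking onto the failure event F:
A_1 \supset A_2 \supset \cdots \supset A_m = F,
\qquad
P(F) = P(A_1) \prod_{i=1}^{m-1} P(A_{i+1} \mid A_i).

% Assumed reading for SCARCE: a ruler score s on embeddings and
% increasing data-driven thresholds \tau_1 \le \tau_2 \le \cdots \le \tau_m:
A_i = \{\, x : s(x) \ge \tau_i \,\}.
```

Each factor is a moderate probability that can be estimated with a modest sample, even when their product is far too small for direct sampling.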

What carries the argument

Learned latent representations scored by geometric rulers (for example PCA) that measure proximity to failure regions and permit data-driven construction of nested events inside the supermartingale bound.
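
A minimal sketch of what a PCA ruler could look like, assuming the ruler is the projection of an embedding onto the first principal direction, oriented so that known failure examples score higher. The function names, the power-iteration shortcut, and the orientation rule are all assumptions made for illustration; the paper's actual ruler construction may differ.

```python
import math
import random

def first_principal_direction(points, iters=100, seed=0):
    """Mean and first principal direction via power iteration on the sample covariance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centred = [[p[j] - mean[j] for j in range(dim)] for p in points]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # w = X^T (X v) is proportional to (covariance matrix) @ v.
        proj = [sum(c[j] * v[j] for j in range(dim)) for c in centred]
        w = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def make_pca_ruler(embeddings, failure_embeddings):
    """Return a scoring function; larger scores mean 'closer to failure' (assumed convention)."""
    mean, v = first_principal_direction(embeddings)

    def raw(x):
        return sum((x[j] - mean[j]) * v[j] for j in range(len(v)))

    # Principal directions have arbitrary sign, so orient using known failures.
    sign = 1.0 if sum(raw(f) for f in failure_embeddings) >= 0.0 else -1.0
    return lambda x: sign * raw(x)

# Toy 2-D "embeddings" (hypothetical data): most variation lies along the first axis.
embeddings = [(float(i), 0.1 * ((i % 3) - 1)) for i in range(-10, 11)]
failures = [(9.0, 0.0), (10.0, 0.0)]
ruler = make_pca_ruler(embeddings, failures)
print(ruler((10.0, 0.0)), ruler((-10.0, 0.0)))
```

The orientation step matters because a ruler that points away from the failure region would make the thresholded events shrink in the wrong direction.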

If this is right

Handcrafted performance functions are no longer required to apply subset simulation to new AI domains.
Rare-event probabilities in high-dimensional models can be bounded with far smaller sample budgets while preserving validity.
The same ruler construction transfers across threat models after a single recalibration step.
A KL-based directional criterion provides a consistent way to select among candidate rulers without ground-truth probabilities.
Early stopping of the cascade does not invalidate the final upper envelope.
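
To illustrate the KL-based directional criterion in the list above, the sketch below scores a ruler by KL(p_good || p_bad) computed on one-dimensional ruler scores. It assumes both score distributions are fitted as Gaussians, which is a simplification chosen here for a closed form; the source does not say how the paper estimates the two distributions, and the function name is hypothetical.

```python
import math
import statistics

def gaussian_kl(good_scores, bad_scores):
    """KL(p_good || p_bad) under an assumed Gaussian fit to each set of ruler scores."""
    mu_g, mu_b = statistics.fmean(good_scores), statistics.fmean(bad_scores)
    var_g, var_b = statistics.variance(good_scores), statistics.variance(bad_scores)
    # Closed form for KL(N(mu_g, var_g) || N(mu_b, var_b)).
    return (0.5 * math.log(var_b / var_g)
            + (var_g + (mu_g - mu_b) ** 2) / (2.0 * var_b)
            - 0.5)

# Hypothetical ruler scores for benign and failing inputs.
good = [0.0, 1.0, 2.0, 3.0]
bad_far = [10.0, 11.0, 12.0, 13.0]
bad_near = [1.0, 2.0, 3.0, 4.0]
print(gaussian_kl(good, bad_far), gaussian_kl(good, bad_near))
```

The criterion is directional because KL is asymmetric: swapping the arguments changes the value whenever the two fitted variances differ.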

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If embeddings from other modalities preserve failure geometry, SCARCE could extend to rare-event estimation in vision-language or multimodal systems.
The supermartingale envelope might be combined with importance sampling to further reduce variance at the lowest probability levels.
For adversarial fractions below $10^{-4}$, additional ruler calibration or higher-dimensional embeddings may be needed to keep relative error below 5 percent.
Applying the method to safety benchmarks beyond jailbreaks, such as reward-model failures, would test whether the KL ranking of rulers generalises.

Load-bearing premise

The learned latent representations and chosen geometric rulers must accurately reflect proximity to failure regions so that the constructed nested events remain valid for the supermartingale bound.

What would settle it

A large-scale Monte Carlo reference on Llama-Guard-3-8B for $\eta = 10^{-3}$ that deviates from the reported 2.6 percent relative error by more than the bootstrap half-width of 27.9 percent would falsify the accuracy claim for the PCA ruler.
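
The bootstrap half-width mentioned here can be illustrated with a percentile bootstrap on binary outcomes. This is a generic sketch under assumed settings (percentile interval, 95 percent level, hypothetical function name and toy data); the source does not describe the paper's exact bootstrap procedure.

```python
import random

def bootstrap_rel_half_width(outcomes, n_boot=500, level=0.95, seed=0):
    """Point estimate and relative half-width of a percentile bootstrap interval.

    outcomes: list of 0/1 indicators (1 = rare event observed).
    Assumes at least one event was observed, otherwise the ratio is undefined.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    p_hat = sum(outcomes) / n
    # Resample with replacement and recompute the event frequency each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = means[int((1.0 - level) / 2.0 * n_boot)]
    upper = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return p_hat, (upper - lower) / (2.0 * p_hat)

# Toy reference set (hypothetical): 20 events in 2000 trials.
outcomes = [1] * 20 + [0] * 1980
p_hat, rel_hw = bootstrap_rel_half_width(outcomes)
print(p_hat, rel_hw)
```

With so few observed events the relative half-width is large, which is why a small relative error against such a reference should be read together with the reference's own uncertainty.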

Figures

Figures reproduced from arXiv: 2606.29623 by Edmund Lau, Jie Meng, Taylor T Johnson, Xiaowei Huang, Yi Dong, Yingjie Wang.

**Figure 2.** Overall ruler accuracy: (a) average accuracy; (b) per-turn accuracy; (c) per-fleet accuracy; …

**Figure 3.** Threshold percentile sweep in PC1 family.

**Figure 4.** Selector diagnostics against downstream ruler-family error. Worst-turn directional KL is …

**Figure 5.** Step 1 ruler pre-screening on 50 flipped MNIST seeds. Bad-anchored variants dominate the …

**Figure 6.** Step 1 efficiency budgets. The rulers selected by accuracy and reliability also avoid the …

**Figure 7.** Step 1 Pareto views. Bad-anchored geometric rulers occupy the low-error and moderate …

**Figure 8.** Step 2 MNIST comparison with accuracy and mismatch counts. SCARCE rulers keep …

**Figure 9.** Miscounts over 100 held-out seeds. Traditional SS attains zero false negatives at the cost of …

**Figure 10.** Seed-wise distribution of $|\hat{p}_f - p_f^{\mathrm{SMC}}|$ across 100 MNIST seeds for the top SCARCE rulers and Traditional SS. The SCARCE distributions are tightly concentrated near zero while Traditional SS shows a heavy upper tail driven by systematic false positives.

[Plot residue from the following page, headed "Part II: LLM Jailbreak Sweeps": panel (a) "Variance: SE drops monotonically"; x-axis N (particles per run); y-axis SE of $\hat{p}_f$; series turn 1 to turn 5; reference trend ∝ 1/√N; remainder truncated.]

**Figure 11.** Population-size sweep for PC1/p75. Increasing N reduces Monte Carlo variation as expected, while the mean error plateaus once the sample budget is large enough for stable cascade calibration.

Original abstract

Rare events govern the safety profile of modern AI systems, yet their probabilities are extremely difficult to estimate: direct Monte Carlo requires prohibitive sample budgets. Subset Simulation (SS) addresses this by decomposing a rare-event probability into moderate conditional probabilities over nested intermediate events. However, classical SS requires a handcrafted scalar performance function whose sublevel sets define those events, demanding detailed knowledge of the failure geometry and limiting transfer to new domains. We propose SCARCE (Scalable Cascade Analysis for Rare-event Characterisation via Embeddings), which replaces the performance function with learned latent representations and geometric rulers that score proximity to failure regions. Adaptive thresholding constructs nested intermediate events directly from data. We formalise SCARCE through a non-negative supermartingale, yielding a high-probability upper envelope that remains valid under early stopping. On MNIST misclassification, where dense Monte Carlo provides ground truth, SCARCE achieves approximately 400--500 times lower mean absolute error than grid-searched traditional SS while eliminating systematic over-counting. We then study PAIR-style LLM jailbreaks under a fleet-level threat model with adversarial fraction $\eta$. On Llama-Guard-3-8B hidden states, a PCA-based ruler attains 2.6% mean relative error for $\eta \geq 10^{-3}$ against finite-sample references whose average bootstrap relative half-width is 27.9%, and transfers to a GCG-style corpus with 2.93% relative error after recalibration. A directional criterion $\mathrm{KL}(p_{\mathrm{good}}\,\|\,p_{\mathrm{bad}})$ ranks rulers consistently with estimation error (Spearman $\rho=0.83$).
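
As background for the abstract, here is a compact sketch of classical subset simulation on a toy problem: estimating P(X >= target) for a standard normal X with the handcrafted score g(x) = x. It shows the baseline technique that SCARCE modifies, not SCARCE itself. The function name, the Metropolis proposal, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def subset_simulation(score, target, n=2000, p0=0.1, step=1.0, max_levels=20, seed=0):
    """Classical subset simulation for P(score(X) >= target) with X ~ N(0, 1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    prob = 1.0
    for _ in range(max_levels):
        ranked = sorted(xs, key=score, reverse=True)
        # Adaptive intermediate level: the score of the top-p0 sample.
        tau = score(ranked[int(p0 * len(ranked)) - 1])
        if tau >= target:
            break  # the current population already reaches the rare event
        seeds = [x for x in xs if score(x) >= tau]
        prob *= len(seeds) / len(xs)  # estimate of P(A_{i+1} | A_i)
        # Repopulate conditional on score >= tau with short Metropolis chains.
        chain_len = max(1, n // len(seeds))
        xs = []
        for x in seeds:
            for _ in range(chain_len):
                y = x + rng.gauss(0.0, step)
                log_ratio = 0.5 * (x * x - y * y)  # log of the N(0, 1) density ratio
                if score(y) >= tau and rng.random() < math.exp(min(0.0, log_ratio)):
                    x = y
                xs.append(x)
    # Final factor: fraction of the last population inside the rare event.
    return prob * sum(score(x) >= target for x in xs) / len(xs)

estimate = subset_simulation(lambda x: x, target=3.0)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
print(estimate, exact)
```

The only problem-specific ingredient is `score`. In the toy case it is known exactly; the abstract's point is that for AI failures no such function is available by hand, so a ruler on learned embeddings takes its place.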

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCARCE replaces handcrafted performance functions with embeddings and geometric rulers in subset simulation, backed by a supermartingale for bounds, and shows large error drops on MNIST and LLM tasks, but the adaptive ruler construction needs explicit proof that nesting and the martingale property survive data-dependent thresholds.

SCARCE replaces the scalar performance function in subset simulation with learned embeddings and simple geometric rulers such as PCA projections. Adaptive thresholds on those ruler scores define the nested events directly from samples, and a non-negative supermartingale supplies the high-probability envelope that stays valid under early stopping.

The concrete gains are the clearest part. On MNIST misclassification the method cuts mean absolute error by a factor of roughly 400-500 relative to grid-searched classical SS and removes the systematic over-count. On Llama-Guard-3-8B hidden states a PCA ruler reaches 2.6% mean relative error for adversarial fractions $\eta \geq 10^{-3}$, well inside the 27.9% average bootstrap relative half-width of the finite-sample references, and the same approach transfers to a GCG-style corpus with 2.93% relative error after recalibration. The $\mathrm{KL}(p_{\mathrm{good}}\,\|\,p_{\mathrm{bad}})$ ranking also lines up with observed error (Spearman $\rho = 0.83$), which is a useful practical check.

The main technical move—embedding-based event construction plus the supermartingale formalization—is new enough to matter for rare-event work where an explicit failure score is unavailable. It opens the method to latent representations that already exist in many AI pipelines.

The soft spot is the one flagged in the stress-test note. Standard subset simulation gets nesting for free from sublevel sets; here the rulers and thresholds are chosen from the same data used for estimation. If the proof does not show that this dependence preserves both the nesting A1 ⊃ A2 ⊃ … and the supermartingale property without extra bias, the reported error reductions rest on an unverified extension. The abstract asserts validity but does not spell out how the argument handles the data-dependent step.

The paper is aimed at researchers who need rare-event probabilities for AI safety or reliability and who already work with embeddings. A reader in Monte Carlo methods or LLM red-teaming would find the empirical comparisons and the ruler-ranking criterion useful. It deserves a serious referee because the problem is concrete, the empirical claims are sharp, and the proposed fix is testable even if the theoretical justification for the adaptive case requires more detail.

Referee Report

2 major / 2 minor

Summary. The paper proposes SCARCE, which replaces the handcrafted scalar performance function in classical Subset Simulation with learned embeddings and geometric rulers (e.g., PCA projections) to construct nested events A1 ⊃ A2 ⊃ ⋯ via adaptive thresholding. It formalizes the procedure via a non-negative supermartingale that supplies a high-probability upper envelope valid under early stopping, and reports empirical gains: ~400–500× lower MAE than grid-searched SS on MNIST misclassification (with elimination of over-counting) and 2.6% mean relative error on Llama-Guard-3-8B hidden states for adversarial fraction η ≥ 10^{-3} in the PAIR-style jailbreak setting.

Significance. If the supermartingale bound remains valid under the adaptive, data-driven construction, the approach would remove a major practical barrier to rare-event estimation in AI safety by enabling transfer across domains without domain-specific performance functions. The reported error reductions relative to both dense MC ground truth and bootstrap references, together with the KL(p_good ‖ p_bad) ruler-ranking criterion (Spearman ρ=0.83), would constitute a concrete advance in scalable rare-event tools.
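
The Spearman figure quoted for the ruler-ranking criterion is a rank correlation between a selector value per ruler and that ruler's downstream error. A minimal pure-Python sketch of the computation follows; the function names are hypothetical and the numbers are made up for illustration, not taken from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up selector values and errors for four hypothetical rulers.
kl_values = [0.2, 0.9, 1.5, 3.1]
errors = [0.40, 0.22, 0.25, 0.05]
print(spearman(kl_values, errors))
```

Because only ranks enter the statistic, it measures whether the selector orders rulers consistently with their error, not whether the relationship is linear.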

major comments (2)

[Formalization via non-negative supermartingale] The central theoretical claim—that adaptive thresholding on learned rulers produces nested events whose conditional probabilities multiply to a valid supermartingale bound—requires an explicit argument showing that ruler selection and threshold choice from the same samples do not introduce data-dependent bias that violates the non-negative supermartingale property. The abstract asserts validity under early stopping, but the dependence structure induced by data-driven event definition is not addressed in the provided description and is load-bearing for the high-probability envelope.
[MNIST misclassification results] On the MNIST experiments, the 400–500× MAE reduction and elimination of systematic over-counting are reported against grid-searched traditional SS; however, it is unclear whether the comparison equalizes total sample budget, whether the traditional SS performance function was chosen with knowledge of the failure geometry, and whether the dense MC reference is sufficiently precise to support the magnitude of the claimed improvement.

minor comments (2)

[Ruler ranking] The precise definition of the directional criterion KL(p_good ‖ p_bad) and how the good/bad distributions are estimated from the embeddings should be stated explicitly.
[Method overview] Notation for the nested events A_i and the adaptive thresholds should be introduced with a short diagram or pseudocode to clarify the construction before the supermartingale argument.
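
The kind of pseudocode the last minor comment asks for might look like the sketch below, which builds nested events by repeatedly thresholding ruler scores at a fixed percentile of the current survivors. It is a static illustration on a fixed sample under assumed conventions (the 75th-percentile default echoes the "PC1/p75" label in the Figure 11 caption, whose exact meaning is assumed here); the paper's cascade presumably draws new samples at each level, and the function name is hypothetical.

```python
def nested_events(scores, percentile=75, levels=3):
    """Thresholds tau_i and nested index sets A_i = {k : scores[k] >= tau_i}."""
    survivors = list(range(len(scores)))
    thresholds, events = [], []
    for _ in range(levels):
        ranked = sorted(scores[k] for k in survivors)
        # Data-driven threshold: the chosen percentile of the surviving scores.
        tau = ranked[int(percentile / 100 * (len(ranked) - 1))]
        survivors = [k for k in survivors if scores[k] >= tau]
        thresholds.append(tau)
        events.append(set(survivors))
        if len(survivors) < 2:
            break  # too few points left to set another level
    return thresholds, events

# Toy ruler scores (hypothetical): 0.0, 1.0, ..., 99.0.
scores = [float(k) for k in range(100)]
thresholds, events = nested_events(scores)
print(thresholds, [len(a) for a in events])
```

Nesting holds by construction here because each level filters only the previous survivors and the thresholds never decrease. The referee's concern is whether the probabilistic guarantee survives when those thresholds are chosen from the same data used for estimation.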

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of SCARCE to advance rare-event estimation in AI safety applications. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core claims.

Referee: [Formalization via non-negative supermartingale] The central theoretical claim—that adaptive thresholding on learned rulers produces nested events whose conditional probabilities multiply to a valid supermartingale bound—requires an explicit argument showing that ruler selection and threshold choice from the same samples do not introduce data-dependent bias that violates the non-negative supermartingale property. The abstract asserts validity under early stopping, but the dependence structure induced by data-driven event definition is not addressed in the provided description and is load-bearing for the high-probability envelope.

Authors: We agree that an explicit argument is required to confirm preservation of the supermartingale property under data-driven ruler selection and adaptive thresholding. The manuscript constructs the nested events sequentially from the learned embeddings, with the non-negative supermartingale defined via the product of conditional probabilities estimated on subsequent independent samples after threshold determination. This ordering ensures the increments remain martingale differences with respect to the filtration that includes the selection step. To make this rigorous and address the dependence structure directly, we will add a dedicated lemma and proof sketch in the revised theoretical section demonstrating that the adaptive construction does not introduce bias violating non-negativity or the supermartingale property, while retaining validity under early stopping. revision: yes
Referee: [MNIST misclassification results] On the MNIST experiments, the 400–500× MAE reduction and elimination of systematic over-counting are reported against grid-searched traditional SS; however, it is unclear whether the comparison equalizes total sample budget, whether the traditional SS performance function was chosen with knowledge of the failure geometry, and whether the dense MC reference is sufficiently precise to support the magnitude of the claimed improvement.

Authors: The MNIST comparison equalized total sample budget by matching the number of classifier evaluations (forward passes) across SCARCE and grid-searched traditional SS. The baseline performance function was a standard handcrafted scalar (pixel-space Euclidean distance to the decision boundary), selected from the general problem setup without access to the specific misclassification geometry. The dense Monte Carlo reference used a large sample size whose bootstrap-estimated precision is sufficient to support the reported MAE differences, as the variance is orders of magnitude smaller than the observed gap. We will revise the experimental section to state these details explicitly, including exact budgets and the baseline function definition. revision: yes
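
The first response describes estimating each conditional probability on independent samples drawn after the threshold is fixed. One simple way to picture that separation is sample splitting, sketched below: the threshold is set on one half of the scores and the exceedance frequency is measured on the other half. This is an illustrative assumption about how such independence could be arranged, with a hypothetical function name, and not the authors' stated procedure.

```python
import random

def split_level_estimate(scores, percentile=75, seed=0):
    """Choose a threshold on one half of the scores, estimate exceedance on the other half."""
    rng = random.Random(seed)
    shuffled = list(scores)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    calibration, fresh = shuffled[:half], shuffled[half:]
    ranked = sorted(calibration)
    tau = ranked[int(percentile / 100 * (len(ranked) - 1))]
    # The fresh half played no part in choosing tau, so this frequency is an
    # unbiased estimate of the exceedance probability at the chosen threshold.
    cond_prob = sum(s >= tau for s in fresh) / len(fresh)
    return tau, cond_prob

# Toy ruler scores (hypothetical).
tau, cond_prob = split_level_estimate([float(k) for k in range(1000)])
print(tau, cond_prob)
```

Splitting halves the data available for each task, so a real implementation would trade this loss of efficiency against the cleaner independence argument.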

Circularity Check

0 steps flagged

No significant circularity; supermartingale formalization remains independent of data-driven ruler construction

The abstract states that SCARCE replaces the performance function with learned embeddings and geometric rulers, then 'formalise[s] SCARCE through a non-negative supermartingale, yielding a high-probability upper envelope that remains valid under early stopping.' No equations or steps are shown that define the supermartingale in terms of the fitted rulers or that rename a fitted quantity as a prediction. The MNIST evaluation uses independent dense Monte Carlo as ground truth, providing an external benchmark. No self-citation chains, ansatz smuggling, or self-definitional reductions appear in the given text. The derivation is therefore treated as self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that embeddings preserve failure geometry and that adaptive thresholds produce valid nested events; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)

ruler choice and adaptive thresholds
Geometric rulers (for example PCA) and their thresholds are selected or calibrated from data; the KL criterion ranks candidate rulers rather than acting as a ruler itself.

axioms (2)

domain assumption: Latent embeddings allow geometric rulers to score proximity to failure regions.
This replaces the classical scalar performance function.
standard math: The non-negative supermartingale property yields a valid high-probability upper envelope under early stopping.
Invoked to guarantee the bound remains valid.
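
The standard result behind the second axiom is Ville's inequality, stated below in generic notation. The process $M_t$, the level $a$ and the error rate $\alpha$ are symbols assumed here; the source does not show how the paper defines its supermartingale, so this is only the textbook statement.

```latex
% Ville's inequality for a non-negative supermartingale (M_t)_{t \ge 0}:
\Pr\Big( \sup_{t \ge 0} M_t \ge a \Big) \le \frac{\mathbb{E}[M_0]}{a},
\qquad a > 0.

% Choosing a = \mathbb{E}[M_0] / \alpha gives a time-uniform envelope:
\Pr\Big( \exists\, t \ge 0 : M_t \ge \mathbb{E}[M_0] / \alpha \Big) \le \alpha .
```

Because the bound holds simultaneously for all t, it also holds at any data-dependent stopping time, which is the sense in which an envelope of this kind stays valid under early stopping.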

