pith. sign in

arxiv: 2606.29623 · v1 · pith:22WK77B2new · submitted 2026-06-28 · 💻 cs.AI · cs.LG

SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings

Pith reviewed 2026-06-30 07:00 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords rare event estimationsubset simulationAI safetyembeddingssupermartingaleadversarial fractionmisclassification probabilityjailbreak detection
0
0 comments X

The pith

SCARCE estimates rare AI failure probabilities by replacing handcrafted performance functions with learned embeddings and geometric rulers in subset simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SCARCE to compute extremely low probabilities of events such as model misclassifications or adversarial successes, where direct sampling is infeasible. It substitutes the classical requirement for a manually engineered scalar performance function with data-derived latent representations scored by geometric rulers that measure distance to failure regions. Adaptive thresholds then build nested intermediate events, and the whole procedure is bounded by a non-negative supermartingale that supplies a valid high-probability upper envelope even if sampling stops early. Experiments on MNIST show 400-500 times lower mean absolute error than grid-searched traditional subset simulation, and the same pipeline applied to LLM hidden states yields low relative error for adversarial fractions at or above 0.001. A reader would care because the approach removes the domain-expertise barrier that has limited rare-event analysis in new AI systems.

Core claim

SCARCE replaces the scalar performance function whose sublevel sets define nested events in classical subset simulation with learned latent representations scored by geometric rulers such as PCA. Adaptive thresholding on these scores constructs the nested events directly from data. The construction is formalised as a non-negative supermartingale, producing a high-probability upper envelope that remains valid under early stopping. On MNIST misclassification the method attains 400-500 times lower mean absolute error than grid-searched traditional subset simulation and eliminates systematic over-counting. On Llama-Guard-3-8B hidden states a PCA ruler reaches 2.6 percent mean relative error for

What carries the argument

Learned latent representations scored by geometric rulers (for example PCA) that measure proximity to failure regions and permit data-driven construction of nested events inside the supermartingale bound.

If this is right

  • Handcrafted performance functions are no longer required to apply subset simulation to new AI domains.
  • Rare-event probabilities in high-dimensional models can be bounded with far smaller sample budgets while preserving validity.
  • The same ruler construction transfers across threat models after a single recalibration step.
  • A KL-based directional criterion provides a consistent way to select among candidate rulers without ground-truth probabilities.
  • Early stopping of the cascade does not invalidate the final upper envelope.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If embeddings from other modalities preserve failure geometry, SCARCE could extend to rare-event estimation in vision-language or multimodal systems.
  • The supermartingale envelope might be combined with importance sampling to further reduce variance at the lowest probability levels.
  • For adversarial fractions below 10 to the minus 4, additional ruler calibration or higher-dimensional embeddings may be needed to keep relative error below 5 percent.
  • Applying the method to safety benchmarks beyond jailbreaks, such as reward-model failures, would test whether the KL ranking of rulers generalises.

Load-bearing premise

The learned latent representations and chosen geometric rulers must accurately reflect proximity to failure regions so that the constructed nested events remain valid for the supermartingale bound.

What would settle it

A large-scale Monte Carlo reference on Llama-Guard-3-8B for eta equal to 0.001 that deviates from the reported 2.6 percent relative error by more than the bootstrap half-width of 27.9 percent would falsify the accuracy claim for the PCA ruler.

Figures

Figures reproduced from arXiv: 2606.29623 by Edmund Lau, Jie Meng, Taylor T Johnson, Xiaowei Huang, Yi Dong, Yingjie Wang.

Figure 1
Figure 1. Figure 1: Accuracy on 100 held-out seeds and seed-wise absolute error (log scale). SCARCE errors [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall ruler accuracy: (a) average accuracy; (b) per-turn accuracy; (c) per-fleet accuracy; [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Threshold percentile sweep in PC1 family. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Selector diagnostics against downstream ruler-family error. Worst-turn directional KL is [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Step 1 ruler pre-screening on 50 flipped MNIST seeds. Bad-anchored variants dominate the [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Step 1 efficiency budgets. The rulers selected by accuracy and reliability also avoid the [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Step 1 Pareto views. Bad-anchored geometric rulers occupy the low-error and moderate [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Step 2 MNIST comparison with accuracy and mismatch counts. SCARCE rulers keep [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Miscounts over 100 held-out seeds. Traditional SS attains zero false negatives at the cost of [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Seed-wise distribution of |pˆf −p SMC f | across 100 MNIST seeds for the top SCARCE rulers and Traditional SS. The SCARCE distributions are tightly concentrated near zero while Traditional SS shows a heavy upper tail driven by systematic false positives. Part II: LLM Jailbreak Sweeps 10 5 N (particles per run) 10 −6 SE of ̂ pf (a) Variance: SE drops monotonically turn 1 turn 2 turn 3 turn 4 turn 5 ∝ 1/√ N… view at source ↗
Figure 11
Figure 11. Figure 11: Population-size sweep for PC1/p75. Increasing N reduces Monte Carlo variation as expected, while the mean error plateaus once the sample budget is large enough for stable cascade calibration. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
read the original abstract

Rare events govern the safety profile of modern AI systems, yet their probabilities are extremely difficult to estimate: direct Monte Carlo requires prohibitive sample budgets. Subset Simulation (SS) addresses this by decomposing a rare-event probability into moderate conditional probabilities over nested intermediate events. However, classical SS requires a handcrafted scalar performance function whose sublevel sets define those events, demanding detailed knowledge of the failure geometry and limiting transfer to new domains. We propose SCARCE (Scalable Cascade Analysis for Rare-event Characterisation via Embeddings), which replaces the performance function with learned latent representations and geometric rulers that score proximity to failure regions. Adaptive thresholding constructs nested intermediate events directly from data. We formalise SCARCE through a non-negative supermartingale, yielding a high-probability upper envelope that remains valid under early stopping. On MNIST misclassification, where dense Monte Carlo provides ground truth, SCARCE achieves approximately 400--500 times lower mean absolute error than grid-searched traditional SS while eliminating systematic over-counting. We then study PAIR-style LLM jailbreaks under a fleet-level threat model with adversarial fraction $\eta$. On Llama-Guard-3-8B hidden states, a PCA-based ruler attains 2.6% mean relative error for $\eta \geq 10^{-3}$ against finite-sample references whose average bootstrap relative half-width is 27.9%, and transfers to a GCG-style corpus with 2.93% relative error after recalibration. A directional criterion $\mathrm{KL}(p_{\mathrm{good}}\,\|\,p_{\mathrm{bad}})$ ranks rulers consistently with estimation error (Spearman $\rho=0.83$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SCARCE, which replaces the handcrafted scalar performance function in classical Subset Simulation with learned embeddings and geometric rulers (e.g., PCA projections) to construct nested events A1 ⊃ A2 ⊃ ⋯ via adaptive thresholding. It formalizes the procedure via a non-negative supermartingale that supplies a high-probability upper envelope valid under early stopping, and reports empirical gains: ~400–500× lower MAE than grid-searched SS on MNIST misclassification (with elimination of over-counting) and 2.6% mean relative error on Llama-Guard-3-8B hidden states for jailbreak probability η ≥ 10^{-3}.

Significance. If the supermartingale bound remains valid under the adaptive, data-driven construction, the approach would remove a major practical barrier to rare-event estimation in AI safety by enabling transfer across domains without domain-specific performance functions. The reported error reductions relative to both dense MC ground truth and bootstrap references, together with the KL(p_good ‖ p_bad) ruler-ranking criterion (Spearman ρ=0.83), would constitute a concrete advance in scalable rare-event tools.

major comments (2)
  1. [Formalization via non-negative supermartingale] The central theoretical claim—that adaptive thresholding on learned rulers produces nested events whose conditional probabilities multiply to a valid supermartingale bound—requires an explicit argument showing that ruler selection and threshold choice from the same samples do not introduce data-dependent bias that violates the non-negative supermartingale property. The abstract asserts validity under early stopping, but the dependence structure induced by data-driven event definition is not addressed in the provided description and is load-bearing for the high-probability envelope.
  2. [MNIST misclassification results] On the MNIST experiments, the 400–500× MAE reduction and elimination of systematic over-counting are reported against grid-searched traditional SS; however, it is unclear whether the comparison equalizes total sample budget, whether the traditional SS performance function was chosen with knowledge of the failure geometry, and whether the dense MC reference is sufficiently precise to support the magnitude of the claimed improvement.
minor comments (2)
  1. [Ruler ranking] The precise definition of the directional criterion KL(p_good ‖ p_bad) and how the good/bad distributions are estimated from the embeddings should be stated explicitly.
  2. [Method overview] Notation for the nested events A_i and the adaptive thresholds should be introduced with a short diagram or pseudocode to clarify the construction before the supermartingale argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of SCARCE to advance rare-event estimation in AI safety applications. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses
  1. Referee: [Formalization via non-negative supermartingale] The central theoretical claim—that adaptive thresholding on learned rulers produces nested events whose conditional probabilities multiply to a valid supermartingale bound—requires an explicit argument showing that ruler selection and threshold choice from the same samples do not introduce data-dependent bias that violates the non-negative supermartingale property. The abstract asserts validity under early stopping, but the dependence structure induced by data-driven event definition is not addressed in the provided description and is load-bearing for the high-probability envelope.

    Authors: We agree that an explicit argument is required to confirm preservation of the supermartingale property under data-driven ruler selection and adaptive thresholding. The manuscript constructs the nested events sequentially from the learned embeddings, with the non-negative supermartingale defined via the product of conditional probabilities estimated on subsequent independent samples after threshold determination. This ordering ensures the increments remain martingale differences with respect to the filtration that includes the selection step. To make this rigorous and address the dependence structure directly, we will add a dedicated lemma and proof sketch in the revised theoretical section demonstrating that the adaptive construction does not introduce bias violating non-negativity or the supermartingale property, while retaining validity under early stopping. revision: yes

  2. Referee: [MNIST misclassification results] On the MNIST experiments, the 400–500× MAE reduction and elimination of systematic over-counting are reported against grid-searched traditional SS; however, it is unclear whether the comparison equalizes total sample budget, whether the traditional SS performance function was chosen with knowledge of the failure geometry, and whether the dense MC reference is sufficiently precise to support the magnitude of the claimed improvement.

    Authors: The MNIST comparison equalized total sample budget by matching the number of classifier evaluations (forward passes) across SCARCE and grid-searched traditional SS. The baseline performance function was a standard handcrafted scalar (pixel-space Euclidean distance to the decision boundary), selected from the general problem setup without access to the specific misclassification geometry. The dense Monte Carlo reference used a large sample size whose bootstrap-estimated precision is sufficient to support the reported MAE differences, as the variance is orders of magnitude smaller than the observed gap. We will revise the experimental section to state these details explicitly, including exact budgets and the baseline function definition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; supermartingale formalization remains independent of data-driven ruler construction

full rationale

The abstract states that SCARCE replaces the performance function with learned embeddings and geometric rulers, then 'formalise[s] SCARCE through a non-negative supermartingale, yielding a high-probability upper envelope that remains valid under early stopping.' No equations or steps are shown that define the supermartingale in terms of the fitted rulers or that rename a fitted quantity as a prediction. The MNIST evaluation uses independent dense Monte Carlo as ground truth, providing an external benchmark. No self-citation chains, ansatz smuggling, or self-definitional reductions appear in the given text. The derivation is therefore treated as self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that embeddings preserve failure geometry and that adaptive thresholds produce valid nested events; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • ruler choice and adaptive thresholds
    Geometric rulers (PCA, KL criterion) and thresholds are selected or calibrated from data.
axioms (2)
  • domain assumption Latent embeddings allow geometric rulers to score proximity to failure regions
    This replaces the classical scalar performance function.
  • standard math Non-negative supermartingale property yields valid high-probability upper envelope under early stopping
    Invoked to guarantee the bound remains valid.

pith-pipeline@v0.9.1-grok · 5852 in / 1332 out tokens · 39646 ms · 2026-06-30T07:00:04.027258+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, 2024. 9

  2. [2]

    Efficient rare event sampling with unsupervised normalizing flows.Nature Machine Intelligence, 6:1370–1381, 2024

    Solomon Asghar, Qing-Xiang Pei, Giorgio V olpe, and Ran Ni. Efficient rare event sampling with unsupervised normalizing flows.Nature Machine Intelligence, 6:1370–1381, 2024

  3. [3]

    Siu-Kui Au and James L. Beck. Estimation of small failure probabilities in high dimensions by subset simulation.Probabilistic Engineering Mechanics, 16(4):263–277, 2001

  4. [4]

    Siu-Kui Au and James L. Beck. Subset simulation and its application to seismic risk based on dynamic analysis.Journal of Engineering Mechanics, 129(8):901–917, 2003

  5. [5]

    Zdravko I. Botev. Minimax tilting for importance sampling.Annals of Statistics, 45(2):468–499, 2017

  6. [6]

    Unbiasedness of some generalized adaptive multilevel splitting algorithms.Annals of Applied Probability, 26(6):3559–3601, 2016

    Charles-Edouard Bréhier, Maxime Gazeau, Ludovic Goudenège, Tony Lelièvre, and Mathias Rousset. Unbiasedness of some generalized adaptive multilevel splitting algorithms.Annals of Applied Probability, 26(6):3559–3601, 2016

  7. [7]

    Bucklew.Introduction to Rare Event Simulation

    James A. Bucklew.Introduction to Rare Event Simulation. Springer, 2004

  8. [8]

    Adaptive multilevel splitting for rare event analysis

    Frédéric Cérou and Arnaud Guyader. Adaptive multilevel splitting for rare event analysis. Stochastic Analysis and Applications, 25(2):417–443, 2007

  9. [9]

    Adaptive multilevel splitting: Historical perspective and recent results.Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(4):043108, 2019

    Frédéric Cérou, Arnaud Guyader, and Mathias Rousset. Adaptive multilevel splitting: Historical perspective and recent results.Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(4):043108, 2019

  10. [10]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. InAdvances in Neural Information Processing Systems, Datasets and...

  11. [11]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023

  12. [12]

    Reliability analysis of complex sys- tems using subset simulations with hamiltonian neural networks.Structural Safety, 110:102479, 2024

    Miaochuan Chen, Dimitrios Giovanis, and Michael Shields. Reliability analysis of complex sys- tems using subset simulations with hamiltonian neural networks.Structural Safety, 110:102479, 2024

  13. [13]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 1597–1607. PMLR, 2020

  14. [14]

    Jail- breakRadar: Comprehensive assessment of jailbreak attacks against LLMs.arXiv preprint arXiv:2402.05668, 2024

    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Jail- breakRadar: Comprehensive assessment of jailbreak attacks against LLMs.arXiv preprint arXiv:2402.05668, 2024

  15. [15]

    Boundary point jailbreaking of black-box llms, 2026

    Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, and Yarin Gal. Boundary point jailbreaking of black-box llms, 2026

  16. [16]

    Zhengqi Gao, Dinghuai Zhang, Luca Daniel, and Duane S. Boning. NOFIS: Normalizing flow for rare circuit failure analysis. InProceedings of the 61st ACM/IEEE Design Automa- tion Conference (DAC), DAC ’24, New York, NY , USA, 2024. Association for Computing Machinery

  17. [17]

    Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon

    Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform chernoff bounds via nonnegative supermartingales.Annals of Statistics, 49(2):1055–1080, 2021

  18. [18]

    SAFARI: Versatile and efficient evaluations for robustness of interpretability

    Wei Huang, Xingyu Zhao, Gaojie Jin, and Xiaowei Huang. SAFARI: Versatile and efficient evaluations for robustness of interpretability. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 10

  19. [19]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: LLM-based input-output safeguard for human-AI conversations.arXiv preprint arXiv:2312.06674, 2023

  20. [20]

    Forecasting rare language model behaviors

    Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, and Mrinank Sharma. Forecasting rare language model behaviors. arXiv preprint arXiv:2502.16797, 2025

  21. [21]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, 2020

  22. [22]

    A simple unified framework for detecting out-of-distribution samples and adversarial attacks

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 31, 2018

  23. [23]

    Computing committor functions for the study of rare events using deep learning.Journal of Chemical Physics, 2024

    Qianxiao Li, Bo Lin, and Weiqing Ren. Computing committor functions for the study of rare events using deep learning.Journal of Chemical Physics, 2024

  24. [24]

    Deep learning for rare event estimation.Neural Computing and Applications, 2021

    Yifan Li et al. Deep learning for rare event estimation.Neural Computing and Applications, 2021

  25. [25]

    The Llama 3 Herd of Models

    Llama Team, AI @ Meta. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  26. [26]

    Martell, Jessica A

    Max J. Martell, Jessica A. Baweja, and Brandon D. Dreslin. Mitigative strategies for recovering from large language model trust violations.Journal of Cognitive Engineering and Decision Making, 19(1), 2024

  27. [27]

    HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. InInternational Conference on Machine Learning (ICML), 2024

  28. [28]

    Tree of attacks: Jailbreaking black-box LLMs with auto-generated subversions

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs with auto-generated subversions. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  29. [29]

    MCMC algorithms for subset simulation.Probabilistic Engineering Mechanics, 41:89–103, 2015

    Iason Papaioannou, Wolfgang Betz, Karl Zwirglmaier, and Daniel Straub. MCMC algorithms for subset simulation.Probabilistic Engineering Mechanics, 41:89–103, 2015

  30. [30]

    An adaptive subset simulation algorithm for system reliability analysis with discontinuous limit states.Reliability Engineering & System Safety, 225:108607, 2022

    Iason Papaioannou, Max Ehre, and Daniel Straub. An adaptive subset simulation algorithm for system reliability analysis with discontinuous limit states.Reliability Engineering & System Safety, 225:108607, 2022

  31. [31]

    Game-theoretic statistics and safe anytime- valid inference.arXiv preprint arXiv:2006.04292, 2020

    Aaditya Ramdas, Johannes Ruf, and Martin Larsson. Game-theoretic statistics and safe anytime- valid inference.arXiv preprint arXiv:2006.04292, 2020

  32. [32]

    Rubinstein

    Reuven Y . Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999

  33. [33]

    Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib A

    Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib A. Siddiqui, Alexan- der Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. InProceedings of the 35th International Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 4393–4402. PMLR, 2018

  34. [34]

    Subset simulation for structural reliability analysis.Structural Safety, 31(2):133–141, 2009

    Jian Song and Zhishen Lu. Subset simulation for structural reliability analysis.Structural Safety, 31(2):133–141, 2009

  35. [35]

    Cambridge Mathematical Textbooks

    David Williams.Probability with Martingales. Cambridge Mathematical Textbooks. Cambridge University Press, 1991

  36. [36]

    Estimating the probabilities of rare outputs in language mod- els

    Gabriel Wu and Jacob Hilton. Estimating the probabilities of rare outputs in language mod- els. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025. arXiv:2410.13211. 11

  37. [37]

    Probabilistic modeling of jailbreak on multimodal LLMs: From quantification to application.arXiv preprint arXiv:2503.06989, 2025

    Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Zonghao Ying, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, and Quanchen Zou. Probabilistic modeling of jailbreak on multimodal LLMs: From quantification to application.arXiv preprint arXiv:2503.06989, 2025

  38. [38]

    Surrogate-assisted subset simulation for reliability analysis.Reliability Engineering & System Safety, 183:10–19, 2019

    Xiaoyu Zhang et al. Surrogate-assisted subset simulation for reliability analysis.Reliability Engineering & System Safety, 183:10–19, 2019

  39. [39]

    AutoRedTeamer: Autonomous red teaming with lifelong attack integration

    Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Oluwasanmi Koyejo, James Zou, and Bo Li. AutoRedTeamer: Autonomous red teaming with lifelong attack integration. InAdvances in Neural Information Processing Systems (NeurIPS),

  40. [40]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

  41. [41]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  42. [42]

    Kirill M. Zuev. Subset simulation method for rare event estimation: An introduction.Interna- tional Journal for Uncertainty Quantification, 5(2), 2015. A Subset Simulation This appendix expands the brief overview of classical Subset Simulation (SS) given in §3. SS estimates an extreme failure probability Pf =P(g(x)≥γ F ) by decomposing the failure event i...