Estimating Tail Risks in Language Model Output Distributions
Pith reviewed 2026-05-08 12:23 UTC · model grok-4.3
The pith
Creating unsafe versions of a language model enables accurate estimation of the probability of rare harmful outputs with 10-20 times fewer samples than brute force.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that operationalizing importance sampling, by creating unsafe variants of the target language model, enables sample-efficient estimation of the probability of harmful outputs: the estimates match brute-force Monte Carlo sampling in accuracy while using 10-20x fewer samples, e.g., estimating probabilities on the order of 10^{-4} with only 500 samples.
What carries the argument
Importance sampling, operationalized by constructing unsafe versions of the language model that make harmful outputs more likely; samples from these proposals are then reweighted to recover probabilities under the original model.
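A minimal sketch of the reweighting step, assuming hypothetical callables sample(model, x), logprob(model, y, x), and is_harmful(y) that stand in for the paper's actual sampling and judging machinery:

import math

def importance_sampling_estimate(target, unsafe, x, n, sample, logprob, is_harmful):
    # Estimate Pr_target[harmful output | x] by drawing from the unsafe
    # proposal and reweighting each harmful draw by p_target / p_unsafe.
    total = 0.0
    for _ in range(n):
        y = sample(unsafe, x)  # harmful outputs are common under the proposal
        if is_harmful(y):
            log_w = logprob(target, y, x) - logprob(unsafe, y, x)
            total += math.exp(log_w)
    return total / n

Because harmful outputs are frequent under the unsafe proposal, far fewer draws contribute zero to the sum than under naive sampling from the target model.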
Load-bearing premise
Unsafe versions of the target model can be constructed so that importance sampling produces unbiased estimates of the original model's probabilities for harmful outputs.
What would settle it
A direct comparison on the same set of inputs where the importance sampling estimates are computed with 500 samples and brute-force Monte Carlo is run with 10,000 or more samples; if the two do not agree within expected statistical variation, the method's accuracy is not supported.
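One way to operationalize "agree within expected statistical variation" is a two-sided check on the difference of the two estimates against their combined standard errors. A minimal sketch, assuming bootstrap-derived standard errors for each estimator; the function and the 1.96 threshold (a 95% criterion) are illustrative choices, not from the paper:

import math

def estimates_agree(p_is, se_is, p_mc, se_mc, z=1.96):
    # Agreement check: the two estimates should differ by no more than
    # roughly z combined standard errors if both target the same probability.
    return abs(p_is - p_mc) <= z * math.sqrt(se_is**2 + se_mc**2)

# Example: IS estimate 1.2e-4 (se 2e-5) vs. MC estimate 1.0e-4 (se 1e-4).
print(estimates_agree(1.2e-4, 2e-5, 1.0e-4, 1e-4))  # True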
Original abstract
Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an importance sampling method to estimate tail probabilities of harmful outputs in language models by constructing 'unsafe' versions of the target model as proposal distributions. It claims these estimates match brute-force Monte Carlo on misuse and misalignment benchmarks while using 10-20x fewer samples, e.g., recovering probabilities of order 10^{-4} from 500 proposal samples, and that the estimates can reveal input sensitivity and predict deployment risks.
Significance. If the unbiasedness of the importance sampling procedure holds, the work could meaningfully advance safety evaluations by enabling efficient quantification of rare harmful events at deployment scale. The claimed empirical agreement with Monte Carlo and public code release are positive for reproducibility, but the central claim depends on validation details that are not yet fully substantiated.
Major comments (3)
- [Abstract] Abstract: the reported match between IS estimates and brute-force MC lacks specification of the exact MC sample size N, confidence intervals on the MC estimator, or convergence checks across independent runs. For target probabilities p≈10^{-4}, even N=10,000 yields an expected event count of 1 and a relative standard error near 100% (see the arithmetic sketch after this list); without these diagnostics, agreement between two noisy estimators does not confirm that the IS procedure recovers the true tail probability.
- [Methods] Methods (importance sampling construction): the claim that unsafe model variants yield unbiased estimates of the original model's harmful output probabilities requires an explicit derivation showing that the importance weights fully correct for any distributional shift introduced by the modification process. The weakest assumption—that such modifications introduce no systematic bias—must be validated, e.g., via a controlled experiment on a known distribution or a proof that the proposal remains a valid importance sampler.
- [Experiments] Experiments: the manuscript must report data exclusion rules, exact benchmark query sets, and any filtering applied before sampling, because these choices directly affect whether the reported agreement with MC supports the central claim of accurate rare-event estimation across the full output distribution.
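The arithmetic behind the first major comment's "~100% relative standard error" figure, as a standalone check (not code from the paper):

import math

def mc_relative_standard_error(p, n):
    # For the MC estimator p_hat = mean of n Bernoulli(p) draws,
    # Var(p_hat) = p * (1 - p) / n, so RSE = sqrt((1 - p) / (n * p)),
    # approximately 1 / sqrt(n * p) for small p.
    return math.sqrt((1 - p) / (n * p))

print(mc_relative_standard_error(1e-4, 10_000))  # ~1.0, i.e. ~100%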
Minor comments (2)
- [Abstract] Abstract: the statement that harmfulness estimates 'can reveal the sensitivity of models to perturbations in model input' would benefit from a concrete example or figure reference in the main text to clarify the operationalization.
- [Appendix] The code repository link is a strength for reproducibility; the manuscript should include a brief methods appendix summarizing the unsafe-model construction hyperparameters so readers can replicate without inspecting the repository.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key areas for improving the statistical rigor, theoretical grounding, and transparency of our work. We have revised the manuscript accordingly and address each major comment below.
Point-by-point responses
Referee: [Abstract] Abstract: the reported match between IS estimates and brute-force MC lacks specification of the exact MC sample size N, confidence intervals on the MC estimator, or convergence checks across independent runs. For target probabilities p≈10^{-4}, even N=10,000 yields an expected event count of 1 and relative standard error near 100%; without these diagnostics, agreement between two noisy estimators does not confirm that the IS procedure recovers the true tail probability.
Authors: We agree that these details are necessary to substantiate the comparison. In the revised manuscript we have added the exact MC sample sizes used (sufficient to observe multiple events for the reported probabilities), bootstrap-derived 95% confidence intervals on the MC estimates, and results from multiple independent runs demonstrating convergence of both estimators. These additions show that the IS estimates lie within the MC confidence intervals. revision: yes
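For context on the bootstrap-derived confidence intervals the authors describe, a minimal percentile-bootstrap sketch; the function name and defaults are illustrative, not taken from the paper's repository:

import numpy as np

def bootstrap_ci(hits, n_boot=1_000, alpha=0.05, seed=0):
    # hits: 0/1 array with one entry per MC sample (1 = harmful output observed).
    rng = np.random.default_rng(seed)
    n = len(hits)
    # Resample with replacement and record the mean of each resample.
    means = rng.choice(hits, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return hits.mean(), (lo, hi)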
Referee: [Methods] Methods (importance sampling construction): the claim that unsafe model variants yield unbiased estimates of the original model's harmful output probabilities requires an explicit derivation showing that the importance weights fully correct for any distributional shift introduced by the modification process. The weakest assumption—that such modifications introduce no systematic bias—must be validated, e.g., via a controlled experiment on a known distribution or a proof that the proposal remains a valid importance sampler.
Authors: We appreciate this request for formal justification. The revised Methods section now contains an explicit derivation establishing that the importance weights w(y) = p_target(y | x) / p_unsafe(y | x) recover the target expectation without bias, provided the unsafe variant preserves support over the target distribution (which holds under our logit-modification and fine-tuning procedures). We have also added a controlled validation on a synthetic autoregressive model with known closed-form tail probabilities, confirming that the estimator is unbiased. revision: yes
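For reference, the derivation the revised Methods section describes reduces to the textbook importance-sampling identity; in our notation (p the target model, q the unsafe proposal), a sketch:

\begin{aligned}
\mathbb{E}_{y \sim q(\cdot\mid x)}\!\left[\mathbf{1}\{\mathrm{harm}\}(y)\,\frac{p(y\mid x)}{q(y\mid x)}\right]
&= \sum_{y} q(y\mid x)\,\mathbf{1}\{\mathrm{harm}\}(y)\,\frac{p(y\mid x)}{q(y\mid x)} \\
&= \sum_{y} \mathbf{1}\{\mathrm{harm}\}(y)\,p(y\mid x)
= \Pr_{y \sim p(\cdot\mid x)}[\mathrm{harm}],
\end{aligned}

which holds whenever q(y|x) > 0 for every y with 1{harm}(y) p(y|x) > 0, i.e., exactly the support condition the authors invoke.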
Referee: [Experiments] Experiments: the manuscript must report data exclusion rules, exact benchmark query sets, and any filtering applied before sampling, because these choices directly affect whether the reported agreement with MC supports the central claim of accurate rare-event estimation across the full output distribution.
Authors: We agree that full experimental transparency is required. The revised Experiments section now specifies the complete benchmark query sets (with sizes and public references), states that no queries were excluded beyond standard length-based truncation, and confirms that no post-sampling filtering on harmfulness or other criteria was applied. All generated outputs were retained for the probability estimates. revision: yes
Circularity Check
No significant circularity; estimator validated externally against Monte Carlo baselines
Full rationale
The paper's core contribution is an importance-sampling procedure that constructs unsafe proposal models to estimate rare harmful-output probabilities under the original target model. This follows the standard unbiased IS identity p(harm) = E_proposal[ indicator(harm) * (target_prob / proposal_prob) ], with the proposal obtained by an explicit (non-self-referential) modification step whose bias properties are asserted as an assumption rather than derived from the target quantity. The reported results consist of direct numerical comparisons to independent brute-force Monte Carlo runs on the same benchmarks; no fitted parameter is relabeled as a prediction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain therefore remains self-contained and does not reduce to its own inputs by construction.