Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Cassidy Laidlaw; Dylan Feng; Pragya Srivastava

arxiv: 2605.21602 · v1 · pith:QPCPLFWWnew · submitted 2026-05-20 · 💻 cs.AI · cs.SE

Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3

keywords: out-of-distribution detection, LLM alignment, safety monitoring, guard models, Mahalanobis distance, perplexity, MOOD benchmark

The pith

Combining guard models with Mahalanobis distance and perplexity OOD detectors improves recall of out-of-distribution LLM alignment failures from 39% to 45%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the MOOD benchmark to study whether monitoring systems can spot alignment failures that occur in situations the models were not trained on. Guard models trained on limited safety data tend to miss these failures when the inputs differ from the training examples. Adding out-of-distribution detectors helps catch more of them. The authors demonstrate that this hybrid approach scales positively and outperforms simply using a much larger guard model.

Core claim

Guard models often fail to generalize to out-of-distribution alignment failures, but combining them with Mahalanobis distance and perplexity-based OOD detectors raises recall from 39% to 45%. This hybrid method shows positive scaling across model sizes and achieves higher recall gains than a guard model with 20 times more parameters. The MOOD benchmark supports these findings by using a restricted training set for monitors and seven test sets with alignment failures outside that distribution.

What carries the argument

The hybrid monitor combining a guard model (safety classifier) with Mahalanobis distance and perplexity OOD detectors, evaluated on the MOOD benchmark.
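
As a rough illustration of how such a hybrid monitor could be wired together, the sketch below fits a Gaussian to training activations, scores new activations by Mahalanobis distance, and flags a conversation when either the guard score or the OOD score crosses its threshold. The function names, the ridge term, the OR rule, and the thresholds are all assumptions made for illustration; this page does not state how the paper combines the two signals.

```python
import numpy as np

def fit_gaussian(train_acts, ridge=1e-3):
    """Estimate the mean and a regularized inverse covariance of training activations."""
    mean = train_acts.mean(axis=0)
    centered = train_acts - mean
    cov = centered.T @ centered / (len(train_acts) - 1)
    cov += ridge * np.eye(cov.shape[0])  # assumed ridge term, keeps cov invertible
    return mean, np.linalg.inv(cov)

def mahalanobis(acts, mean, cov_inv):
    """Mahalanobis distance of each row of `acts` from the fitted Gaussian."""
    diff = acts - mean
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def combined_flag(guard_scores, ood_scores, guard_thresh, ood_thresh):
    """Assumed OR rule: flag if the guard fires or the input looks OOD."""
    return (guard_scores >= guard_thresh) | (ood_scores >= ood_thresh)
```

Under this assumed rule, a conversation that the guard scores as safe can still be flagged when its activations sit far from the training distribution, which is the failure case the combination is meant to cover.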

If this is right

Monitoring pipelines for LLMs should include OOD detection to handle unforeseen alignment failures.
Combined monitors benefit from scaling up model size more than guard models alone.
The recall gains from OOD detection exceed those from increasing guard model parameters by a factor of 20.
Further development of OOD detectors could lead to more robust LLM safety systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may achieve better safety by focusing on detecting shifts in input patterns instead of training ever-larger safety classifiers.
This work implies that many alignment issues arise from distributional novelty rather than inherent model weaknesses.
Real-world deployments could use these monitors to flag unusual prompts for human review or model fallback.

Load-bearing premise

The seven test sets contain alignment failures that lie outside the distribution of the restricted training set used to train the monitors.

What would settle it

Two checks would settle it: whether adding the OOD detectors still improves recall when the test failures are drawn from the same distribution as the training data, and whether the improvement reappears on additional OOD test sets.

Figures

Figures reproduced from arXiv: 2605.21602 by Cassidy Laidlaw, Dylan Feng, Pragya Srivastava.

**Figure 1.** We systematically study incorporating out-of-distribution (OOD) detectors into LLM safety monitoring to catch alignment failures outside the training distribution. LLMs are often deployed with a guard model (right) trained with safety training data (left). However, if a prompt or response is outside of the training distribution, the guard model may generalize incorrectly and fail to flag safety issues. Add…

**Figure 2.** We introduce Misalignment Out Of Distribution (MOOD), a benchmark which tests LLM monitors for their ability to recognize unforeseen LLM alignment failures. MOOD includes seven test sets containing conversations with distinct alignment failures. To ensure that these test sets are truly out-of-distribution, we train our own guard models and OOD detectors on a restricted post-training dataset that we careful…

**Figure 3.** To better understand the Mahalanobis OOD detector, we apply PCA to the activations of the Qwen2.5-32B guard model on which we compute the Mahalanobis distance. We plot the resulting principal components of 200 conversations from each test dataset above. For each dataset, we also show the relative change in misalignment recall for the combined guard + Mahalanobis model compared to using the guard model alo…

**Figure 5.** The improvement in OOD misalignment recall when training guard models additionally on some of the MOOD test sets. We display both the increase in recall relative to the baseline Gemma 2 9B guard model as well as the absolute recall in parentheses. The first seven rows each correspond to adding a single test dataset to the training data. The “union” row measures the recall on each test dataset when taking …

**Figure 6.** The average misalignment recall of six methods across three models from the Gemma 2 family with 2, 9, and 27 billion parameters. Methods improve significantly from the 2B to the 9B model, but the misalignment recall drops from the 9B to the 27B model. We hypothesize this may be because the 27B model is suboptimally trained; we use the same hyperparameters across all model sizes, and 27B might require diffe…

**Figure 7.** Per-token perplexity results on different test samples. Tokens highlighted with brighter colors have higher perplexity. The conversation on the left is from the sycophantic test set and the conversation on the right is from the function calling deception (missing tools) test set. Many of the sycophantic tokens are flagged as high-perplexity in the sycophantic conversation, while very few of the tokens are …

**Figure 8.** The distributions of the numbers of tokens and Flesch-Kincaid grade levels (Kincaid et al., 1975) of conversations in each MOOD test set. The significant overlap between test set and train set distributions means that it is not trivial to detect OOD conversations based on surface level features. The majority of samples in our test datasets are cleanly classifiable with respect to the training dataset using…

Original abstract

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
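
For readers unfamiliar with the perplexity-based detector named in the abstract, a minimal sketch of the standard perplexity definition follows: the exponential of the mean negative log-likelihood per token. The function name and the input format (a list of natural-log token probabilities from some language model) are assumptions; how the paper aggregates or thresholds perplexity is not specified on this page.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence given per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```

A conversation whose tokens the reference model finds surprising gets a high perplexity, so one plausible use is to treat values above a threshold chosen on the training set as a sign that the input is OOD.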

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Guard models plus basic OOD detectors improve recall on alignment failures in this new benchmark, though confirming the tests are truly OOD would strengthen the case.

The main thing to know is that this paper builds a benchmark called MOOD to test monitors on out-of-distribution alignment failures in LLMs, and finds that adding Mahalanobis and perplexity OOD detectors to a guard model raises recall from 39% to 45%, outperforming a much larger guard model.

They do a few things right. The setup with a restricted training set and seven separate test sets for different failure modes gives a clean way to measure generalization. They run comparisons across four detector types and track how performance scales with model size. The positive scaling for the combined monitors is a useful data point, and the claim that OOD detection helps more than just scaling parameters has practical implications for safety work.

The weaker part is the lack of direct evidence that the test sets are truly out of distribution from the training data. The abstract describes the construction but does not include any quantitative checks like distance metrics or distribution comparisons. That leaves open the possibility that the recall gain comes from general differences rather than the OOD-specific handling the authors intend. Methods details are also thin in the summary, so it's hard to judge if the splits or hyperparameters were tuned in ways that affect the results.

This paper is aimed at researchers building monitoring systems for deployed LLMs. Anyone thinking about how to catch unexpected failures would find the benchmark and the detector comparisons worth looking at. It is worth sending to peer review because the core idea addresses a real gap in current guard models, and the empirical results are concrete enough to spark discussion even if some validation steps need more work.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MOOD benchmark for evaluating monitors on out-of-distribution (OOD) alignment failures in LLMs. It uses a restricted training set to train monitors and seven test sets containing diverse alignment failures asserted to lie outside that distribution. The central empirical finding is that guard models (safety classifiers) generalize poorly OOD, but combining a guard model with Mahalanobis-distance and perplexity-based OOD detectors raises recall from 39% to 45%. The work also reports positive scaling trends for combined monitors across model sizes and claims that adding OOD detection yields larger recall gains than scaling the guard model by a factor of 20.

Significance. If the OOD status of the test sets and the reported recall gains are robustly established, the paper supplies a concrete benchmark and practical evidence that OOD detection is a high-leverage addition to LLM monitoring pipelines. The scaling results and the comparison against larger guard models are directly actionable for safety engineering.

major comments (1)

[§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.
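
One of the statistics the referee names, maximum mean discrepancy (MMD), can be sketched as follows. This is a generic biased estimator of squared MMD with an RBF kernel, applied to two sets of embeddings; the function names, the kernel choice, and the bandwidth are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix between the rows of `a` and the rows of `b`."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples `x` and `y`."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2 * rbf_kernel(x, y, bandwidth).mean())
```

Computed between training-set embeddings and each test set's embeddings, a value well above what two random halves of the training set produce would be one piece of evidence for a distributional shift.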

minor comments (2)

[§4.1] The abstract and experimental sections should explicitly state the precise definitions and hyper-parameter choices for the four OOD detectors tested, including any post-hoc tuning that could affect the 39%-to-45% comparison.
[Table 2 and Figure 3] Figure captions and tables reporting recall should include error bars or statistical significance tests for the scaling trends across model sizes.
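
The error bars requested in the second minor comment could, for instance, come from a percentile bootstrap over the misaligned test conversations. The sketch below is one generic way to do this; the function name, the number of resamples, and the 95% level are assumptions rather than anything the paper reports.

```python
import numpy as np

def bootstrap_recall_ci(flags, labels, n_boot=2000, alpha=0.05, seed=0):
    """Recall with a percentile-bootstrap confidence interval.

    flags:  1 where the monitor flagged the conversation, else 0
    labels: 1 where the conversation is truly misaligned, else 0
    """
    rng = np.random.default_rng(seed)
    pos_flags = np.asarray(flags)[np.asarray(labels) == 1].astype(float)
    # Resample the positive examples with replacement, n_boot times.
    idx = rng.integers(0, len(pos_flags), size=(n_boot, len(pos_flags)))
    boot_recalls = pos_flags[idx].mean(axis=1)
    low, high = np.quantile(boot_recalls, [alpha / 2, 1 - alpha / 2])
    return pos_flags.mean(), low, high
```

Reporting the interval next to each recall figure would make it easier to judge whether differences across model sizes exceed resampling noise.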

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed report. We address the major comment below and will incorporate the suggested verification to strengthen the interpretation of the MOOD benchmark results.

Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.

Authors: We agree that explicit quantitative verification of the distributional shift would strengthen the central claim. The MOOD benchmark defines the test sets by selecting diverse alignment failures (e.g., novel jailbreak styles, unusual response patterns, and failure modes) that are excluded from the restricted training set by construction; this restricted set is a curated subset of safety data used to train the monitors. Nevertheless, we acknowledge that reporting statistics such as mean Mahalanobis distances on model embeddings, perplexity histograms, or maximum mean discrepancy would provide more rigorous evidence that the performance gains arise specifically from OOD detection rather than incidental distributional differences. We will add these analyses to §3 in the revised manuscript, including comparisons for each of the seven test sets, and will reference them when interpreting the recall improvements in §4. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with measured recall on held-out sets

The paper constructs the MOOD benchmark with a restricted training set used to train monitors and seven test sets asserted to contain alignment failures outside that distribution. Reported results consist of directly measured recall improvements (39% to 45%) and scaling trends on these held-out test sets rather than any derivation, fitted parameter, or self-referential definition that reduces the central claim to its inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the evaluation is falsifiable via standard held-out performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no new mathematical axioms, free parameters fitted to the target result, or invented entities; the central claims rest on the assumption that the constructed test sets are OOD relative to the restricted training distribution.

pith-pipeline@v0.9.0 · 5778 in / 1164 out tokens · 29907 ms · 2026-05-22T09:34:20.324048+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD)... combining guard models with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 27 internal anchors

[1]

Does safety training of llms generalize to semantically related natural prompts?, 2025

Addepalli, S., Varun, Y., Suggala, A., Shanmugam, K., and Jain, P. Does safety training of llms generalize to semantically related natural prompts?, 2025. URL https://arxiv.org/abs/2412.03235

work page arXiv 2025
[2]

System card: Claude opus 4.5

Anthropic. System card: Claude opus 4.5. Technical report, November 2025

work page 2025
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs , May 2025

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., and Evans, O. Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs , May 2025. URL http://arxiv.org/abs/2502.17424. arXiv:2502.17424 [cs]

work page arXiv 2025
[5]

Envisioning outlier exposure by large language models for out-of-distribution detection, 2024

Cao, C., Zhong, Z., Zhou, Z., Liu, Y., Liu, T., and Han, B. Envisioning outlier exposure by large language models for out-of-distribution detection, 2024. URL https://arxiv.org/abs/2406.00806

work page arXiv 2024
[6]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., Hassani, H., and Wong, E. JailbreakBench : An Open Robustness Benchmark for Jailbreaking Large Language Models , October 2024. URL http://arxiv.org/abs/2404.01318. arXiv:2404.01318 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Investigating truthfulness in a pre-release o3 model, April 2025

Chowdhury, N., Johnson, D., Huang, V., Steinhardt, J., and Schwettmann, S. Investigating truthfulness in a pre-release o3 model, April 2025. URL https://transluce.org/investigating-o3-truthfulness

work page 2025
[10]

Reward Model Ensembles Help Mitigate Overoptimization , March 2024

Coste, T., Anwar, U., Kirk, R., and Krueger, D. Reward Model Ensembles Help Mitigate Overoptimization , March 2024. URL http://arxiv.org/abs/2310.02743. arXiv:2310.02743 [cs]

work page arXiv 2024
[11]

J., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., and Berant, J

Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D. J., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., and Berant, J. Helping or Herding ? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking , August 2024. URL http://arxiv.org/abs/2312.09244. arXiv:2312.09244 [cs]

work page arXiv 2024
[12]

Exploring the Limits of Out -of- Distribution Detection

Fort, S., Ren, J., and Lakshminarayanan, B. Exploring the Limits of Out -of- Distribution Detection . In Advances in Neural Information Processing Systems , volume 34, pp.\ 7068--7081. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/3941c4358616274ac2436eacf67fae05-Abstract.html

work page 2021
[13]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-J...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Alignment faking in large language models

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., and Hubinger, E. Alignment faking in large language models, 2024. URL https://arxiv.org/abs/2412.14093

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Hendrycks, D. and Gimpel, K. A Baseline for Detecting Misclassified and Out -of- Distribution Examples in Neural Networks , October 2018. URL http://arxiv.org/abs/1610.02136. arXiv:1610.02136 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

AI Induced Psychosis : A shallow investigation

Hua, T. AI Induced Psychosis : A shallow investigation. August 2025. URL https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced-psychosis-a-shallow-investigation

work page 2025
[17]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

R., Marks, S., Leike, J., Askell, A., Olah, C., Hubinger, E., and Price, S

Kutasov, J., Jermyn, A., Steen, J., Le, M., Bowman, S. R., Marks, S., Leike, J., Askell, A., Olah, C., Hubinger, E., and Price, S. Teaching Claude Why , May 2026. URL https://alignment.anthropic.com/2026/teaching-claude-why/

work page 2026
[21]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. T \"u lu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

A Simple Unified Framework for Detecting Out -of- Distribution Samples and Adversarial Attacks

Lee, K., Lee, K., Lee, H., and Shin, J. A Simple Unified Framework for Detecting Out -of- Distribution Samples and Adversarial Attacks . In Advances in Neural Information Processing Systems , volume 31. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper_files/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html

work page 2018
[23]

Learning to Detect Unseen Jailbreak Attacks in Large Vision - Language Models , January 2026

Liang, S., Xu, Z., Weng, J., Tao, J., Xue, H., and Wang, X. Learning to Detect Unseen Jailbreak Attacks in Large Vision - Language Models , January 2026. URL http://arxiv.org/abs/2508.09201. arXiv:2508.09201 [cs]

work page arXiv 2026
[24]

K., Ritchie, S

Lynch, A., Wright, B., Larson, C., Troy, K. K., Ritchie, S. J., Mindermann, S., Perez, E., and Hubinger, E. Agentic Misalignment : How LLMs Could be an Insider Threat . Anthropic Research, 2025

work page 2025
[25]

Mahalanobis, P. C. On the generalized distance in statistics. The National Institute of Sciences of India, 2 0 (1): 0 49--55, 1936

work page 1936
[26]

Frontier Models are Capable of In-context Scheming

Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., and Hobbhahn, M. Frontier models are capable of in-context scheming, 2025. URL https://arxiv.org/abs/2412.04984

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Jaildam: Jailbreak detection with adaptive memory for vision-language model, 2025

Nian, Y., Zhu, S., Qin, Y., Li, L., Wang, Z., Xiao, C., and Zhao, Y. Jaildam: Jailbreak detection with adaptive memory for vision-language model, 2025. URL https://arxiv.org/abs/2504.03770

work page arXiv 2025
[28]

Technical report: Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b

OpenAI. Technical report: Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. Technical report, OpenAI, October 2025 a . URL https://cdn.openai.com/pdf/08b7dee4-8bc6-4955-a219-7793fb69090c/Technical_report__Research_Preview_of_gpt_oss_safeguard.pdf

work page 2025
[29]

GPT -5 System Card

OpenAI. GPT -5 System Card . Technical report, August 2025 b

work page 2025
[30]

Sycophancy in GPT -4o: What happened and what we’re doing about it, April 2025 c

OpenAI. Sycophancy in GPT -4o: What happened and what we’re doing about it, April 2025 c . URL https://openai.com/index/sycophancy-in-gpt-4o/

work page 2025
[31]

Revisiting mahalanobis distance for transformer-based out-of-domain detection, 2022

Podolskiy, A., Lipin, D., Bout, A., Artemova, E., and Piontkovskaya, I. Revisiting mahalanobis distance for transformer-based out-of-domain detection, 2022. URL https://arxiv.org/abs/2101.03778

work page arXiv 2022
[33]

A Conversation With Bing ’s Chatbot Left Me Deeply Unsettled

Roose, K. A Conversation With Bing ’s Chatbot Left Me Deeply Unsettled . The New York Times, February 2023. ISSN 0362-4331. URL https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html

work page 2023
[34]

Towards Understanding Sycophancy in Language Models

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., and Perez, E. Towards Understanding Sycophancy in Language Models , May 2025 a . URL http://arxiv.org/abs/2310.13548. arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Goodfriend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., Askell, A., Bailey, N., Benton, J., Bluemke, E., Bowman, S. R., Christiansen, E., Cunningham, H., Dau, A., Gopal, A., Gilson, R., Graham, L., Howard, L., Kalra, N., Lee, T., Lin, K., Lofgren, P., Mosconi, F., O'Hara, C., Olsson, C., Petrini, L., ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

A StrongREJECT for Empty Jailbreaks

Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. A StrongREJECT for Empty Jailbreaks , August 2024. URL http://arxiv.org/abs/2402.10260. arXiv:2402.10260 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Jailbroken: How Does LLM Safety Training Fail ? Advances in Neural Information Processing Systems, 36: 0 80079--80110, December 2023

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail ? Advances in Neural Information Processing Systems, 36: 0 80079--80110, December 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/fd6613131889a4b656206c50a8bd7790-Abstract-Conference.html

work page 2023
[38]

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback , February 2025

Williams, M., Carroll, M., Narang, A., Weisser, C., Murphy, B., and Dragan, A. On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback , February 2025. URL http://arxiv.org/abs/2411.02306. arXiv:2411.02306 [cs]

work page arXiv 2025
[39]

and Ding, K

Xu, R. and Ding, K. Large Language Models for Anomaly and Out -of- Distribution Detection : A Survey , February 2025. URL http://arxiv.org/abs/2409.01980. arXiv:2409.01980 [cs]

work page arXiv 2025
[40]

Young, R. J. Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks , November 2025. URL http://arxiv.org/abs/2511.22047. arXiv:2511.22047 [cs] version: 1

work page arXiv 2025
[41]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng, W., Liu, Y., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., Sturman, O., and Wahltinez, O. Shieldgemma: Generative ai content moderation based on gemma, 2024. URL https://arxiv.org/abs/2407.21772

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Detection of

Yoo, KiYoon and Kim, Jangho and Jang, Jiho and Kwak, Nojun , editor =. Detection of. Findings of the. 2022 , pages =. doi:10.18653/v1/2022.findings-acl.289 , abstract =

work page doi:10.18653/v1/2022.findings-acl.289 2022
[43]

Uncertainty-aware step-wise verification with generative reward models.arXiv preprint arXiv:2502.11250,

Ye, Zihuiwen and Melo, Luckeciano Carvalho and Kaddar, Younesse and Blunsom, Phil and Staton, Sam and Gal, Yarin , month = feb, year =. Uncertainty-. doi:10.48550/arXiv.2502.11250 , abstract =

work page doi:10.48550/arxiv.2502.11250
[44]

Uncertainty estimation using a single deep deterministic neural network , url =

Van Amersfoort, Joost and Smith, Lewis and Teh, Yee Whye and Gal, Yarin , year =. Uncertainty estimation using a single deep deterministic neural network , url =. International conference on machine learning , publisher =

work page
[45]

Training-free bayesianization for low-rank adapters of large language models.arXiv preprint arXiv:2412.05723,

Shi, Haizhou and Wang, Yibin and Han, Ligong and Zhang, Huan and Wang, Hao , month = dec, year =. Training-. doi:10.48550/arXiv.2412.05723 , abstract =

work page doi:10.48550/arxiv.2412.05723
[46]

Epistemic

Osband, Ian and Wen, Zheng and Asghari, Seyed Mohammad and Dwaracherla, Vikranth and Ibrahimi, Morteza and Lu, Xiuyuan and Roy, Benjamin Van , month = may, year =. Epistemic. doi:10.48550/arXiv.2107.08924 , abstract =

work page doi:10.48550/arxiv.2107.08924
[47]

and Tigas, Panagiotis and Abate, Alessandro and Gal, Yarin , month = oct, year =

Melo, Luckeciano C. and Tigas, Panagiotis and Abate, Alessandro and Gal, Yarin , month = oct, year =. Deep. doi:10.48550/arXiv.2406.10023 , abstract =

work page doi:10.48550/arxiv.2406.10023
[48]

Lee, Kimin and Lee, Kibok and Lee, Honglak and Shin, Jinwoo , year =. A. Advances in

work page
[49]

Kirichenko, Polina and Izmailov, Pavel and Wilson, Andrew Gordon , month = jun, year =. Last. doi:10.48550/arXiv.2204.02937 , abstract =

work page doi:10.48550/arxiv.2204.02937
[50]

Unfamiliar

Kang, Katie and Wallace, Eric and Tomlin, Claire and Kumar, Aviral and Levine, Sergey , month = may, year =. Unfamiliar. doi:10.48550/arXiv.2403.05612 , abstract =

work page doi:10.48550/arxiv.2403.05612
[51]

Izmailov, Pavel and Kirichenko, Polina and Gruver, Nate and Wilson, Andrew Gordon , month = oct, year =. On. doi:10.48550/arXiv.2210.11369 , abstract =

work page doi:10.48550/arxiv.2210.11369
[52]

Uncertainty

Gleave, Adam and Irving, Geoffrey , month = mar, year =. Uncertainty. doi:10.48550/arXiv.2203.07472 , abstract =

work page doi:10.48550/arxiv.2203.07472
[53]

Exploring the

Fort, Stanislav and Ren, Jie and Lakshminarayanan, Balaji , year =. Exploring the. Advances in

work page
[54]

Dherin, B., Hu, H., Ren, J., Dusenberry, M. W., and Lakshminarayanan, B. Morse. July. doi:10.48550/arXiv.2307.00667

[55]

Chan, A., Alaa, A., Qian, Z., and Van Der Schaar, M. Unlabelled data improves bayesian uncertainty calibration under covariate shift. In International Conference on Machine Learning.

[56]

Burt, D. R., Ober, S. W., Garriga-Alonso, A., and van der Wilk, M. Understanding. November. doi:10.48550/arXiv.2011.09421

[58]

Srivastava, P., Nalli, S. S., Deshpande, A., and Sharma, A. Outlier-. April.

[59]

Ielanskyi, M., Schweighofer, K., Aichberger, L., and Hochreiter, S. Addressing. March.

[60]

Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 2021. doi:10.1007/s10994-021-05946-3

[61]

Farquhar, S. and Gal, Y. What '. November.

[62]

Jailbroken:. Advances in Neural Information Processing Systems, 2023.

[63]

Agentic. Anthropic Research, 2025.

[64]

Chowdhury, N., Johnson, D., Huang, V., Steinhardt, J., and Schwettmann, S. Investigating truthfulness in a pre-release o3 model. April.

[65]

OpenAI. August.

[66]

System Card: Claude Opus 4.5.

[67]

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Goodfriend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., Askell, A., Bailey, N., Benton, J., Bluemke, E., Bowman, S. R., Christiansen, E., Cunningham, H., Dau, A., Gopal, A., Gilson, R., Gra... doi:10.48550/arXiv.2501.18837

[68]

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., and Evans, O. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, May 2025. doi:10.48550/arXiv.2502.17424

[69]

Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.741

[70]

Manakul, P., Liusie, A., and Gales, M. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.557

[71]

Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D. J., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., and Berant, J. Helping or. August. doi:10.48550/arXiv.2312.09244

[72]

Coste, T., Anwar, U., Kirk, R., and Krueger, D. Reward. March. doi:10.48550/arXiv.2310.02743

[73]

Xu, Y., Kang, H., Suresh, T., Wan, Y., and Singh, G. Learning a. May. doi:10.48550/arXiv.2505.20556

[74]

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., ... Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. doi:10.48550/arXiv.2204.05862

[75]

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., ... Constitutional AI: Harmlessness from AI Feedback. doi:10.48550/arXiv.2212.08073

[76]

Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. A. August. doi:10.48550/arXiv.2402.10260

[77]

Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., Dombrowski, A.-K., Goel, S., Phan, L., Mukobi, G., Helm-Burger, N., Lababidi, R., Justen, L., Liu, A. B., Chen, M., Barrass, I., Zhang, O., Zhu, X., ... The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. doi:10.48550/arXiv.2403.03218

[78]

Williams, M., Carroll, M., Narang, A., Weisser, C., Murphy, B., and Dragan, A. On. February. doi:10.48550/arXiv.2411.02306

[79]

A. The New York Times, 2023.

[80]

OpenAI. Sycophancy in. April.

[81]

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., Perez, E., ... Towards Understanding Sycophancy in Language Models. doi:10.48550/arXiv.2310.13548

[82]

Fang, H., Zhu, X., and Gurevych, I. Preemptive. December. doi:10.48550/arXiv.2407.11843

[83]

Lee, K., Lee, K., Lee, H., and Shin, J. A. October. doi:10.48550/arXiv.1807.03888

[84]

Hendrycks, D. and Gimpel, K. A. October. doi:10.48550/arXiv.1610.02136

[85]

Xu, R. and Ding, K. Large. February. doi:10.48550/arXiv.2409.01980

[86]

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? July. doi:10.48550/arXiv.2307.02483

[1] [1]

Does safety training of llms generalize to semantically related natural prompts?, 2025

Addepalli, S., Varun, Y., Suggala, A., Shanmugam, K., and Jain, P. Does safety training of llms generalize to semantically related natural prompts?, 2025. URL https://arxiv.org/abs/2412.03235

[2] [2]

System card: Claude opus 4.5

Anthropic. System card: Claude opus 4.5. Technical report, November 2025

[3] [3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...

[4] [4]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, May 2025

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., and Evans, O. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, May 2025. URL http://arxiv.org/abs/2502.17424. arXiv:2502.17424 [cs]

[5] [5]

Envisioning outlier exposure by large language models for out-of-distribution detection, 2024

Cao, C., Zhong, Z., Zhou, Z., Liu, Y., Liu, T., and Han, B. Envisioning outlier exposure by large language models for out-of-distribution detection, 2024. URL https://arxiv.org/abs/2406.00806

[6] [6]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., Hassani, H., and Wong, E. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, October 2024. URL http://arxiv.org/abs/2404.01318. arXiv:2404.01318 [cs]

[7] [9]

Investigating truthfulness in a pre-release o3 model, April 2025

Chowdhury, N., Johnson, D., Huang, V., Steinhardt, J., and Schwettmann, S. Investigating truthfulness in a pre-release o3 model, April 2025. URL https://transluce.org/investigating-o3-truthfulness

[8] [10]

Reward Model Ensembles Help Mitigate Overoptimization, March 2024

Coste, T., Anwar, U., Kirk, R., and Krueger, D. Reward Model Ensembles Help Mitigate Overoptimization, March 2024. URL http://arxiv.org/abs/2310.02743. arXiv:2310.02743 [cs]

[9] [11]

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D. J., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., and Berant, J. Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, August 2024. URL http://arxiv.org/abs/2312.09244. arXiv:2312.09244 [cs]

[10] [12]

Exploring the Limits of Out-of-Distribution Detection

Fort, S., Ren, J., and Lakshminarayanan, B. Exploring the Limits of Out-of-Distribution Detection. In Advances in Neural Information Processing Systems, volume 34, pp. 7068–7081. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/3941c4358616274ac2436eacf67fae05-Abstract.html

[11] [13]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-J...

[12] [14]

Alignment faking in large language models

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., and Hubinger, E. Alignment faking in large language models, 2024. URL https://arxiv.org/abs/2412.14093

[13] [15]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Hendrycks, D. and Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, October 2018. URL http://arxiv.org/abs/1610.02136. arXiv:1610.02136 [cs]

[14] [16]

AI Induced Psychosis: A shallow investigation

Hua, T. AI Induced Psychosis: A shallow investigation. August 2025. URL https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced-psychosis-a-shallow-investigation

[15] [17]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674

[16] [20]

Teaching Claude Why

Kutasov, J., Jermyn, A., Steen, J., Le, M., Bowman, S. R., Marks, S., Leike, J., Askell, A., Olah, C., Hubinger, E., and Price, S. Teaching Claude Why, May 2026. URL https://alignment.anthropic.com/2026/teaching-claude-why/

[17] [21]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

[18] [22]

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

Lee, K., Lee, K., Lee, H., and Shin, J. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper_files/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html

[19] [23]

Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models, January 2026

Liang, S., Xu, Z., Weng, J., Tao, J., Xue, H., and Wang, X. Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models, January 2026. URL http://arxiv.org/abs/2508.09201. arXiv:2508.09201 [cs]

[20] [24]

Agentic Misalignment: How LLMs Could be an Insider Threat

Lynch, A., Wright, B., Larson, C., Troy, K. K., Ritchie, S. J., Mindermann, S., Perez, E., and Hubinger, E. Agentic Misalignment: How LLMs Could be an Insider Threat. Anthropic Research, 2025

[21] [25]

Mahalanobis, P. C. On the generalized distance in statistics. The National Institute of Sciences of India, 2(1):49–55, 1936

[22] [26]

Frontier Models are Capable of In-context Scheming

Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., and Hobbhahn, M. Frontier models are capable of in-context scheming, 2025. URL https://arxiv.org/abs/2412.04984

[23] [27]

Jaildam: Jailbreak detection with adaptive memory for vision-language model, 2025

Nian, Y., Zhu, S., Qin, Y., Li, L., Wang, Z., Xiao, C., and Zhao, Y. Jaildam: Jailbreak detection with adaptive memory for vision-language model, 2025. URL https://arxiv.org/abs/2504.03770

[24] [28]

Technical report: Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b

OpenAI. Technical report: Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. Technical report, OpenAI, October 2025a. URL https://cdn.openai.com/pdf/08b7dee4-8bc6-4955-a219-7793fb69090c/Technical_report__Research_Preview_of_gpt_oss_safeguard.pdf

[25] [29]

GPT-5 System Card

OpenAI. GPT-5 System Card. Technical report, August 2025b

[26] [30]

Sycophancy in GPT-4o: What happened and what we’re doing about it, April 2025c

OpenAI. Sycophancy in GPT-4o: What happened and what we’re doing about it, April 2025c. URL https://openai.com/index/sycophancy-in-gpt-4o/

[27] [31]

Revisiting mahalanobis distance for transformer-based out-of-domain detection, 2022

Podolskiy, A., Lipin, D., Bout, A., Artemova, E., and Piontkovskaya, I. Revisiting mahalanobis distance for transformer-based out-of-domain detection, 2022. URL https://arxiv.org/abs/2101.03778

[28] [33]

A Conversation With Bing’s Chatbot Left Me Deeply Unsettled

Roose, K. A Conversation With Bing’s Chatbot Left Me Deeply Unsettled. The New York Times, February 2023. ISSN 0362-4331. URL https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html

[29] [34]

Towards Understanding Sycophancy in Language Models

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., and Perez, E. Towards Understanding Sycophancy in Language Models, May 2025a. URL http://arxiv.org/abs/2310.13548. arXi...

[30] [35]

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Goodfriend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., Askell, A., Bailey, N., Benton, J., Bluemke, E., Bowman, S. R., Christiansen, E., Cunningham, H., Dau, A., Gopal, A., Gilson, R., Graham, L., Howard, L., Kalra, N., Lee, T., Lin, K., Lofgren, P., Mosconi, F., O'Hara, C., Olsson, C., Petrini, L., ...

[31] [36]

A StrongREJECT for Empty Jailbreaks

Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. A StrongREJECT for Empty Jailbreaks, August 2024. URL http://arxiv.org/abs/2402.10260. arXiv:2402.10260 [cs]

[32] [37]

Jailbroken: How Does LLM Safety Training Fail?

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems, 36:80079–80110, December 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/fd6613131889a4b656206c50a8bd7790-Abstract-Conference.html

[33] [38]

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback, February 2025

Williams, M., Carroll, M., Narang, A., Weisser, C., Murphy, B., and Dragan, A. On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback, February 2025. URL http://arxiv.org/abs/2411.02306. arXiv:2411.02306 [cs]

[34] [39]

Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey

Xu, R. and Ding, K. Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey, February 2025. URL http://arxiv.org/abs/2409.01980. arXiv:2409.01980 [cs]

[35] [40]

Young, R. J. Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks, November 2025. URL http://arxiv.org/abs/2511.22047. arXiv:2511.22047 [cs] version: 1

[36] [41]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng, W., Liu, Y., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., Sturman, O., and Wahltinez, O. Shieldgemma: Generative ai content moderation based on gemma, 2024. URL https://arxiv.org/abs/2407.21772

[37] [42]

Yoo, K., Kim, J., Jang, J., and Kwak, N. Detection of. Findings of the. 2022. doi:10.18653/v1/2022.findings-acl.289

[38] [43]

Ye, Z., Melo, L. C., Kaddar, Y., Blunsom, P., Staton, S., and Gal, Y. Uncertainty-aware step-wise verification with generative reward models. arXiv preprint arXiv:2502.11250, February. doi:10.48550/arXiv.2502.11250
