arxiv: 2605.21602 · v1 · pith:QPCPLFWWnew · submitted 2026-05-20 · 💻 cs.AI · cs.SE

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords out-of-distribution detection · LLM alignment · safety monitoring · guard models · Mahalanobis distance · perplexity · MOOD benchmark

The pith

Combining guard models with Mahalanobis distance and perplexity OOD detectors improves recall of out-of-distribution LLM alignment failures from 39% to 45%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the MOOD benchmark to study whether monitoring systems can spot alignment failures that occur in situations the models were not trained on. Guard models trained on limited safety data tend to miss these failures when the inputs differ from the training examples. Adding out-of-distribution detectors helps catch more of them. The authors demonstrate that this hybrid approach scales positively and outperforms simply using a much larger guard model.

Core claim

Guard models often fail to generalize to out-of-distribution alignment failures, but combining them with Mahalanobis distance and perplexity-based OOD detectors raises recall from 39% to 45%. This hybrid method shows positive scaling across model sizes and achieves higher recall gains than a guard model with 20 times more parameters. The MOOD benchmark supports these findings by using a restricted training set for monitors and seven test sets with alignment failures outside that distribution.

What carries the argument

The hybrid monitor combining a guard model (safety classifier) with Mahalanobis distance and perplexity OOD detectors, evaluated on the MOOD benchmark.
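One way to picture such a hybrid monitor is as a logical OR over three signals. This summary does not say how the paper combines the scores, so the OR rule, the `MonitorScores` container, and every threshold below are illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    """Hypothetical per-conversation scores; field names are illustrative."""
    guard_prob: float   # guard model's probability that the conversation is unsafe
    mahalanobis: float  # distance of the conversation's activations from training data
    perplexity: float   # perplexity of the conversation under a reference LM


def hybrid_flag(scores, guard_threshold=0.5,
                mahalanobis_threshold=30.0, perplexity_threshold=50.0):
    """Flag if the guard model fires OR either OOD detector fires.

    The OR rule and the default thresholds are assumptions for illustration.
    """
    return (scores.guard_prob >= guard_threshold
            or scores.mahalanobis >= mahalanobis_threshold
            or scores.perplexity >= perplexity_threshold)


def recall(flags):
    """Fraction of known-misaligned conversations that were flagged."""
    return sum(flags) / len(flags)


# Four made-up misaligned conversations (all values are invented).
examples = [
    MonitorScores(guard_prob=0.9, mahalanobis=12.0, perplexity=8.0),   # guard fires
    MonitorScores(guard_prob=0.2, mahalanobis=41.0, perplexity=9.0),   # Mahalanobis fires
    MonitorScores(guard_prob=0.1, mahalanobis=10.0, perplexity=7.0),   # nothing fires
    MonitorScores(guard_prob=0.3, mahalanobis=15.0, perplexity=72.0),  # perplexity fires
]
guard_only_recall = recall([s.guard_prob >= 0.5 for s in examples])
hybrid_recall = recall([hybrid_flag(s) for s in examples])
print(guard_only_recall, hybrid_recall)
```

Under an OR rule the hybrid flags a superset of what the guard model flags, so recall cannot decrease. The same rule can also raise the false-positive rate, so in practice the thresholds would presumably be tuned against a false-positive budget.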

If this is right

  • Monitoring pipelines for LLMs should include OOD detection to handle unforeseen alignment failures.
  • Combined monitors benefit from scaling up model size more than guard models alone.
  • The recall gains from OOD detection exceed those from increasing guard model parameters by a factor of 20.
  • Further development of OOD detectors could lead to more robust LLM safety systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may achieve better safety by focusing on detecting shifts in input patterns instead of training ever-larger safety classifiers.
  • This work implies that many alignment issues arise from distributional novelty rather than inherent model weaknesses.
  • Real-world deployments could use these monitors to flag unusual prompts for human review or model fallback.

Load-bearing premise

The seven test sets contain alignment failures that lie outside the distribution of the restricted training set used to train the monitors.

What would settle it

If adding the OOD detectors fails to improve recall when the test failures are drawn from the same distribution as the training data, or if the improvement does not appear on additional OOD test sets.

Figures

Figures reproduced from arXiv: 2605.21602 by Cassidy Laidlaw, Dylan Feng, Pragya Srivastava.

Figure 1: We systematically study incorporating out-of-distribution (OOD) detectors into LLM safety monitoring to catch alignment failures outside the training distribution. LLMs are often deployed with a guard model (right) trained with safety training data (left). However, if a prompt or response is outside of the training distribution, the guard model may generalize incorrectly and fail to flag safety issues. Add…
Figure 2: We introduce Misalignment Out Of Distribution (MOOD), a benchmark which tests LLM monitors for their ability to recognize unforeseen LLM alignment failures. MOOD includes seven test sets containing conversations with distinct alignment failures. To ensure that these test sets are truly out-of-distribution, we train our own guard models and OOD detectors on a restricted post-training dataset that we careful…
Figure 3: To better understand the Mahalanobis OOD detector, we apply PCA to the activations of the Qwen2.5-32B guard model on which we compute the Mahalanobis distance. We plot the resulting principal components of 200 conversations from each test dataset above. For each dataset, we also show the relative change in misalignment recall for the combined guard + Mahalanobis model compared to using the guard model alo…
Figure 5: The improvement in OOD misalignment recall when training guard models additionally on some of the MOOD test sets. We display both the increase in recall relative to the baseline Gemma 2 9B guard model as well as the absolute recall in parentheses. The first seven rows each correspond to adding a single test dataset to the training data. The “union” row measures the recall on each test dataset when taking …
Figure 6: The average misalignment recall of six methods across three models from the Gemma 2 family with 2, 9, and 27 billion parameters. Methods improve significantly from the 2B to the 9B model, but the misalignment recall drops from the 9B to the 27B model. We hypothesize this may be because the 27B model is suboptimally trained; we use the same hyperparameters across all model sizes, and 27B might require diffe…
Figure 7: Per-token perplexity results on different test samples. Tokens highlighted with brighter colors have higher perplexity. The conversation on the left is from the sycophantic test set and the conversation on the right is from the function calling deception (missing tools) test set. Many of the sycophantic tokens are flagged as high-perplexity in the sycophantic conversation, while very few of the tokens are …
Figure 8: The distributions of the numbers of tokens and Flesch-Kincaid grade levels (Kincaid et al., 1975) of conversations in each MOOD test set. The significant overlap between test set and train set distributions means that it is not trivial to detect OOD conversations based on surface level features. The majority of samples in our test datasets are cleanly classifiable with respect to the training dataset using…
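The per-token perplexity highlighted in Figure 7 can be sketched from token log-probabilities. The captions do not give the paper's exact definition, so this sketch assumes the common one: a token's perplexity is the inverse of its probability, and a sequence's perplexity is the exponential of its mean negative log-likelihood. The function names and log-probabilities are hypothetical.

```python
import math


def per_token_perplexity(token_logprobs):
    """exp(-log p) for each token, i.e. the inverse probability of that token."""
    return [math.exp(-lp) for lp in token_logprobs]


def sequence_perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the whole sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Invented natural-log probabilities from a reference language model.
logprobs = [math.log(0.5), math.log(0.125), math.log(0.9)]
print(per_token_perplexity(logprobs))
print(sequence_perplexity(logprobs))
```

Tokens the reference model finds unlikely get large per-token values, which is what a brightness-coded highlight like the one described for Figure 7 would surface.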
Abstract

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
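As a rough illustration of a Mahalanobis-distance OOD detector, the sketch below fits a single Gaussian to training feature vectors and scores a new vector by its distance from that Gaussian. The abstract does not specify which activations are used, how they are pooled, or whether one Gaussian or several are fitted, so the single-Gaussian choice, the ridge term, and the toy 2-D vectors are all assumptions.

```python
import math


def fit_gaussian(rows, ridge=1e-6):
    """Return (mean, covariance) of the training vectors.

    A small ridge is added to the diagonal so the covariance stays invertible.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return mean, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)), computed via a linear solve."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = solve(cov, diff)
    return math.sqrt(max(0.0, sum(di * yi for di, yi in zip(diff, y))))


# Toy "training activations" (invented); real features would be model activations.
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
train_mean, train_cov = fit_gaussian(train)
print(mahalanobis([1.0, 1.0], train_mean, train_cov))    # at the training mean
print(mahalanobis([10.0, 10.0], train_mean, train_cov))  # far from the training data
```

A conversation would then be treated as OOD when its distance exceeds a threshold chosen on held-out in-distribution data; that thresholding step is likewise an assumption here.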

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MOOD benchmark for evaluating monitors on out-of-distribution (OOD) alignment failures in LLMs. It uses a restricted training set to train monitors and seven test sets containing diverse alignment failures asserted to lie outside that distribution. The central empirical finding is that guard models (safety classifiers) generalize poorly OOD, but combining a guard model with Mahalanobis-distance and perplexity-based OOD detectors raises recall from 39% to 45%. The work also reports positive scaling trends for combined monitors across model sizes and claims that adding OOD detection yields larger recall gains than scaling the guard model by a factor of 20.

Significance. If the OOD status of the test sets and the reported recall gains are robustly established, the paper supplies a concrete benchmark and practical evidence that OOD detection is a high-leverage addition to LLM monitoring pipelines. The scaling results and the comparison against larger guard models are directly actionable for safety engineering.

major comments (1)
  1. [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.
minor comments (2)
  1. [§4.1] The abstract and experimental sections should explicitly state the precise definitions and hyper-parameter choices for the four OOD detectors tested, including any post-hoc tuning that could affect the 39%-to-45% comparison.
  2. [Table 2 and Figure 3] Figure captions and tables reporting recall should include error bars or statistical significance tests for the scaling trends across model sizes.
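The error bars requested in the second minor comment could be produced in several ways. One option is a percentile bootstrap over per-conversation outcomes, sketched below. The referee does not name a method, so the percentile bootstrap, the function name, and the made-up counts (45 of 100 failures flagged) are assumptions.

```python
import random


def bootstrap_recall_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for recall.

    `flags` holds one 0/1 entry per known-misaligned conversation
    (1 = the monitor flagged it).
    """
    rng = random.Random(seed)
    n = len(flags)
    # Resample conversations with replacement and recompute recall each time.
    stats = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented outcomes: 45 flagged out of 100 misaligned conversations.
flags = [1] * 45 + [0] * 55
print(bootstrap_recall_ci(flags))
```

Reporting such an interval for each model size would show whether differences between sizes exceed resampling noise.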

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed report. We address the major comment below and will incorporate the suggested verification to strengthen the interpretation of the MOOD benchmark results.

  1. Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.

    Authors: We agree that explicit quantitative verification of the distributional shift would strengthen the central claim. The MOOD benchmark defines the test sets by selecting diverse alignment failures (e.g., novel jailbreak styles, unusual response patterns, and failure modes) that are excluded from the restricted training set by construction; this restricted set is a curated subset of safety data used to train the monitors. Nevertheless, we acknowledge that reporting statistics such as mean Mahalanobis distances on model embeddings, perplexity histograms, or maximum mean discrepancy would provide more rigorous evidence that the performance gains arise specifically from OOD detection rather than incidental distributional differences. We will add these analyses to §3 in the revised manuscript, including comparisons for each of the seven test sets, and will reference them when interpreting the recall improvements in §4. revision: yes
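Of the statistics named in this exchange, maximum mean discrepancy (MMD) is perhaps the least familiar. A minimal sketch follows, assuming an RBF kernel with a fixed bandwidth and the biased (V-statistic) estimator. The kernel, the bandwidth, and the toy 2-D feature vectors are illustrative choices and are not taken from the paper.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Invented embeddings: a training sample, a nearby test set, and a shifted one.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = [[x + 0.1, y + 0.1] for x, y in train]
far = [[x + 5.0, y + 5.0] for x, y in train]
print(mmd_squared(train, near), mmd_squared(train, far))
```

A larger value for a test set than for a held-out in-distribution sample would be one piece of evidence that the test set is distributionally shifted.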

Circularity Check

0 steps flagged

Empirical benchmark evaluation with measured recall on held-out sets

The paper constructs the MOOD benchmark with a restricted training set used to train monitors and seven test sets asserted to contain alignment failures outside that distribution. Reported results consist of directly measured recall improvements (39% to 45%) and scaling trends on these held-out test sets rather than any derivation, fitted parameter, or self-referential definition that reduces the central claim to its inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the evaluation is falsifiable via standard held-out performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no new mathematical axioms, free parameters fitted to the target result, or invented entities; the central claims rest on the assumption that the constructed test sets are OOD relative to the restricted training distribution.

pith-pipeline@v0.9.0 · 5778 in / 1164 out tokens · 29907 ms · 2026-05-22T09:34:20.324048+00:00 · methodology


