Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3
The pith
Combining guard models with Mahalanobis distance and perplexity OOD detectors improves recall of out-of-distribution LLM alignment failures from 39% to 45%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guard models often fail to generalize to out-of-distribution alignment failures, but combining them with Mahalanobis distance and perplexity-based OOD detectors raises recall from 39% to 45%. This hybrid method shows positive scaling across model sizes and achieves higher recall gains than a guard model with 20 times more parameters. The MOOD benchmark supports these findings by using a restricted training set for monitors and seven test sets with alignment failures outside that distribution.
What carries the argument
The hybrid monitor combining a guard model (safety classifier) with Mahalanobis distance and perplexity OOD detectors, evaluated on the MOOD benchmark.
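To make the Mahalanobis component concrete, here is a minimal sketch of a Mahalanobis-distance OOD score. It assumes each input has already been mapped to a fixed-size embedding vector. The function names, the ridge term, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_gaussian(train_embeddings, ridge=1e-6):
    """Estimate mean and inverse covariance of in-distribution embeddings.

    `ridge` is an assumed regularizer that keeps the covariance invertible.
    """
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(x, mean, cov_inv):
    """Return the Mahalanobis distance of x from the fitted distribution."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy data standing in for embeddings of the restricted training set.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
mean, cov_inv = fit_gaussian(train)

near = mahalanobis_score(np.zeros(4), mean, cov_inv)      # close to the training mean
far = mahalanobis_score(np.full(4, 6.0), mean, cov_inv)   # far from the training mean
```

A larger score means the input sits further from the training distribution; a threshold on that score (chosen on held-out in-distribution data) turns it into a detector.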
If this is right
- Monitoring pipelines for LLMs should include OOD detection to handle unforeseen alignment failures.
- Combined guard-plus-OOD monitors show positive scaling trends as model size increases.
- The recall gains from OOD detection exceed those from increasing guard model parameters by a factor of 20.
- Further development of OOD detectors could lead to more robust LLM safety systems.
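One simple way to picture a hybrid monitor is as an OR over detectors: an input is flagged if the guard model or any OOD detector fires. The sketch below is an assumption about how such a combination could work, since the paper's exact combination rule is not given here. The thresholds, field names, and scores are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    guard: float        # guard model's probability of a failure
    mahalanobis: float  # distance from the training distribution
    perplexity: float   # perplexity under a reference language model

def hybrid_flag(s, guard_t=0.5, maha_t=5.0, ppl_t=80.0):
    """Flag if any detector reaches its (hypothetical) threshold."""
    return s.guard >= guard_t or s.mahalanobis >= maha_t or s.perplexity >= ppl_t

def recall(flags, labels):
    """Fraction of true failures (label True) that were flagged."""
    positives = [f for f, y in zip(flags, labels) if y]
    return sum(positives) / len(positives) if positives else 0.0

# Toy examples, all true alignment failures; values are invented.
examples = [
    Scores(0.9, 2.0, 20.0),   # caught by the guard model
    Scores(0.2, 7.5, 30.0),   # missed by the guard, caught by Mahalanobis
    Scores(0.1, 1.0, 150.0),  # missed by the guard, caught by perplexity
    Scores(0.3, 2.0, 25.0),   # missed by every detector
]
labels = [True] * len(examples)

guard_only = recall([s.guard >= 0.5 for s in examples], labels)
combined = recall([hybrid_flag(s) for s in examples], labels)
```

An OR rule can never lower recall relative to the guard model alone, but it also flags more benign inputs, so the thresholds would need tuning against an acceptable false-positive rate.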
Where Pith is reading between the lines
- Developers may achieve better safety by focusing on detecting shifts in input patterns instead of training ever-larger safety classifiers.
- This work implies that many alignment issues arise from distributional novelty rather than inherent model weaknesses.
- Real-world deployments could use these monitors to flag unusual prompts for human review or model fallback.
Load-bearing premise
The seven test sets contain alignment failures that lie outside the distribution of the restricted training set used to train the monitors.
What would settle it
The claim would be undermined if adding the OOD detectors failed to improve recall when the test failures are drawn from the same distribution as the training data, or if the improvement did not appear on additional OOD test sets.
Original abstract
Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem.
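The perplexity-based detector mentioned in the abstract relies on a standard quantity: the exponentiated average negative log-likelihood of the tokens. A minimal sketch, assuming per-token natural-log probabilities are already available from some language model (the values below are invented):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs: a familiar prompt and an unusual one.
familiar = [-0.5, -0.7, -0.4, -0.6]
unusual = [-3.2, -4.1, -2.8, -3.9]

ppl_familiar = perplexity(familiar)
ppl_unusual = perplexity(unusual)
```

Text that the reference model finds surprising gets a higher perplexity, which is what lets the score act as an OOD signal.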
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MOOD benchmark for evaluating monitors on out-of-distribution (OOD) alignment failures in LLMs. It uses a restricted training set to train monitors and seven test sets containing diverse alignment failures asserted to lie outside that distribution. The central empirical finding is that guard models (safety classifiers) generalize poorly OOD, but combining a guard model with Mahalanobis-distance and perplexity-based OOD detectors raises recall from 39% to 45%. The work also reports positive scaling trends for combined monitors across model sizes and claims that adding OOD detection yields larger recall gains than scaling the guard model by a factor of 20.
Significance. If the OOD status of the test sets and the reported recall gains are robustly established, the paper supplies a concrete benchmark and practical evidence that OOD detection is a high-leverage addition to LLM monitoring pipelines. The scaling results and the comparison against larger guard models are directly actionable for safety engineering.
major comments (1)
- [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.
minor comments (2)
- [§4.1] The abstract and experimental sections should explicitly state the precise definitions and hyper-parameter choices for the four OOD detectors tested, including any post-hoc tuning that could affect the 39%-to-45% comparison.
- [Table 2 and Figure 3] Figure captions and tables reporting recall should include error bars or statistical significance tests for the scaling trends across model sizes.
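For the error bars requested above, one common option is a percentile bootstrap over test examples. The following is a sketch on assumed data: the outcomes are invented and are not the paper's results, and the function name and defaults are hypothetical.

```python
import random

def bootstrap_recall_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for recall.

    `hits` holds one bool per true failure: True if the monitor flagged it.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(hits)
    stats = []
    for _ in range(n_resamples):
        # Resample the failures with replacement and recompute recall.
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(hits) / n, lower, upper

# Invented outcomes: 45 of 100 failures detected.
hits = [True] * 45 + [False] * 55
point, lower, upper = bootstrap_recall_ci(hits)
```

When comparing two monitors on the same test set, a paired resample (computing both recalls on the same resampled examples) gives a more informative interval for the difference.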
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed report. We address the major comment below and will incorporate the suggested verification to strengthen the interpretation of the MOOD benchmark results.
-
Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.
Authors: We agree that explicit quantitative verification of the distributional shift would strengthen the central claim. The MOOD benchmark defines the test sets by selecting diverse alignment failures (e.g., novel jailbreak styles, unusual response patterns, and failure modes) that are excluded from the restricted training set by construction; this restricted set is a curated subset of safety data used to train the monitors. Nevertheless, we acknowledge that reporting statistics such as mean Mahalanobis distances on model embeddings, perplexity histograms, or maximum mean discrepancy would provide more rigorous evidence that the performance gains arise specifically from OOD detection rather than incidental distributional differences. We will add these analyses to §3 in the revised manuscript, including comparisons for each of the seven test sets, and will reference them when interpreting the recall improvements in §4. revision: yes
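As an illustration of one of the statistics named above, the following sketch estimates the squared maximum mean discrepancy (MMD) between two samples with an RBF kernel. The bandwidth, sample sizes, and synthetic data are assumptions for demonstration; nothing here comes from the MOOD benchmark itself.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def mmd2(x, y, bandwidth=2.0):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    return float(kxx + kyy - 2 * kxy)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 8))    # stand-in for training embeddings
same = rng.normal(0.0, 1.0, size=(200, 8))     # drawn from the same distribution
shifted = rng.normal(1.5, 1.0, size=(200, 8))  # stand-in for an OOD test set

mmd_same = mmd2(train, same)
mmd_shifted = mmd2(train, shifted)
```

A test set that is genuinely out of distribution should yield a clearly larger value than a held-out sample from the training distribution; a permutation test could turn the gap into a p-value.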
Circularity Check
Empirical benchmark evaluation with measured recall on held-out sets
The paper constructs the MOOD benchmark with a restricted training set used to train monitors and seven test sets asserted to contain alignment failures outside that distribution. Reported results consist of directly measured recall improvements (39% to 45%) and scaling trends on these held-out test sets rather than any derivation, fitted parameter, or self-referential definition that reduces the central claim to its inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the evaluation is falsifiable via standard held-out performance metrics.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
Unclear relation between the paper passage and the cited Recognition theorem.
We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD)... combining guard models with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%.
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
Unclear relation between the paper passage and the cited Recognition theorem.
We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.