Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Anca Dragan; Cassidy Laidlaw; Dylan Feng; Pragya Srivastava

arxiv: 2605.21602 · v2 · pith:QPCPLFWWnew · submitted 2026-05-20 · 💻 cs.AI · cs.SE

Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3

keywords: out-of-distribution detection · LLM alignment · safety monitoring · guard models · Mahalanobis distance · perplexity · MOOD benchmark

The pith

Combining guard models with Mahalanobis distance and perplexity OOD detectors improves recall of out-of-distribution LLM alignment failures from 39% to 45%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the MOOD benchmark to study whether monitoring systems can spot alignment failures that occur in situations the models were not trained on. Guard models trained on limited safety data tend to miss these failures when the inputs differ from the training examples. Adding out-of-distribution detectors helps catch more of them. The authors demonstrate that this hybrid approach scales positively and outperforms simply using a much larger guard model.

Core claim

Guard models often fail to generalize to out-of-distribution alignment failures, but combining them with Mahalanobis distance and perplexity-based OOD detectors raises recall from 39% to 45%. This hybrid method shows positive scaling across model sizes and achieves higher recall gains than a guard model with 20 times more parameters. The MOOD benchmark supports these findings by using a restricted training set for monitors and seven test sets with alignment failures outside that distribution.
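To make the headline numbers concrete, here is a minimal sketch of how recall is computed and what the move from 39% to 45% amounts to. The function name and the toy labels are illustrative assumptions; only the two recall figures come from the paper.

```python
def recall(flags, labels):
    """Fraction of truly misaligned conversations (label 1) that the monitor flagged."""
    true_positives = sum(1 for f, y in zip(flags, labels) if f and y)
    positives = sum(labels)
    return true_positives / positives if positives else 0.0

# Toy data (hypothetical): three misaligned conversations, two of them flagged.
toy_recall = recall(flags=[1, 0, 1, 0], labels=[1, 1, 1, 0])

# Recall figures reported in the paper.
baseline, combined = 0.39, 0.45
absolute_gain = combined - baseline        # 6 percentage points
relative_gain = absolute_gain / baseline   # roughly 15% more failures caught
```

Read this way, the hybrid monitor catches about 15% more of the OOD failures than the guard model alone, while still missing more than half of them.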

What carries the argument

The hybrid monitor combining a guard model (safety classifier) with Mahalanobis distance and perplexity OOD detectors, evaluated on the MOOD benchmark.
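One simple way to picture such a hybrid monitor is an OR rule over per-detector thresholds. This summary does not state how the paper actually combines the scores, so the rule, the `Thresholds` class, and the example values below are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    """Hypothetical per-detector cutoffs; in practice these would be calibrated."""
    guard: float        # guard-model probability that the conversation is unsafe
    mahalanobis: float  # distance above which the input is treated as OOD
    perplexity: float   # perplexity above which the input is treated as OOD

def hybrid_flag(guard_score, maha_score, ppl_score, t):
    """Flag the conversation if any single detector exceeds its threshold (assumed OR rule)."""
    return (
        guard_score >= t.guard
        or maha_score >= t.mahalanobis
        or ppl_score >= t.perplexity
    )

t = Thresholds(guard=0.5, mahalanobis=3.0, perplexity=50.0)
missed_by_guard = hybrid_flag(0.1, 4.2, 10.0, t)  # guard is confident, but the input looks unusual
ordinary_input = hybrid_flag(0.1, 1.0, 10.0, t)   # nothing fires
```

An OR rule can only add flags on top of the guard model, so it can raise recall but will also raise the false-positive rate unless the OOD thresholds are set conservatively.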

If this is right

Monitoring pipelines for LLMs should include OOD detection to handle unforeseen alignment failures.
Combined monitors benefit from scaling up model size more than guard models alone.
The recall gains from OOD detection exceed those from increasing guard model parameters by a factor of 20.
Further development of OOD detectors could lead to more robust LLM safety systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may achieve better safety by focusing on detecting shifts in input patterns instead of training ever-larger safety classifiers.
This work implies that many alignment issues arise from distributional novelty rather than inherent model weaknesses.
Real-world deployments could use these monitors to flag unusual prompts for human review or model fallback.

Load-bearing premise

The seven test sets contain alignment failures that lie outside the distribution of the restricted training set used to train the monitors.

What would settle it

If adding the OOD detectors fails to improve recall when the test failures are drawn from the same distribution as the training data, or if the improvement does not appear on additional OOD test sets.

Figures

Figures reproduced from arXiv: 2605.21602 by Anca Dragan, Cassidy Laidlaw, Dylan Feng, Pragya Srivastava.

**Figure 1.** We systematically study incorporating out-of-distribution (OOD) detectors into LLM safety monitoring to catch alignment failures outside the training distribution. LLMs are often deployed with a guard model (right) trained with safety training data (left). However, if a prompt or response is outside of the training distribution, the guard model may generalize incorrectly and fail to flag safety issues. Add…

**Figure 2.** We introduce Misalignment Out Of Distribution (MOOD), a benchmark which tests LLM monitors for their ability to recognize unforeseen LLM alignment failures. MOOD includes seven test sets containing conversations with distinct alignment failures. To ensure that these test sets are truly out-of-distribution, we train our own guard models and OOD detectors on a restricted post-training dataset that we careful…

**Figure 3.** To better understand the Mahalanobis OOD detector, we apply PCA to the activations of the Qwen2.5-32B guard model on which we compute the Mahalanobis distance. We plot the resulting principal components of 200 conversations from each test dataset above. For each dataset, we also show the relative change in misalignment recall for the combined guard + Mahalanobis model compared to using the guard model alo…

**Figure 5.** The improvement in OOD misalignment recall when training guard models additionally on some of the MOOD test sets. We display both the increase in recall relative to the baseline Gemma 2 9B guard model as well as the absolute recall in parentheses. The first seven rows each correspond to adding a single test dataset to the training data. The “union” row measures the recall on each test dataset when taking …

**Figure 6.** The average misalignment recall of six methods across three models from the Gemma 2 family with 2, 9, and 27 billion parameters. Methods improve significantly from the 2B to the 9B model, but the misalignment recall drops from the 9B to the 27B model. We hypothesize this may be because the 27B model is suboptimally trained; we use the same hyperparameters across all model sizes, and 27B might require diffe…

**Figure 7.** Per-token perplexity results on different test samples. Tokens highlighted with brighter colors have higher perplexity. The conversation on the left is from the sycophantic test set and the conversation on the right is from the function calling deception (missing tools) test set. Many of the sycophantic tokens are flagged as high-perplexity in the sycophantic conversation, while very few of the tokens are …

**Figure 8.** The distributions of the numbers of tokens and Flesch-Kincaid grade levels (Kincaid et al., 1975) of conversations in each MOOD test set. The significant overlap between test set and train set distributions means that it is not trivial to detect OOD conversations based on surface level features. The majority of samples in our test datasets are cleanly classifiable with respect to the training dataset using…
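The Flesch-Kincaid grade level used as a surface feature in Figure 8 follows a fixed formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it with a deliberately crude tokenizer and a vowel-group syllable counter; both heuristics are assumptions for illustration and are not the tooling the authors used.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from naive sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )

simple = flesch_kincaid_grade("The cat sat. The dog ran.")
dense = flesch_kincaid_grade(
    "Distributional novelty complicates automated monitoring of conversational systems."
)
```

Because the score depends only on sentence length and word length, two conversations with very different safety content can land on the same grade level, which is consistent with the caption's point that surface features alone do not separate the test sets from the training set.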

Original abstract

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem. We release the code and data for our experiments publicly, and you can find the relevant links here: https://github.com/Dylan102938/mood-bench.
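For readers unfamiliar with perplexity as an OOD signal, the sketch below shows the standard definition: the exponential of the mean negative log-likelihood of the tokens under a language model. It assumes the per-token natural-log probabilities have already been obtained from some model; which model the paper scores with, and how it turns perplexity into a flag, is not specified in this summary. The function names and the threshold are hypothetical.

```python
import math

def token_surprisals(logprobs):
    """Per-token negative log-likelihood; higher means the token was less expected."""
    return [-lp for lp in logprobs]

def perplexity(logprobs):
    """Sequence perplexity: exp of the mean per-token negative log-likelihood."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_surprisals(logprobs)) / len(logprobs))

def is_ood_by_perplexity(logprobs, threshold):
    """Treat a conversation as OOD when its perplexity exceeds a calibrated threshold."""
    return perplexity(logprobs) > threshold

# Four tokens, each assigned probability 0.5 by the (assumed) scoring model.
uniform_half = [math.log(0.5)] * 4
ppl = perplexity(uniform_half)
```

A model that assigns every token probability 0.5 has perplexity 2, which matches the intuition of choosing between two equally likely options at each step.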

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Guard models plus basic OOD detectors improve recall on alignment failures in this new benchmark, though confirming the tests are truly OOD would strengthen the case.

The main thing to know is that this paper builds a benchmark called MOOD to test monitors on out-of-distribution alignment failures in LLMs, and finds that adding Mahalanobis and perplexity OOD detectors to a guard model raises recall from 39% to 45%, outperforming a much larger guard model.

They do a few things right. The setup with a restricted training set and seven separate test sets for different failure modes gives a clean way to measure generalization. They run comparisons across four detector types and track how performance scales with model size. The positive scaling for the combined monitors is a useful data point, and the claim that OOD detection helps more than just scaling parameters has practical implications for safety work.

The weaker part is the lack of direct evidence that the test sets are truly out of distribution from the training data. The abstract describes the construction but does not include any quantitative checks like distance metrics or distribution comparisons. That leaves open the possibility that the recall gain comes from general differences rather than the OOD-specific handling the authors intend. Methods details are also thin in the summary, so it's hard to judge if the splits or hyperparameters were tuned in ways that affect the results.

This paper is aimed at researchers building monitoring systems for deployed LLMs. Anyone thinking about how to catch unexpected failures would find the benchmark and the detector comparisons worth looking at. It is worth sending to peer review because the core idea addresses a real gap in current guard models, and the empirical results are concrete enough to spark discussion even if some validation steps need more work.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MOOD benchmark for evaluating monitors on out-of-distribution (OOD) alignment failures in LLMs. It uses a restricted training set to train monitors and seven test sets containing diverse alignment failures asserted to lie outside that distribution. The central empirical finding is that guard models (safety classifiers) generalize poorly OOD, but combining a guard model with Mahalanobis-distance and perplexity-based OOD detectors raises recall from 39% to 45%. The work also reports positive scaling trends for combined monitors across model sizes and claims that adding OOD detection yields larger recall gains than scaling the guard model by a factor of 20.

Significance. If the OOD status of the test sets and the reported recall gains are robustly established, the paper supplies a concrete benchmark and practical evidence that OOD detection is a high-leverage addition to LLM monitoring pipelines. The scaling results and the comparison against larger guard models are directly actionable for safety engineering.

major comments (1)

[§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.
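One of the checks named above, mean Mahalanobis distance of each test set relative to the training distribution, can be sketched as follows. The example assumes that fixed-length embedding vectors have already been extracted for every conversation; the layer, pooling, and model used for those embeddings are not given here, and all function names and toy vectors are hypothetical. The small ridge term is an added assumption to keep the covariance invertible.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mu, ridge=1e-6):
    """Sample covariance with a small ridge on the diagonal for numerical stability."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed via a linear solve."""
    diff = [a - b for a, b in zip(x, mu)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))

def mean_distance(rows, mu, cov):
    """Average Mahalanobis distance of a set of embeddings to the training distribution."""
    return sum(mahalanobis(r, mu, cov) for r in rows) / len(rows)

# Toy 2-D "embeddings" (hypothetical): a training set and a shifted test set.
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[5.0, 5.0], [6.0, 5.0]]
mu = mean_vector(train)
cov = covariance(train, mu, ridge=0.0)
```

Reporting `mean_distance` for the training set next to each of the seven test sets would give a direct, if coarse, view of how far each test set sits from the training distribution.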

minor comments (2)

[§4.1] The abstract and experimental sections should explicitly state the precise definitions and hyperparameter choices for the four OOD detectors tested, including any post-hoc tuning that could affect the 39%-to-45% comparison.
[Table 2 and Figure 3] Figure captions and tables reporting recall should include error bars or statistical significance tests for the scaling trends across model sizes.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed report. We address the major comment below and will incorporate the suggested verification to strengthen the interpretation of the MOOD benchmark results.

Referee: [§3 (Benchmark Construction) and §4 (Experiments)] The central claim that the 39%→45% recall improvement is attributable to OOD detection (rather than any distributional difference) rests on the seven test sets being genuinely out-of-distribution relative to the restricted training set. No quantitative verification—such as mean Mahalanobis distance, perplexity histograms, maximum mean discrepancy, or other distributional statistics—is reported comparing the training distribution to each test set. This verification is load-bearing for interpreting the benchmark results as OOD-specific.

Authors: We agree that explicit quantitative verification of the distributional shift would strengthen the central claim. The MOOD benchmark defines the test sets by selecting diverse alignment failures (e.g., novel jailbreak styles, unusual response patterns, and failure modes) that are excluded from the restricted training set by construction; this restricted set is a curated subset of safety data used to train the monitors. Nevertheless, we acknowledge that reporting statistics such as mean Mahalanobis distances on model embeddings, perplexity histograms, or maximum mean discrepancy would provide more rigorous evidence that the performance gains arise specifically from OOD detection rather than incidental distributional differences. We will add these analyses to §3 in the revised manuscript, including comparisons for each of the seven test sets, and will reference them when interpreting the recall improvements in §4.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with measured recall on held-out sets

The paper constructs the MOOD benchmark with a restricted training set used to train monitors and seven test sets asserted to contain alignment failures outside that distribution. Reported results consist of directly measured recall improvements (39% to 45%) and scaling trends on these held-out test sets rather than any derivation, fitted parameter, or self-referential definition that reduces the central claim to its inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the evaluation is falsifiable via standard held-out performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no new mathematical axioms, free parameters fitted to the target result, or invented entities; the central claims rest on the assumption that the constructed test sets are OOD relative to the restricted training distribution.

pith-pipeline@v0.9.0 · 5778 in / 1164 out tokens · 29907 ms · 2026-05-22T09:34:20.324048+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD)... combining guard models with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.