
arxiv: 2606.07808 · v1 · pith:IWAAA7GVnew · submitted 2026-06-05 · 💻 cs.AI

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

Pith reviewed 2026-06-27 21:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords: instruction hierarchy, reasoning language models, self-monitoring, conflict resolution, failure localization, agentic workflows, rule following, LLM safety

The pith

A diagnostic framework localizes instruction hierarchy failures in reasoning models to three stages and uses training-free self-monitors to reduce most violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that non-compliant responses in reasoning language models arise from distinct, separable failures rather than a single end-to-end issue. It introduces a white-box framework that separates these failures into instruction identification, conflict resolution, and response realization, and finds that the dominant failure varies by model, task, and context length. The authors observe that models can often detect conflicts and output violations when explicitly prompted, and build on this to create two monitors: a parallel input monitor that checks the input context for low-latency conflict detection before generation, and a sequential output monitor that reviews and repairs the drafted response. Across three models, the strongest monitor cuts rule-following non-compliance by 81-99%; for GPT-5.3 the reduction is 86% under static attacks and 45% under adaptive attacks. This matters for agentic workflows, where models must reliably obey the highest-privilege applicable instruction when sources conflict.

Core claim

The central claim has two parts. First, instruction hierarchy failures can be diagnosed with a white-box approach that localizes each failure to the instruction identification, conflict resolution, or response realization stage, using long-context adaptations of IHEval and IHChallenge. Second, two training-free self-monitoring mechanisms repair many of these failures: a parallel input monitor that detects conflicts before generation, and a sequential output monitor that reviews and repairs the response. The strongest monitor reduces rule-following non-compliance by 81-99% across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3; for GPT-5.3 the reduction is 86% under static attacks and 45% under adaptive attacks.
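The staged diagnosis can be illustrated with a minimal sketch. It assumes each non-compliant example has already been judged on three yes/no questions, one per stage; the `TraceJudgment` fields and the `localize_failure` helper are hypothetical names for illustration, not the paper's implementation. A failure is attributed to the first stage that breaks.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    IDENTIFICATION = "instruction identification"
    RESOLUTION = "conflict resolution"
    REALIZATION = "response realization"


@dataclass
class TraceJudgment:
    """Hypothetical per-example judgments, e.g. from a grader reading the reasoning trace."""
    identified_instructions: bool  # trace names the relevant instructions and their sources
    resolved_by_privilege: bool    # trace sides with the highest-privilege instruction
    response_compliant: bool       # final response or tool call passes the compliance check


def localize_failure(j: TraceJudgment) -> Optional[Stage]:
    """Return the first stage that breaks, or None when the response is compliant."""
    if j.response_compliant:
        return None  # only non-compliant outputs are diagnosed
    if not j.identified_instructions:
        return Stage.IDENTIFICATION
    if not j.resolved_by_privilege:
        return Stage.RESOLUTION
    # The reasoning was right but the output still violated the instruction.
    return Stage.REALIZATION
```

Checking the stages in order matters: a trace that never identified the higher-privilege instruction cannot meaningfully be scored on how it resolved the conflict.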

What carries the argument

The white-box diagnostic framework that partitions failures into instruction identification, conflict resolution, and response realization, together with the parallel input monitor and sequential output monitor.
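The control flow of the two monitors might look roughly like the sketch below. Everything here is an assumption made for illustration: the `(role, content)` message format, the `generate` and monitor callables, the wording of the warning messages, and the choice to inject feedback as a system message. The source only states that the input monitor runs in parallel and restarts generation with a warning on conflicts, and that the output monitor checks drafts and rewrites them when violations are detected.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content); simplified, assumed chat format
Generate = Callable[[List[Message]], str]
# Assumed monitor interfaces: each returns (flagged, feedback_text).
InputMonitor = Callable[[List[Message]], Tuple[bool, str]]
OutputMonitor = Callable[[List[Message], str], Tuple[bool, str]]


def respond_with_pim(context: List[Message], generate: Generate,
                     input_monitor: InputMonitor) -> str:
    """Parallel input monitor: check the context while the main response is drafted."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        draft_future = pool.submit(generate, context)
        check_future = pool.submit(input_monitor, context)
        conflict, feedback = check_future.result()
        draft = draft_future.result()  # a real system might cancel this on conflict
    if not conflict:
        return draft
    # Conflict detected: restart generation with a warning added to the context.
    warned = context + [("system", f"Monitor warning: {feedback}")]
    return generate(warned)


def respond_with_som(context: List[Message], generate: Generate,
                     output_monitor: OutputMonitor, max_rewrites: int = 1) -> str:
    """Sequential output monitor: review the finished draft and rewrite it if flagged."""
    draft = generate(context)
    for _ in range(max_rewrites):
        violation, feedback = output_monitor(context, draft)
        if not violation:
            break
        revision_request = ("system", f"Revise the draft. Monitor feedback: {feedback}")
        draft = generate(context + [("assistant", draft), revision_request])
    return draft  # the last rewrite is returned without a further check
```

The structural difference is the point of the sketch: the input monitor only needs the context, so it can overlap with generation, while the output monitor needs the draft and therefore has to wait for it.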

If this is right

  • Dominant failure modes differ across models, tasks, and context lengths.
  • The strongest monitor reduces rule-following non-compliance by 81-99% on the evaluated benchmarks.
  • GPT-5.3 sees 86% reduction under static attacks and 45% under adaptive attacks.
  • Both monitors operate without additional training and apply to long-context instruction conflicts.
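To make the percentages above concrete, the short sketch below computes a relative reduction in non-compliance. It assumes the reported figures are relative to the no-monitor baseline rate, which is how the summary reads; the example rates are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_rate: float, monitored_rate: float) -> float:
    """Fractional drop in non-compliance relative to the no-monitor baseline."""
    if not 0 < baseline_rate <= 1:
        raise ValueError("baseline_rate must be in (0, 1]")
    if not 0 <= monitored_rate <= 1:
        raise ValueError("monitored_rate must be in [0, 1]")
    return (baseline_rate - monitored_rate) / baseline_rate


# Hypothetical rates: 20% non-compliance without a monitor, 2% with one.
print(f"{relative_reduction(0.20, 0.02):.0%}")  # prints 90%
```

Read this way, a 45% reduction under adaptive attacks still leaves 55% of the baseline failures in place, which is why the static and adaptive figures are worth reporting separately.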

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The localization approach could extend to diagnosing other safety properties such as refusal or truthfulness in the same staged manner.
  • Monitors might be chained with external tools in agentic systems to handle cases where internal detection is weak.
  • The framework suggests a path for iterative improvement by targeting training data at the weakest stage for each model.

Load-bearing premise

Models can detect conflicts and output violations when explicitly prompted, and this ability transfers to effective monitors without further training.

What would settle it

Running the monitors on a new reasoning model where explicit prompting for violation detection produces no usable signals and non-compliance rates remain unchanged would show the repair method does not hold.

Figures

Figures reproduced from arXiv: 2606.07808 by G. Edward Suh, Sanjay Kariyappa.

Figure 1: Instruction hierarchy compliance is a multi-stage process. A model must identify relevant instructions and their sources, resolve conflicts according to privilege, and realize the resolved instruction set in its final response or tool call. White-box reasoning traces let us localize non-compliant outputs to the first stage where this process breaks.
Figure 2: Instruction interpretation is context-dependent. The same tool response m2 (an email requesting project details) has no instruction component in case (a), where the user asks only to summarize emails, but carries an instruction to respond and share project details in case (b), where the user asks the model to respond to emails.
Figure 3: Instruction-hierarchy failures increase with context length, but the dominant failure mode varies. Bars show diagnosed failures as a percentage of all examples, decomposed into failure categories. Counts indicate the number of diagnosed failures.
Figure 4: Compliance degrades with context length, while detection remains accurate. We next evaluate whether IH compliance can be improved with training-free self-monitoring. We consider two mechanisms that monitor different points in the compliance pipeline: a parallel input monitor (PIM), which checks the input context for conflicts before the main response is produced, and a sequential output monitor (SOM), wh…
Figure 5: Parallel input monitoring. PIM checks context in parallel and restarts with a warning on conflicts.
Figure 6: Sequential output monitoring. SOM checks and rewrites drafts when violations are detected.
Figure 7: Targeted interventions reduce non-compliance. Baseline (B), generic warnings (Warn), PIM, and SOM on rule-following and safety conflicts. Colored bars decompose adjudicated valid failures by stage; slate bars show aggregate non-compliance when diagnostic traces are unavailable.
Figure 8: Adaptive PAIR attacks expose category-specific robustness. Attack success rates across IHChallenge categories. Colored stacks show diagnostic categories when judgments are available; slate bars show aggregate ASR otherwise. Hatching denotes the intervention, and the rightmost group shows how often PIM/SOM fired on successful attacks.
Figure 9: Benchmark construction and failure diagnosis. We construct long-context variants by inserting benign multi-turn context between a known higher-priority instruction and a later conflicting lower-priority instruction. After rollout, final responses are graded for compliance; non-compliant examples are then diagnosed from the model’s reasoning trace and final response as instruction-identification, conflict-r…
Figure 10: Latency overhead on aligned requests. PIM adds a small overhead over the baseline because the monitor runs in parallel with the main response. SOM is substantially slower because it waits for the draft and then performs a sequential output review.
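The latency contrast described for Figure 10 can be approximated with a back-of-the-envelope model. This is a simplification under stated assumptions: it covers only aligned requests (no restart or rewrite is triggered), it ignores scheduling and batching effects, and the timings are hypothetical rather than measured values from the paper.

```python
def pim_latency(main_s: float, monitor_s: float) -> float:
    """Parallel monitor: the slower of the two concurrent calls sets the latency."""
    return max(main_s, monitor_s)


def som_latency(main_s: float, review_s: float) -> float:
    """Sequential monitor: the review starts only after the draft is finished."""
    return main_s + review_s


# Hypothetical timings in seconds.
main, check = 2.0, 0.5
print(pim_latency(main, check), som_latency(main, check))
```

Under this model the parallel monitor adds overhead only when the check outlasts the main response, whereas the sequential monitor always pays the full review time on top of generation.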
Original abstract

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a white-box diagnostic framework that decomposes instruction hierarchy (IH) failures in reasoning LMs into three stages—instruction identification, conflict resolution, and response realization—on long-context adaptations of IHEval and IHChallenge. It evaluates Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6, finding that dominant failure modes vary by model, task, and context length. Building on the observation that models often detect conflicts and violations under explicit prompting, the authors propose two training-free self-monitoring mechanisms (parallel input monitor for pre-generation detection and sequential output monitor for post-generation review/repair). Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor yields 81-99% reductions in rule-following non-compliance (86% static / 45% adaptive for GPT-5.3).

Significance. If the empirical reductions hold under rigorous controls, the work supplies both an interpretable localization tool for IH failures and practical, training-free repair methods that could meaningfully improve compliance in agentic deployments. The multi-model evaluation and explicit separation of failure stages are strengths; the training-free claim and reported effect sizes on held-out behaviors add practical value. The significance is limited by the absence of reported statistical validation and detection metrics for the core prompting observation.

major comments (3)
  1. §4 (Evaluation): The abstract and results claim 81-99% reductions without reporting statistical tests, confidence intervals, or controls for prompt phrasing variations in the explicit-prompt detection step; this makes it impossible to determine whether the gains exceed what prompt engineering alone would achieve on the same IHEval/IHChallenge distributions.
  2. §5.1-5.2 (Monitor construction): The parallel input and sequential output monitors are derived directly from the observation that models can detect conflicts and output violations when explicitly prompted. The paper must report precision/recall (or equivalent) of this detection step on the long-context evaluation sets; if detection is unreliable, the headline compliance gains cannot be attributed to the proposed repair mechanism.
  3. §3 (Diagnostic framework): The localization into identification/resolution/realization stages is presented as white-box and interpretable, but the paper describes no inter-rater reliability, automated validation against human annotations, or ablation on the diagnostic prompts themselves; without this, the claim that dominant failure modes vary across models, tasks, and context lengths rests on unvalidated stage assignments.
minor comments (2)
  1. Abstract / §4: The abstract mentions GPT-5.3, but the evaluation section details only three models; clarify whether the GPT-5.3 results use the same diagnostic or a different protocol.
  2. §3: Notation for the three failure stages is introduced without a summary table; adding one would improve readability when comparing dominant modes across context lengths.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the points raised identify areas where additional rigor will strengthen the claims. We outline revisions below to address each major comment.

Point-by-point responses
  1. Referee: §4 (Evaluation): The abstract and results claim 81-99% reductions without reporting statistical tests, confidence intervals, or controls for prompt phrasing variations in the explicit-prompt detection step; this makes it impossible to determine whether the gains exceed what prompt engineering alone would achieve on the same IHEval/IHChallenge distributions.

    Authors: We agree that statistical validation and controls for prompt sensitivity are necessary to substantiate the reductions. In the revised manuscript we will report 95% bootstrap confidence intervals on all compliance metrics, paired statistical tests (Wilcoxon signed-rank) comparing each monitor condition to the no-monitor baseline, and an ablation using three alternative phrasings of the explicit detection prompt to quantify variance attributable to prompt engineering. revision: yes

  2. Referee: §5.1-5.2 (Monitor construction): The parallel input and sequential output monitors are derived directly from the observation that models can detect conflicts and output violations when explicitly prompted. The paper must report precision/recall (or equivalent) of this detection step on the long-context evaluation sets; if detection is unreliable, the headline compliance gains cannot be attributed to the proposed repair mechanism.

    Authors: The monitors are motivated by the observed detection behavior under explicit prompting. We will add precision, recall, and F1 scores for the violation-detection step on the long-context IHEval and IHChallenge sets for all models in the revision. These metrics will be computed by comparing the monitor outputs against the ground-truth violation labels already present in the benchmarks. revision: yes

  3. Referee: §3 (Diagnostic framework): The localization into identification/resolution/realization stages is presented as white-box and interpretable, but the paper describes no inter-rater reliability, automated validation against human annotations, or ablation on the diagnostic prompts themselves; without this, the claim that dominant failure modes vary across models, tasks, and context lengths rests on unvalidated stage assignments.

    Authors: The three-stage localization relies on diagnostic prompts. We will add a validation subsection reporting (i) Cohen’s kappa from two human annotators on a 300-example stratified sample, (ii) agreement between the automated stage labels and human annotations, and (iii) an ablation of the diagnostic prompt wording to measure sensitivity of the reported failure-mode distributions. revision: yes
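The statistics promised in the responses above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' analysis code: the function names are hypothetical, the bootstrap is a plain percentile bootstrap over paired per-example non-compliance flags, and the Wilcoxon signed-rank test mentioned in the first response is not shown.

```python
import random
from collections import Counter
from typing import Hashable, Sequence, Tuple


def paired_bootstrap_ci(baseline: Sequence[int], monitored: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for the drop in non-compliance rate (baseline minus monitored).

    Both inputs hold 0/1 non-compliance flags for the same examples, in the same order.
    """
    if len(baseline) != len(monitored) or len(baseline) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    gaps = [b - m for b, m in zip(baseline, monitored)]
    # Resample examples with replacement and record the mean gap each time.
    means = sorted(sum(rng.choices(gaps, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def precision_recall_f1(labels: Sequence[bool],
                        flagged: Sequence[bool]) -> Tuple[float, float, float]:
    """Detection quality of a monitor against ground-truth violation labels."""
    tp = sum(1 for y, p in zip(labels, flagged) if y and p)
    fp = sum(1 for y, p in zip(labels, flagged) if not y and p)
    fn = sum(1 for y, p in zip(labels, flagged) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators' stage labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Resampling examples rather than the two conditions independently keeps the pairing intact, which matters because the same benchmark items are scored with and without the monitor.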

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation of monitors on held-out behaviors

full rationale

The paper presents an empirical diagnostic framework and training-free monitors derived from an explicit observation about prompted conflict detection. No equations, fitted parameters, or self-citations are used to derive the reported compliance reductions (81-99%); the gains are measured directly on evaluated models and tasks. The central claim remains independent of any definitional or predictive reduction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or fitted parameters; the paper is an empirical study of LLM behavior with no free parameters, axioms, or invented entities required for the central claims.

pith-pipeline@v0.9.1-grok · 5810 in / 1079 out tokens · 17525 ms · 2026-06-27T21:49:47.655597+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 6 linked inside Pith

  1. [1] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.

  2. [2] Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813.

  3. [3] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security.

  4. [4] Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Michael Pokorny, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, and Kai Xiao. IH-challenge: A training dataset to improve instruction hierarchy on frontier LLMs. arXiv preprint arXiv:2603.10521.

  5. [5] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kıcıman. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720.

  6. [6] Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, and Neil Gong. AlignSentinel: Alignment-aware detection of prompt injection attacks. arXiv preprint arXiv:2602.13597.

  7. [7] Hao Li and Xiaogeng Liu. InjecGuard: Benchmarking and mitigating over-defense in prompt injection guardrail models. arXiv preprint arXiv:2410.22770.

  8. [8] Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, and Chaowei Xiao. DRIFT: Dynamic rule-based defense with injection isolation for securing LLM agents. arXiv preprint arXiv:2506.12104.

  9. [9] Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. LLM defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221.

  10. [10] Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A closer look at system prompt robustness. arXiv preprint arXiv:2502.12197.

  11. [11] OpenAI. GPT-5.3 Chat Latest. https://developers.openai.com/api/docs/models/gpt-5.3-chat-latest. OpenAI API model documentation. Accessed 2026-05-06.

  12. [12] ProtectAI. DeBERTa-v3-base prompt injection detector. https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2.

  13. [13] Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. Derail yourself: Multi-turn LLM jailbreak attack through self-discovered clues. arXiv preprint arXiv:2410.10700.

  14. [14] Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt injection defenses. arXiv preprint arXiv:2507.15219.

  15. [15] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.

  16. [16] Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for LLMs. arXiv preprint arXiv:2502.19820.

  17. [17] Tong Wu, Chong Xiang, Jiachen T. Wang, and Prateek Mittal. Effectively controlling reasoning models through thinking intervention. arXiv preprint arXiv:2503.24370.

  18. [18] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506.

  19. [19] Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, and Daniel Khashabi. Many-tier instruction hierarchy in LLM agents. arXiv preprint arXiv:2604.09443.

  20. [20] Zhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, and Meng Jiang. IHEval: Evaluating language models on following the instruction hierarchy. arXiv preprint arXiv:2502.08745.

  21. [21]

    A Benchmark construction. IHEval-Long. IHEval-Long is a long-context adaptation of the existing IHEval conflict splits, not a new task family. Each source example already contains a higher-priority system instruction, a later lower-priority user instruction that conflicts with it, and a task-specific programmatic checker for whether the final response follo...

  22. [22]

    conflict_detected

    All IHEval-Long examples in these splits contain a known lower-priority instruction that conflicts with a higher-priority instruction, so this rate is a true-positive rate (TPR) with respect to the benchmark conflict label. We also report the unnecessary warning rate (UWR) on aligned controls: the fraction of aligned examples for which the monitor trigger...

  23. [23]

    introduced the instruction hierarchy framework and showed that fine-tuning on synthetic conflict scenarios improves robustness to conflicts between higher- and lower-privilege instructions. Subsequent benchmarks evaluate whether models follow the hierarchy in controlled settings, including IHEval [Zhang et al., 2025], IHChallenge [Guo et al., 2026], Hiera...
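The true-positive rate (TPR) and unnecessary warning rate (UWR) described in entry 22 above are both simple firing rates over different example sets. The sketch below assumes one boolean per example indicating whether the monitor fired; the function name and the sample values are hypothetical, and the truncated source text does not give the exact definition beyond this.

```python
from typing import Sequence


def firing_rate(fired: Sequence[bool]) -> float:
    """Fraction of examples on which the monitor fired."""
    if len(fired) == 0:
        raise ValueError("need at least one example")
    return sum(fired) / len(fired)


# Hypothetical monitor decisions, not results from the paper.
fired_on_conflicts = [True, True, False, True]   # every example contains a known conflict
fired_on_aligned = [False, False, True, False]   # aligned controls with no conflict

tpr = firing_rate(fired_on_conflicts)  # firing here is a correct detection
uwr = firing_rate(fired_on_aligned)    # firing here is an unnecessary warning
print(tpr, uwr)
```

Reporting both rates together guards against a monitor that scores well on detection simply by warning on everything.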