arxiv: 2606.10860 · v1 · pith:PQ5MYA73new · submitted 2026-06-09 · 💻 cs.CR · cs.CL

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

Pith reviewed 2026-06-27 12:44 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords instruction hierarchy · prompt injection · direct preference optimization · LLM safety · priority adherence · gravity-weighted DPO · multi-level instructions · over-refusal

The pith

Gravity-weighted DPO scales optimization by level distance to enforce five-level instruction hierarchies in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes a five-level instruction hierarchy that produces ten pairwise priority relations a model must respect when instructions from sources of different trust conflict. It introduces Gravity-Weighted Direct Preference Optimization, whose loss offset grows with the structural distance between levels, using either a linear or bilateral schedule that also factors in the victim level's privilege. When combined with hierarchy delimiter tokens and Instructional Segment Embeddings, the bilateral variant raises macro pairwise adherence on Llama-3.1-8B-Instruct while cutting over-refusal to half the rate of standard DPO. Ablations show that five-level training trades off generality against specialization relative to three-level training.

Core claim

We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens and Instructional Segment Embeddings, GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate.
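The count of ten relations follows from choosing two distinct levels out of five. The short sketch below enumerates them; the L0 to L4 labels match the level names used elsewhere on this page, while the convention that a lower index means higher privilege is an assumption made for illustration.

```python
from itertools import combinations

K = 5  # hierarchy depth; assumption: L0 is the most privileged level

# Each unordered pair of distinct levels is one priority relation:
# the first element (lower index) is assumed to win a conflict.
relations = list(combinations(range(K), 2))

for winner, loser in relations:
    print(f"L{winner} overrides L{loser} (distance {loser - winner})")

# Number of relations equals K * (K - 1) // 2
print(len(relations))
```

The distances range from 1 (adjacent levels) to 4 (L0 versus L4), which is the quantity a distance-based weighting can scale with.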

What carries the argument

Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting instruction levels under a bilateral schedule weighting both the privilege gap and the victim level's privilege.
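To make the idea of a distance-scaled offset concrete, here is a minimal sketch. The exact parameterization of the linear and bilateral schedules is not given on this page, so the two schedule functions, the `alpha` scale, the privilege formula, and the choice to subtract the offset inside the sigmoid are all assumptions for illustration, not the paper's definitions.

```python
import math

K = 5  # hierarchy depth; assumption: level 0 is the most privileged


def linear_offset(victim: int, attacker: int, alpha: float = 0.5) -> float:
    """Hypothetical linear schedule: offset grows with the level gap only."""
    return alpha * abs(attacker - victim)


def bilateral_offset(victim: int, attacker: int, alpha: float = 0.5) -> float:
    """Hypothetical bilateral schedule: gap scaled by the victim's privilege."""
    gap = abs(attacker - victim)
    victim_privilege = (K - victim) / K  # 1.0 for L0, 0.2 for L4 (assumed form)
    return alpha * gap * victim_privilege


def offset_dpo_loss(margin: float, offset: float, beta: float = 0.1) -> float:
    """Per-pair loss -log(sigmoid(beta * margin - offset)) (assumed form)."""
    z = beta * margin - offset
    # softplus(-z) equals -log(sigmoid(z)); written in a numerically stable way
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same reward margin, different conflicts: a larger offset demands a larger margin.
print(offset_dpo_loss(1.0, bilateral_offset(victim=0, attacker=4)))
print(offset_dpo_loss(1.0, bilateral_offset(victim=3, attacker=4)))
```

Under these assumed schedules, an L4 instruction overriding L0 is penalized more than an L4 instruction overriding L3, because both the gap and the victim's privilege are larger.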

If this is right

  • Models will more reliably follow higher-privilege instructions when they conflict with lower-privilege ones.
  • Over-refusal rates remain lower than under standard DPO while priority adherence rises.
  • Five-level training produces a generality-specialization tradeoff compared with three-level training.
  • Instructional Segment Embeddings act as a refusal-threshold calibrator.
  • Hierarchy-specific delimiter tokens support enforcement of the defined priority relations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bilateral schedule may generalize to other multi-source instruction settings such as tool-use or agentic workflows.
  • The same weighting principle could be tested on models larger than 8B to check scaling behavior.
  • Real deployment might reduce prompt-injection success rates by anchoring refusals to explicit privilege gaps.
  • A dynamic hierarchy that updates level distances from observed user corrections could be a natural next extension.

Load-bearing premise

That the structural distance between levels in the 5-level hierarchy, combined with the linear or bilateral weighting schedule, correctly captures the real-world severity and priority of instruction conflicts.

What would settle it

A dataset of real instruction conflicts in which human or downstream-task preferences assign different relative severities than the assumed five-level structural distances, such that GW-DPO shows no gain or a loss in macro pairwise adherence relative to standard DPO.
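Such a comparison depends on how macro pairwise adherence is computed. The sketch below assumes "macro" means an unweighted mean of per-pair adherence rates, so that rarely sampled level pairs count as much as common ones; the function name and record format are hypothetical, and the toy records are made up rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def macro_pairwise_adherence(results):
    """Average per-pair adherence rates so every level pair counts equally.

    `results` is an iterable of ((high_level, low_level), adhered) records.
    """
    by_pair = defaultdict(list)
    for pair, adhered in results:
        by_pair[pair].append(1.0 if adhered else 0.0)
    return mean(mean(outcomes) for outcomes in by_pair.values())


# Toy records for illustration only (not results from the paper)
toy = [((0, 1), True), ((0, 1), True), ((0, 1), False), ((3, 4), False)]
print(macro_pairwise_adherence(toy))
```

In the toy data the pooled rate is 2 out of 4, while the macro average is lower because the single (3, 4) failure weighs as much as the three (0, 1) records combined.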

Original abstract

Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formalizes a k-level (k=5) instruction hierarchy problem with ten pairwise priority relations, introduces Gravity-Weighted DPO (GW-DPO) whose per-sample offset scales with structural distance under linear or bilateral schedules (the latter weighting both gap and victim privilege), and claims that GW-DPO with the bilateral schedule plus hierarchy-specific delimiters and Instructional Segment Embeddings (ISE) Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct by raising macro pairwise priority adherence while halving the over-refusal rate; ablations are said to isolate ISE as a refusal-threshold calibrator and frame five- versus three-level training as a generality-specialization tradeoff.

Significance. If the empirical results hold under proper controls, the work supplies a concrete preference-optimization method for enforcing trust-differentiated instruction hierarchies, directly targeting prompt-injection vulnerabilities that arise from uniform token privilege; the bilateral schedule is a distinctive technical contribution that couples gap size with victim-level privilege, and the reported ablations on ISE and hierarchy depth provide useful empirical insight into the associated tradeoffs.

major comments (2)
  1. [Abstract] The central claim of Pareto improvement and halved over-refusal is stated without dataset details, statistical tests, error bars, or complete ablation controls, which are required to substantiate the reported gains on macro pairwise priority adherence.
  2. [Abstract] The bilateral weighting schedule is constructed directly from the internal 5-level structural distances; because both the training objective and the evaluation metric are defined over the same synthetic hierarchy, the reported improvements are consistent with the construction but do not demonstrate that the distances correctly encode real-world priority severities in prompt-injection or multi-source instruction data.
minor comments (1)
  1. [Abstract] The citations to Chen et al., 2025 and Wu et al., 2025 should be expanded with full bibliographic details once the references section is examined.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the manuscript's content and indicating revisions where they strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The central claim of Pareto improvement and halved over-refusal is stated without dataset details, statistical tests, error bars, or complete ablation controls, which are required to substantiate the reported gains on macro pairwise priority adherence.

    Authors: We agree the abstract is highly condensed and omits supporting details. The full manuscript reports the training set size and composition in Section 3.2, presents all main results with error bars across three random seeds in Table 2, includes paired statistical tests (p < 0.05) in Appendix C, and provides complete ablations isolating ISE and hierarchy depth in Section 4. To make the central claim more self-contained, we will expand the abstract by one sentence referencing dataset scale and statistical significance while remaining within length limits. revision: yes

  2. Referee: [Abstract] The bilateral weighting schedule is constructed directly from the internal 5-level structural distances; because both the training objective and the evaluation metric are defined over the same synthetic hierarchy, the reported improvements are consistent with the construction but do not demonstrate that the distances correctly encode real-world priority severities in prompt-injection or multi-source instruction data.

    Authors: The five-level hierarchy and its ten pairwise relations are explicitly motivated by documented trust distinctions in prompt-injection and multi-source instruction scenarios (Section 2). The bilateral schedule is designed to reflect both gap size and victim privilege precisely because these factors determine practical severity in such attacks. The synthetic construction enables exhaustive, controlled measurement of all ten relations, which would be infeasible on naturalistic data. We acknowledge that the work does not include direct empirical mapping to external real-world corpora and will add an explicit limitations paragraph stating this scope. The primary contribution remains the GW-DPO objective and its empirical isolation of the bilateral schedule's effect under controlled conditions. revision: partial

Circularity Check

0 steps flagged

No significant circularity; hierarchy and weighting are explicit design choices, not reductions to inputs

full rationale

The paper formalizes a k-level hierarchy (instantiated at k=5) and defines GW-DPO offsets explicitly from structural distances under linear/bilateral schedules. Both the training objective and the macro pairwise adherence metric operate over this same synthetic construction, but this is a standard problem-definition-plus-solution pattern rather than a circular reduction: the claimed Pareto improvement is an empirical outcome on the defined data, not forced by re-labeling fitted parameters or by self-citation chains. No equations are shown that equate the reported gains to the input distances by construction, and the cited delimiter/ISE components come from external 2025 works. The derivation therefore remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only; the central approach rests on the assumption that preference optimization can encode structural hierarchy distances, with the schedule type as a design choice.

free parameters (1)
  • linear or bilateral schedule
    The per-sample offset scaling is a modeling choice whose exact parameterization is not detailed in the abstract.
axioms (1)
  • domain assumption: LLMs can be trained via preference optimization to enforce explicit pairwise priority relations across instruction levels
    Invoked as the basis for applying GW-DPO to the formalized hierarchy.

pith-pipeline@v0.9.1-grok · 5800 in / 1308 out tokens · 27497 ms · 2026-06-27T12:44:45.909951+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    In Advances in Neural Information Processing Systems, volume 37, pages 136037–136083

    Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 37, pages 136037–136083. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. Kenneth J. Biba. ...

  2. [2]

    In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA. Association for Computing Machinery. Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, a...

  3. [3]

    The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

    sDPO: Don’t use your data all at once. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 366–373, Abu Dhabi, UAE. Association for Computational Linguistics. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations. R. Thom...

  4. [4]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Enhancing alignment using curriculum learning & ranked preferences. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12891–12907, Miami, Florida, USA. Association for Computational Linguistics. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.095...

  5. [5]

    Mark Russinovich, Ahmed Salem, and Ronen Eldan

    Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741. Mark Russinovich, Ahmed Salem, and Ronen Eldan

  6. [6]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack. In 34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440. Paul Röttger, Hannah R. Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. Pr...

  7. [7]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Instructional segment embedding: Improving LLM safety with instruction hierarchy. In International Conference on Learning Representations. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Lingui...

  8. [8]

    Sampling a base row from the filtered evaluation pool

  9. [9]

    Classifying its domain

  10. [10]

    Gathering candidate materials: 2–3 L0 rules from the relevant category, 2–3 domain-matched L1 prompts, 2–3 injection templates (safety-targeting for L0-victim pairs, non-safety otherwise), 2–3 domain-matched L4 candidates, and L2 attribute/value options

  11. [11]

    Sending everything to GPT-4o (temperature 0.7) with a structured prompt that names the conflict pair, identifies which level must win and which must lose, and requires that all five levels are topically coherent and that evaluation_criteria are automatically checkable

  12. [12]

    genuine understanding test

    Receiving a structured JSON response with the selected/adapted L0–L4 content, a conflict_description, correct_behavior, violation_behavior, and a list of evaluation_criteria. Malformed JSON or validation failures (missing fields, empty criteria) trigger up to two retries with a +0.1 temperature bump per retry; after three failures the scenario is disc...

  13. [13]

    philosophy

    by closing a Bradley-Terry preference likelihood (Bradley and Terry, 1952) in the policy itself, yielding the loss $$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \cdot \Delta r_\theta(x, y_w, y_l) \right) \right], \tag{5}$$ where $\Delta r_\theta$ is the implicit reward margin defined in §5. The Bradley-Terry derivation guarantees only that $\Delta r_\theta$ becomes positive at the optimum: the chosen response is more likely...
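The loss in Eq. (5) can be sketched from sequence log-probabilities. The paper's §5 definition of the implicit reward margin is not reproduced on this page, so the sketch assumes the standard DPO form (the chosen response's policy-to-reference log-ratio minus the rejected response's); the function names and the `beta` default are illustrative.

```python
import math


def implicit_reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Assumed standard DPO margin: chosen log-ratio minus rejected log-ratio."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(batch, beta=0.1):
    """Mean of -log(sigmoid(beta * margin)) over a batch of preference pairs.

    Each batch item is (logp_w, logp_l, ref_logp_w, ref_logp_l), the summed
    token log-probabilities of the chosen (w) and rejected (l) responses
    under the policy and the frozen reference model.
    """
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in batch:
        z = beta * implicit_reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l)
        # softplus(-z) equals -log(sigmoid(z)); numerically stable form
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(batch)


# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss([(-12.0, -15.0, -12.0, -15.0)]))
```

Any positive margin lowers the loss below log 2, which matches the observation that the Bradley-Terry objective by itself only pushes the margin to be positive.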