pith. machine review for the scientific record.

arxiv: 2605.09737 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 Lean theorem links

CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords large language models · system prompts · cross-attention · prompt injection · instruction following · jailbreaking · transformer layers

The pith

Dedicated cross-attention layers anchor system prompts inside transformers and raise instruction adherence while lowering jailbreak rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard self-attention gives system prompts and user messages equal structural weight, allowing later instructions to erode earlier rules. CALYREX inserts cross-attention between the user input and the fixed system prompt at selected layers so the model can route behavioral constraints separately. Ablations locate the best insertion point in the final eighth of layers, where activation analysis already shows rule-related signals concentrate. At 8B scale with matched data and parameters the method improves instruction-following scores and multi-turn adherence while cutting many-shot jailbreak success.
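As a concrete picture of the mechanism, here is a minimal sketch of a cross-attention anchoring block in PyTorch. It is not the paper's implementation: the class name, head count, norm placement, and FFN width are assumptions. The paper specifies only that user tokens cross-attend to the fixed system-prompt states at selected late layers, with a feedforward network before the next layer and a residual path that can bypass the block.

    import torch
    import torch.nn as nn

    class CrossAttentionAnchor(nn.Module):
        """Minimal sketch of a CAL-style block: running hidden states query the
        cached system-prompt states. Hypothetical; dimensions, norm placement,
        and FFN shape are assumptions, not the paper's exact design."""

        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, hidden: torch.Tensor, system_states: torch.Tensor) -> torch.Tensor:
            # Queries come from the running sequence; keys and values come only from
            # the fixed system-prompt representation, giving the rules their own route.
            attn_out, _ = self.cross_attn(self.norm1(hidden), system_states, system_states)
            hidden = hidden + attn_out                      # residual path can bypass the block
            hidden = hidden + self.ffn(self.norm2(hidden))  # feedforward before the next layer
            return hidden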

Core claim

By adding cross-attention pathways that let every user token attend directly to the system-prompt tokens, the architecture structurally isolates privileged instructions. Placing these layers in the last eighth of the stack proves optimal on a 1.5B model and transfers to 8B, where the same change yields a 7.4 point gain on IFEval, a 16.3 point gain on multi-turn adherence, and a 13 point drop in jailbreak success rate. The advantage grows with scale, consistent with larger models making fuller use of the dedicated routing path.

What carries the argument

Cross-attention between the input sequence and the system-prompt tokens, inserted only at the final eighth of transformer layers to isolate behavioral constraints.
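To make "the final eighth" concrete: for a stack of num_layers decoder layers, one simple way to pick the insertion indices is sketched below (the rounding rule is an assumption; the paper does not spell one out).

    def late_eighth_layers(num_layers: int) -> list[int]:
        # Insert CAL blocks only in the last 1/8 of the stack; rounding is assumed.
        start = num_layers - max(1, num_layers // 8)
        return list(range(start, num_layers))

    # e.g. late_eighth_layers(32) -> [28, 29, 30, 31]; late_eighth_layers(28) -> [25, 26, 27]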

If this is right

  • Instruction adherence improves on both single-turn and multi-turn tasks without extra training data.
  • Many-shot jailbreak attacks become less effective because the system rules receive a separate attention route.
  • The benefit widens as model size increases, suggesting the mechanism scales favorably.
  • Optimal layer placement aligns with where behavioral signals already concentrate in standard models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-attention pattern could be used to anchor other fixed context such as few-shot examples or retrieved documents.
  • Combining the structural anchor with existing safety fine-tuning might allow lighter post-training.
  • The approach is architecture-agnostic and could be tested on decoder-only models beyond the 1.5B–8B range examined.

Load-bearing premise

The measured gains are produced by the added cross-attention pathway rather than by any uncontrolled difference in training data, optimizer schedule, or evaluation protocol.

What would settle it

Train an identical 8B model on the same data without the cross-attention layers and compare its IFEval score, multi-turn adherence, and jailbreak success rate to the CALYREX version.

Figures

Figures reproduced from arXiv: 2605.09737 by Li Lixing.

Figure 1. The CALYREX architecture. A CAL block performs cross-attention between the system prompt and the full input; the result is then fed to a feedforward network before the next layer. The CAL block is inserted between normal self-attention blocks, and residual connections can bypass it.
Figure 2. Empirical benchmark results across CAL placement configurations on the Qwen2.5-1.5B backbone.
Figure 3. Mechanistic activation magnitude heatmap; each cell represents the CAL/Attn …
Figure 4. Structural resistance to adversarial and contextual degradation at 8B scale.
Original abstract

Modern large language models (LLMs) rely on system prompts to establish behavioral constraints and safety rules. Standard causal self-attention treats privileged instructions and untrusted user content with equal structural priority -- a mismatch that leaves models vulnerable to prompt injection and instruction erosion over extended contexts. We propose CALYREX (Cross-Attention LaYeR EXtended transformers), which utilizes cross-attention between input and system prompt to structurally isolate and anchor the rule. A placement ablation on a 1.5B backbone identifies insertion at the final eighth of layers as optimal, confirmed by mechanistic activation analysis showing behavioral constraints are naturally concentrated there. At 8B scale, controlling for training data, backbone, and parameter budget, CALYREX yields $+7.4\%$ on instruction-following (IFEval) and $+16.3\%$ on multi-turn instruction adherence, while reducing many-shot jailbreaking attack success rate by $13\%$. This advantage appears to widen with model scale, consistent with larger models more effectively utilizing the dedicated routing pathway.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CALYREX, a transformer variant that inserts cross-attention layers between the system prompt and input tokens to structurally anchor behavioral rules and safety constraints. A layer-placement ablation on a 1.5B backbone identifies the final eighth of layers as optimal, supported by mechanistic activation analysis. At 8B scale, with controls for training data, backbone, and parameter count, the method reports +7.4% on IFEval, +16.3% on multi-turn instruction adherence, and a 13% reduction in many-shot jailbreak success rate, with gains appearing to increase with scale.

Significance. If the reported gains are attributable to the cross-attention anchoring mechanism rather than capacity or optimization differences, the approach would offer a lightweight architectural route to improved prompt adherence and jailbreak resistance that scales with model size. The combination of targeted ablation and mechanistic analysis provides a concrete starting point for further work on structural isolation of privileged context.

major comments (3)
  1. [Abstract] The claim of controlling for parameter budget is load-bearing for attributing the +7.4% IFEval and +16.3% multi-turn gains to the cross-attention pathway, yet the addition of separate Q/K/V projection matrices for the system stream necessarily increases parameter count; no description is given of how this overhead is exactly offset (e.g., by reducing hidden dimension or layer width elsewhere) to keep total parameters matched to the baseline.
  2. [Abstract and §4, placement ablation] The optimal insertion point (final eighth of layers) is identified solely on the 1.5B model and transferred to 8B without re-ablation or sensitivity analysis; because the central scaling claim rests on this placement generalizing, the absence of 8B-specific placement results leaves open the possibility that the observed deltas arise from an unoptimized or mismatched configuration at the larger scale.
  3. [Abstract] The quantitative results (+7.4% IFEval, +16.3% multi-turn, -13% jailbreak ASR) are presented without reported standard deviations across seeds, number of evaluation runs, or statistical significance tests, which are required to establish that the improvements exceed noise and are not driven by uncontrolled training dynamics or evaluation variance.
minor comments (1)
  1. [Abstract] The abstract refers to 'mechanistic activation analysis' confirming concentration of behavioral constraints but provides no figures, equations, or summary statistics from that analysis; a brief description or pointer to the relevant subsection would improve traceability.
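For orientation on what that analysis could look like: Figure 3's caption suggests a per-layer comparison of CAL output magnitude to self-attention output magnitude. Below is a hedged sketch of one way to collect such a heatmap with forward hooks; the attribute names model.layers, layer.self_attn, and layer.cal are assumptions about the implementation, not taken from the paper.

    import torch

    def cal_attn_ratio(model, batch):
        """Hypothetical probe for a Figure-3-style heatmap: per-layer ratio of the
        CAL block's output norm to the self-attention output norm."""
        norms, hooks = {}, []

        def make_hook(key):
            def hook(module, inputs, output):
                out = output[0] if isinstance(output, tuple) else output
                norms[key] = out.detach().float().norm().item()
            return hook

        for i, layer in enumerate(model.layers):
            hooks.append(layer.self_attn.register_forward_hook(make_hook((i, "attn"))))
            if hasattr(layer, "cal"):  # only the late-eighth layers carry a CAL block
                hooks.append(layer.cal.register_forward_hook(make_hook((i, "cal"))))

        with torch.no_grad():
            model(**batch)
        for h in hooks:
            h.remove()

        return {i: norms[(i, "cal")] / norms[(i, "attn")]
                for i in range(len(model.layers)) if (i, "cal") in norms}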

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest responses and indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim of controlling for parameter budget is load-bearing for attributing the +7.4% IFEval and +16.3% multi-turn gains to the cross-attention pathway, yet the addition of separate Q/K/V projection matrices for the system stream necessarily increases parameter count; no description is given of how this overhead is exactly offset (e.g., by reducing hidden dimension or layer width elsewhere) to keep total parameters matched to the baseline.

    Authors: We agree that explicit accounting for parameter matching is essential to support attribution of gains to the cross-attention mechanism. The additional Q/K/V projections for the system stream were offset by a corresponding reduction in the feed-forward intermediate dimension of the transformer layers, preserving identical total parameter counts for both the baseline and CALYREX models at 1.5B and 8B scales. We will revise the abstract and §4 to describe this compensation in detail, including tabulated parameter counts for transparency (a back-of-the-envelope version of this accounting is sketched after these responses). revision: yes

  2. Referee: [Abstract and §4] The optimal insertion point (final eighth of layers) is identified solely on the 1.5B model and transferred to 8B without re-ablation or sensitivity analysis; because the central scaling claim rests on this placement generalizing, the absence of 8B-specific placement results leaves open the possibility that the observed deltas arise from an unoptimized or mismatched configuration at the larger scale.

    Authors: We acknowledge that a complete re-ablation at 8B scale would provide stronger support for the generalization of the placement choice. The exhaustive search was performed at 1.5B owing to computational constraints. The mechanistic activation analysis offers supporting evidence that behavioral constraint signals concentrate in later layers as a scale-invariant property. We will expand §4 with a dedicated discussion of this transfer assumption and its limitations; additional 8B sensitivity checks will be included if resources permit. revision: partial

  3. Referee: [Abstract] The quantitative results (+7.4% IFEval, +16.3% multi-turn, -13% jailbreak ASR) are presented without reported standard deviations across seeds, number of evaluation runs, or statistical significance tests, which is required to establish that the improvements exceed noise and are not driven by uncontrolled training dynamics or evaluation variance.

    Authors: We concur that variance reporting and statistical testing are required for rigorous interpretation. The metrics derive from multiple independent training runs using distinct random seeds. We will update the abstract and §4 to specify the number of runs, include standard deviations, and report the outcomes of statistical significance tests (such as paired t-tests) to confirm the improvements exceed evaluation noise. revision: yes
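Two of the promised revisions are easy to sanity-check in code. First, on point 1, a back-of-the-envelope version of the parameter accounting, assuming square, bias-free d_model × d_model projections and a standard two-matrix FFN (a gated FFN such as SwiGLU would change the factor of 2); the numbers are illustrative, not the paper's.

    def cal_param_overhead(d_model: int, n_cal_layers: int) -> int:
        # Q, K, V, and output projections for the cross-attention stream,
        # each assumed to be a bias-free d_model x d_model matrix.
        return n_cal_layers * 4 * d_model * d_model

    def ffn_width_cut(d_model: int, n_cal_layers: int, n_layers: int) -> float:
        # Intermediate-width reduction per layer (spread across all layers) needed
        # to absorb the overhead; each unit of width costs 2 * d_model parameters
        # (up- and down-projection) in a standard two-matrix FFN.
        return cal_param_overhead(d_model, n_cal_layers) / (n_layers * 2 * d_model)

    # e.g. d_model=4096, 4 CAL layers, 32 layers total:
    # overhead ≈ 268M parameters, i.e. ≈ 1024 intermediate units shaved per layer.

Second, on point 3, a minimal sketch of the paired test the authors say they will report, assuming per-seed scores from matched baseline and CALYREX runs; the arrays are placeholders, not reported results.

    import numpy as np
    from scipy import stats

    # Hypothetical per-seed IFEval scores; substitute the actual run results.
    baseline = np.array([71.2, 70.8, 71.9, 70.5])
    calyrex  = np.array([78.4, 78.1, 79.0, 77.9])

    t_stat, p_value = stats.ttest_rel(calyrex, baseline)  # paired across seeds
    print(f"mean delta = {np.mean(calyrex - baseline):+.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")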

Circularity Check

0 steps flagged

No circularity in architecture proposal or empirical claims

full rationale

The paper proposes CALYREX as a cross-attention extension for system-prompt isolation, determines layer placement via ablation on a 1.5B model, and reports controlled 8B-scale results on IFEval, multi-turn adherence, and jailbreak resistance. No equations, fitted parameters, or derivations appear that reduce any claimed prediction or optimality to the inputs by construction. Placement optimality is presented as an empirical finding confirmed by activation analysis, not a self-definition or renamed known result. Central performance deltas are attributed to experimental comparisons with stated controls for data, backbone, and budget; no self-citation chain or ansatz smuggling is invoked as load-bearing justification. This is a standard non-circular empirical architecture paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that dedicated cross-attention creates effective structural isolation for system prompts, validated only through empirical ablation and scaling tests described at a high level.

free parameters (1)
  • layer insertion fraction
    Chosen as final eighth after ablation on 1.5B model
axioms (1)
  • domain assumption: Cross-attention between system prompt and input provides superior anchoring compared to standard self-attention
    Invoked as the core mechanism for isolating behavioral constraints

pith-pipeline@v0.9.0 · 5481 in / 1354 out tokens · 70690 ms · 2026-05-12T03:13:29.537922+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean · period8 · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    A placement ablation on a 1.5B backbone identifies insertion at the final eighth of layers as optimal, confirmed by mechanistic activation analysis showing behavioral constraints are naturally concentrated there. ... LATE8TH configuration

  • IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    late-stage and sparse placements consistently outperform early or dense interventions: final-layer cross-attention anchors formatting rules

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 2 internal anchors

  1. [2]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    URL: https://openreview.net/forum?id=fsW7wJGLBd

  2. [4]

    Instruction-Following Evaluation for Large Language Models

    doi: 10.48550/arXiv.2311.07911. URL: https://doi.org/10.48550/arXiv.2311.07911

  3. [5]

    Decode each token t_i in the sequence to its string representation (system-prompt span detection; see the sketch after this list)

  4. [6]

    Start detection: if "system" appears in the current decoded string and the previous token string is <|im_start|> (ChatML) or <|start_header_id|> (Llama-3 header), record s = i−1 (the inclusive index of the opening delimiter)

  5. [7]

    End detection: once inside the system-prompt span, if the current token string contains <|im_end|> or <|eot_id|>, record e = i+1 (one past the closing delimiter) and return (s, e)

  6. [8]

    You are a helpful AI assistant

    If no system prompt is found, return (0, 0); the CAL cross-attention is a no-op for that sample and its output is zero-masked. Training settings (Qwen2.5-1.5B / Llama-3.1-8B): learning rate 2×10⁻⁴ / 5×10⁻⁵, LR schedule cosine, warmup ratio 0.05 / 0.10, weight decay 0.01, optimizer AdamW (fused), per-device batch size 4, gradient accumulation steps …
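The span-detection rules quoted in anchors [5]-[7] read as a small algorithm, so a minimal sketch follows, assuming a HuggingFace-style tokenizer with convert_ids_to_tokens; the function name and any behavior beyond the quoted rules are assumptions.

    def find_system_span(token_ids, tokenizer):
        """Locate the (s, e) span of the system prompt in a tokenized chat sample.
        Sketch of the rules in anchors [5]-[7]; behavior beyond them is assumed."""
        s = None
        prev = ""
        for i, tok_id in enumerate(token_ids):
            tok = tokenizer.convert_ids_to_tokens(tok_id)      # string form of token i
            if s is None:
                # Start: "system" right after a ChatML or Llama-3 opening delimiter.
                if "system" in tok and prev in ("<|im_start|>", "<|start_header_id|>"):
                    s = i - 1                                   # inclusive index of the opener
            elif "<|im_end|>" in tok or "<|eot_id|>" in tok:
                return s, i + 1                                 # one past the closing delimiter
            prev = tok
        return 0, 0  # no system prompt: the CAL cross-attention is a no-op, zero-masked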