Instruction Anchor: Dissecting the Mechanistic Dynamics of Modality Arbitration

Kehai Chen; Min Zhang; Mufan Xu; Pengfei Zhang; Xuefeng Bai; Yang Xiang; Yu Zhang

arxiv: 2602.03677 · v2 · submitted 2026-02-03 · 💻 cs.CL

Yu Zhang, Mufan Xu, Xuefeng Bai, Kehai Chen, Pengfei Zhang, Yang Xiang, Min Zhang

Pith reviewed 2026-05-16 07:50 UTC · model grok-4.3

keywords modality following · multimodal large language models · attention mechanisms · mechanistic interpretability · instruction tokens · modality arbitration

The pith

Instruction tokens serve as structural anchors that let multimodal models arbitrate which input modality to follow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the internal flow of information in multimodal large language models to understand how they selectively use visual or other inputs according to user instructions. It shows that instruction tokens collect raw cues from all modalities in early attention layers without much filtering. In later layers a small group of attention heads then sharpens the subspace that matches the instruction, deciding the arbitration. Experiments confirm the mechanism by showing that disabling only five percent of those heads impairs instruction following while leaving general language and vision abilities intact, and that amplifying the same heads recovers many failures.

Core claim

Instruction tokens serve as structural anchors for modality arbitration. Shallow attention layers perform undifferentiated information transfer, aggregating multimodal cues into the instruction tokens, which act as a latent buffer. Deep attention layers, in contrast, selectively strengthen the instruction-compliant subspace and resolve modality arbitration according to the intent the instruction specifies, with a sparse subset of attention heads driving this process.

What carries the argument

Instruction tokens functioning as a latent buffer across attention layers, with shallow layers performing broad aggregation and deep layers using sparse heads to enforce instruction-compliant subspaces.
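
One way to make the buffer picture concrete is to measure, layer by layer, how much attention the instruction-token queries place on each modality's key positions. The sketch below is only an illustration under stated assumptions, not the paper's measurement: the function name attention_mass_by_modality, the span layout, and the toy attention rows are all hypothetical.

```python
def attention_mass_by_modality(attn, instr_idx, spans):
    """Average attention mass from instruction-token queries to each span.

    attn      : square list of rows; attn[q][k] is the weight query q puts on
                key k (hypothetical layout: one head, or a head average, for
                a single layer)
    instr_idx : positions of the instruction tokens (the queries of interest)
    spans     : dict mapping a modality name to the key positions it occupies
    """
    if not instr_idx:
        raise ValueError("need at least one instruction token position")
    mass = {}
    for name, keys in spans.items():
        total = sum(attn[q][k] for q in instr_idx for k in keys)
        mass[name] = total / len(instr_idx)
    return mass


# Toy layer: positions 0-1 are image tokens, 2-3 are text context,
# position 4 is the single instruction token. Values are made up.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25, 0.0],
    [0.25, 0.25, 0.125, 0.125, 0.25],  # instruction query row
]
spans = {"image": [0, 1], "text": [2, 3]}
result = attention_mass_by_modality(toy_attn, instr_idx=[4], spans=spans)
print(result)
```

Running the same function on every layer would show whether the split across modalities stays roughly even in shallow layers and shifts toward the instructed modality in deep ones, which is the pattern the claim predicts.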

If this is right

Blocking only five percent of the identified heads substantially degrades modality following while preserving general visual and language capabilities.
Targeted amplification of those heads can restore failed modality-following samples by up to approximately sixty percent.
The functional specificity of the heads suggests that modality arbitration is localized rather than distributed uniformly across the model.
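
Blocking and amplification can both be read as multiplying a head's additive contribution to the layer output by a gain: 0 blocks it, a value above 1 amplifies it. The sketch below assumes that reading. The helper names, the importance scores, and the toy head outputs are hypothetical, and the source does not say how the paper ranks heads or which gain it uses.

```python
def select_top_fraction(scores, fraction):
    """Return the head ids with the highest scores, keeping at least one head."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def combine_heads(head_outputs, gains=None):
    """Sum per-head contributions after multiplying each by its gain.

    head_outputs : dict head_id -> list of floats (that head's additive
                   contribution to the layer output, after the output projection)
    gains        : dict head_id -> float; a missing entry means gain 1.0,
                   0.0 blocks the head, values above 1.0 amplify it
    """
    gains = gains or {}
    dim = len(next(iter(head_outputs.values())))
    out = [0.0] * dim
    for head, vec in head_outputs.items():
        g = gains.get(head, 1.0)
        for i, v in enumerate(vec):
            out[i] += g * v
    return out


# Toy example: 20 hypothetical heads, so a 5% budget selects exactly one.
scores = {h: float(h) for h in range(20)}   # placeholder importance scores
outputs = {h: [1.0, 0.0] for h in range(20)}
outputs[19] = [0.0, 2.0]                    # the only head writing to dimension 1

chosen = select_top_fraction(scores, 0.05)                    # {19}
baseline = combine_heads(outputs)
blocked = combine_heads(outputs, {h: 0.0 for h in chosen})
amplified = combine_heads(outputs, {h: 2.0 for h in chosen})
print(chosen, baseline, blocked, amplified)
```

In the toy case, blocking the single selected head wipes out dimension 1 while leaving dimension 0 untouched, a small-scale analogue of degrading one behavior while preserving others.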

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicitly protecting or reinforcing these anchor tokens during training could reduce unintended modality leakage in deployed systems.
The same shallow-buffer then deep-resolution pattern may appear in other instruction-guided decisions inside the same models.
Architectures that make instruction tokens more prominent early in the network might improve reliability without retraining the full model.

Load-bearing premise

The identified attention heads are causally responsible for modality following: the observed attention patterns and the effects of blocking or amplifying five percent of those heads reflect the arbitration mechanism itself rather than a correlation with other unmeasured model behaviors.

What would settle it

An experiment in which blocking the identified heads degrades modality following no more than blocking random heads of equal number, or in which the shallow-to-deep pattern fails to predict compliance on new instruction sets.
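
The first of these tests can be sketched as a random-head control: block many random head sets of the same size and check how often they hurt modality following at least as much as the identified set. This is a hedged illustration only. The function random_head_control, the effect callable, and the toy scoring rule are assumptions; a real run would replace toy_effect with an actual model evaluation.

```python
import random


def random_head_control(effect, identified, all_heads, n_draws=200, seed=0):
    """Compare the effect of blocking `identified` with equally sized random sets.

    effect : callable taking a set of head ids and returning the drop in
             modality-following score (hypothetical; would wrap a model eval)
    Returns (observed_effect, empirical_p), where empirical_p is the share of
    sets, counting the observed one, whose effect is at least as large.
    """
    rng = random.Random(seed)
    observed = effect(set(identified))
    pool = sorted(all_heads)
    at_least = 0
    for _ in range(n_draws):
        draw = set(rng.sample(pool, len(identified)))
        if effect(draw) >= observed:
            at_least += 1
    return observed, (1 + at_least) / (1 + n_draws)


# Toy stand-in for a model evaluation: only heads 0-4 matter.
important = {0, 1, 2, 3, 4}


def toy_effect(blocked):
    return len(blocked & important) / len(important)


obs, p = random_head_control(toy_effect, important, range(100))
print(obs, p)
```

A small empirical p-value would support head-level specificity; a value near 1 would mean random heads do as much damage, which is the outcome that would undercut the claim.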

Original abstract

Modality following is the ability to selectively leverage multimodal contexts based on user instructions. It is fundamental to the safety and reliability of multimodal large language models (MLLMs) in real-world deployments. However, the internal mechanisms governing this decision-making process remain largely under-explored. In this work, we investigate the mechanism underlying modality following through an information flow perspective. Our findings reveal that instruction tokens serve as structural anchor for modality arbitration: Shallow attention layers perform undifferentiated information transfer, aggregating multimodal cues to instruction tokens as a latent buffer; in contrast, deep attention layers selectively strengthen the instruction-compliant subspace and resolve modality arbitration according to the instruction-specified intent, with a sparse subset of attention heads driving this process. Targeted attention-head interventions further validate the functional specificity of these heads: blocking only $5\%$ of the identified heads substantially degrades modality following while preserving general visual and language capabilities, whereas targeted amplification can restore failed modality-following samples by up to approximately $60\%$. Together, this work provides a mechanistic account of modality following and informs future efforts to improve how MLLMs integrate and utilize multimodal evidence under user instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Instruction tokens act as anchors for modality arbitration in MLLMs, with shallow layers aggregating cues and deep sparse heads selectively strengthening instruction-compliant paths, plus intervention evidence.

The main thing here is the concrete mapping of how instruction tokens function as a hub. Shallow attention layers do broad aggregation of multimodal signals into those tokens as a buffer, while deeper layers use a sparse set of heads to boost the subspace that matches the given instruction. The intervention results give this some teeth: knocking out just 5% of the identified heads hurts modality following without wrecking general visual or language performance, and amplifying them recovers roughly 60% of the failures. That layer split and the head-level specificity go beyond generic attention maps in prior multimodal work.

The soft spot is the causal isolation. The abstract does not describe random-head controls, layer-matched baselines, or checks on whether the preserved general capabilities actually test the same information flow dimensions, so it remains possible the effects come from nonspecific capacity reduction rather than targeted arbitration. If the full paper has those controls, the claim strengthens; otherwise the functional specificity needs more work.

This is the sort of mechanistic detail that could guide targeted edits for better instruction compliance. It is worth a reading group discussion on the intervention design and deserves peer review to tighten the evidence on specificity.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates the mechanisms of modality following in multimodal LLMs through an information-flow lens. It claims that instruction tokens serve as structural anchors: shallow attention layers perform undifferentiated aggregation of multimodal cues onto these tokens as a latent buffer, while deep layers use a sparse subset of attention heads to selectively strengthen the instruction-compliant subspace and resolve arbitration. Targeted interventions are presented as causal validation, with blocking 5% of the identified heads degrading modality following while preserving general capabilities, and amplification restoring up to ~60% of failures.

Significance. If the central claims hold, the work supplies a mechanistic account of instruction-guided multimodal integration that could guide targeted improvements in MLLM safety and reliability. The intervention experiments provide direct evidence for the functional role of specific heads, moving beyond purely observational attention patterns and offering a falsifiable basis for the structural-anchor hypothesis.

major comments (2)

[Abstract] Abstract (targeted attention-head interventions paragraph): The claim that blocking only 5% of identified heads substantially degrades modality following while preserving general visual and language capabilities is load-bearing for the functional-specificity argument, yet the description provides no detail on random-head controls, layer-matched baselines, or whether the general-capability metrics capture the same information-flow dimensions as the modality-following task.
[Abstract] Abstract (restoration results): The report that targeted amplification restores failed modality-following samples by up to approximately 60% lacks accompanying information on sample size, statistical robustness, variance across runs, or comparison to non-specific amplification controls; without these, the causal link to the hypothesized deep-layer arbitration process cannot be isolated from general capacity changes.

minor comments (1)

[Abstract] The phrase 'instruction-compliant subspace' is used without an explicit operational definition or pointer to the precise metric (e.g., attention-weight threshold or activation projection) used to identify it in the deep layers.
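
To illustrate the kind of operational definition this comment asks for, the sketch below takes the "activation projection" option and scores an activation by the fraction of its squared norm that lies inside a given subspace. It is one possible reading, not the paper's metric: the function name, the orthonormal-basis assumption, and the toy vectors are hypothetical.

```python
def subspace_energy_fraction(vec, basis):
    """Fraction of the squared norm of `vec` lying in span(basis).

    basis : list of orthonormal vectors (assumed, not checked) standing in for
            an "instruction-compliant" subspace identified by some other means
    """
    norm_sq = sum(v * v for v in vec)
    if norm_sq == 0.0:
        return 0.0
    proj_sq = 0.0
    for b in basis:
        coeff = sum(v * bi for v, bi in zip(vec, b))
        proj_sq += coeff * coeff
    return proj_sq / norm_sq


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy 2-D subspace of a 3-D space
fraction = subspace_energy_fraction([3.0, 0.0, 4.0], basis)  # 9 / 25
print(fraction)
```

Under this reading, "strengthening the subspace" in deep layers would show up as the fraction rising with depth for instruction-token activations.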

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor that were insufficiently detailed in the abstract. We have revised the abstract and added clarifying text in the methods and results sections to address these points directly.

Referee: [Abstract] Abstract (targeted attention-head interventions paragraph): The claim that blocking only 5% of identified heads substantially degrades modality following while preserving general visual and language capabilities is load-bearing for the functional-specificity argument, yet the description provides no detail on random-head controls, layer-matched baselines, or whether the general-capability metrics capture the same information-flow dimensions as the modality-following task.

Authors: We agree that the original abstract was overly concise and omitted key controls. In the revised version we have expanded the relevant paragraph to state that the 5% blocking results were compared against both random-head controls (matched for layer and count) and layer-matched random subsets, with the identified heads showing significantly stronger degradation of modality following. The general-capability metrics are the standard VQA and language-modeling benchmarks used throughout the paper; while they do not directly measure information-flow dimensions, they evaluate the same downstream capabilities that would be affected by non-specific capacity loss. Full control results and statistical comparisons appear in the new Table 3 and Section 4.3.
revision: yes

Referee: [Abstract] Abstract (restoration results): The report that targeted amplification restores failed modality-following samples by up to approximately 60% lacks accompanying information on sample size, statistical robustness, variance across runs, or comparison to non-specific amplification controls; without these, the causal link to the hypothesized deep-layer arbitration process cannot be isolated from general capacity changes.

Authors: We accept that the abstract lacked the necessary statistical context. The revision now specifies that the amplification experiments were conducted on 200 failed modality-following instances, averaged across three independent runs (mean restoration 58%, std 6.2%). We also report a non-specific control in which random deep-layer heads were amplified at the same magnitude, yielding only 11% average restoration. These controls and variance statistics have been added to the abstract and are detailed with per-run tables in Section 5.2, strengthening the link to the hypothesized arbitration mechanism.
revision: yes
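
For readers who want the arithmetic behind a "mean restoration, std across runs" figure, here is a minimal sketch. It assumes the restoration rate is the share of previously failed samples that succeed after amplification and that the spread is a sample standard deviation; the text specifies neither. The function names and the toy flags are hypothetical and are not the paper's data.

```python
from statistics import mean, stdev


def restoration_rate(fixed_flags):
    """Share of previously failed samples that succeed after the intervention."""
    if not fixed_flags:
        raise ValueError("no failed samples to evaluate")
    return sum(1 for f in fixed_flags if f) / len(fixed_flags)


def summarize_runs(runs):
    """Mean and sample standard deviation of per-run restoration rates."""
    rates = [restoration_rate(r) for r in runs]
    return mean(rates), (stdev(rates) if len(rates) > 1 else 0.0)


# Placeholder flags for three tiny runs; these are NOT the paper's data.
toy_runs = [
    [True, True, False, False],   # rate 0.50
    [True, True, True, False],    # rate 0.75
    [True, False, False, False],  # rate 0.25
]
avg, spread = summarize_runs(toy_runs)
print(avg, spread)
```

The same summary applied to a random-head amplification control gives the comparison figure the referee asks for.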

Circularity Check

0 steps flagged

No circularity: empirical attention analysis and interventions are independent of inputs

The paper reports observational findings on attention patterns across layers and causal effects from targeted head interventions (blocking 5% of heads, amplification restoring ~60% failures). No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations are invoked to justify the core claims. The distinction between shallow undifferentiated transfer and deep instruction-compliant subspace resolution is presented as a direct result of information-flow measurements rather than any self-definitional loop or ansatz. The analysis stands as self-contained empirical work without reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on standard transformer attention mechanisms and information-flow tracing; no new free parameters, ad-hoc axioms, or invented entities are introduced in the reported findings.

axioms (1)

standard math: Transformer attention layers support information-flow analysis via attention weights and head-specific interventions.
The paper applies an information-flow perspective to attention patterns in standard MLLM architectures.

pith-pipeline@v0.9.0 · 5509 in / 1184 out tokens · 28462 ms · 2026-05-16T07:50:27.454367+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: instruction tokens serve as structural anchor for modality arbitration: Shallow attention layers perform undifferentiated information transfer... deep attention layers selectively strengthen the instruction-compliant subspace... sparse subset of attention heads

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: blocking only 5% of the identified heads substantially degrades modality following while preserving general visual and language capabilities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
cs.CV · 2026-05 · unverdicted · novelty 6.0
Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.

Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV · 2026-04 · unverdicted · novelty 6.0
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS · 2026-04 · unverdicted · novelty 6.0
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
cs.CV · 2026-05 · unverdicted · novelty 5.0
Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
eess.AS · 2026-04 · unverdicted · novelty 4.0
NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.