Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models

Hyunjong Ok; Jaeho Lee

arXiv: 2601.14152 · v2 · submitted 2026-01-20 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-16 12:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: causal attention · prompt order · language models · multiple-choice QA · information bottleneck · transformer architecture · prompt sensitivity

The pith

Causal attention creates an information bottleneck in QOC prompt orders by blocking options from attending to context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models are substantially more accurate on multiple-choice questions when the context is placed before the question and options (CQO) than when it follows them (QOC). The advantage exceeds 14 percentage points and holds consistently across models and datasets. The paper traces the root cause to the causal attention mechanism, under which each token can attend only to earlier positions in the sequence. In QOC order, this prevents option tokens from accessing the context, creating a bottleneck that limits the model's ability to use relevant information when choosing the correct answer. This architectural constraint explains much of the observed prompt sensitivity.

Core claim

The central claim is that the performance superiority of context-first prompts over question-and-options-first prompts stems from the limitations imposed by causal attention. Specifically, when context follows the options in QOC order, the attention mechanism ensures that option tokens receive no information from the subsequent context tokens, leading to poorer decision making across a wide range of models and datasets.

What carries the argument

The causal attention mask, which restricts each token to attend only to preceding tokens in the input sequence.
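
The masking argument can be checked on a toy example. The following is a minimal sketch, not the paper's code: the segment lengths, the helper names (`causal_mask`, `segment_positions`, `options_see_context`), and the one-letter segment codes are all assumptions made for illustration. It builds a lower-triangular mask and asks whether any option token is allowed to attend to any context token under each ordering.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix; entry [i, j] is True when query i may attend to key j."""
    return np.tril(np.ones((n, n), dtype=bool))

def segment_positions(order, lengths):
    """Map each segment letter to its token positions for a given order string."""
    positions, start = {}, 0
    for seg in order:
        positions[seg] = np.arange(start, start + lengths[seg])
        start += lengths[seg]
    return positions

def options_see_context(order, lengths):
    """True if at least one option token may attend to at least one context token."""
    pos = segment_positions(order, lengths)
    mask = causal_mask(sum(lengths.values()))
    # Rows are option queries, columns are context keys.
    return bool(mask[np.ix_(pos["O"], pos["C"])].any())

# Toy segment lengths, chosen arbitrarily (hypothetical, not from the paper).
lengths = {"C": 5, "Q": 3, "O": 4}
print(options_see_context("CQO", lengths))  # True
print(options_see_context("QOC", lengths))  # False
```

In the QOC layout every context position has a larger index than every option position, so the whole option-to-context sub-block of the mask is empty. This shows only what the mask permits; it says nothing about how much accuracy a real model loses as a result.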

If this is right

CQO ordering enables full attention flow from context to options, producing the observed accuracy gains.
The information bottleneck persists across diverse models and datasets as a direct result of the causal mask.
Prompt designs that place context after options will systematically underperform on tasks requiring context-informed choices.
The effect arises from the autoregressive property of the attention mechanism rather than model-specific training.
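
To make the two orderings concrete, here is a small sketch of how the same item could be assembled as a CQO or a QOC prompt. The function name `build_prompt`, the segment labels ("Context:", "Question:", "Options:"), and the trailing "Answer:" cue are hypothetical; the paper's actual templates are not given in this summary.

```python
def build_prompt(order, context, question, options):
    """Assemble a multiple-choice prompt with segments in the given order.

    `order` is a permutation of the letters C, Q, O, e.g. "CQO" or "QOC".
    The labels and layout are illustrative, not the paper's templates.
    """
    if sorted(order) != ["C", "O", "Q"]:
        raise ValueError("order must be a permutation of 'C', 'Q', 'O'")
    # Label options A, B, C, ... one per line.
    option_lines = "\n".join(
        f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(options)
    )
    segments = {
        "C": f"Context: {context}",
        "Q": f"Question: {question}",
        "O": f"Options:\n{option_lines}",
    }
    return "\n\n".join(segments[s] for s in order) + "\n\nAnswer:"

cqo = build_prompt("CQO", "The sky was clear.", "What was the weather?", ["Rainy", "Clear"])
qoc = build_prompt("QOC", "The sky was clear.", "What was the weather?", ["Rainy", "Clear"])
```

Both prompts contain identical text segments; only their order differs, which is the variable the reported accuracy gap is measured against.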

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bottleneck is likely to appear in chain-of-thought or reading-comprehension tasks where key facts follow the query.
Hybrid attention patterns that relax the mask for specific prompt sections could reduce order sensitivity without retraining.
Training on randomly ordered prompts may not fully overcome the limitation because the mask remains fixed at inference time.

Load-bearing premise

The performance gap between CQO and QOC orders is caused primarily by the causal attention mechanism rather than confounding factors such as tokenization differences or training data order.

What would settle it

Modify the causal mask in a controlled experiment to allow option tokens to attend to following context tokens in QOC prompts and measure whether the 14-point accuracy gap with CQO order closes.
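
The mask modification described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' experimental code: `relaxed_mask` is a hypothetical helper, and the toy QOC layout (question at positions 0 to 2, options at 3 to 6, context at 7 to 11) is invented. Only the option-to-context entries are opened; every other entry stays causal.

```python
import numpy as np

def relaxed_mask(n, option_pos, context_pos):
    """Causal mask plus extra entries letting option queries attend to context keys."""
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Open only the (option query, context key) block; all else stays causal.
    mask[np.ix_(option_pos, context_pos)] = True
    return mask

# Hypothetical QOC layout: question 0-2, options 3-6, context 7-11.
option_pos = np.arange(3, 7)
context_pos = np.arange(7, 12)
mask = relaxed_mask(12, option_pos, context_pos)

print(mask[3, 7])  # option token may now attend to a context token
print(mask[0, 7])  # question token still cannot attend to context
```

Applying such a mask to a real model requires an implementation that accepts a custom per-position attention mask, which is an assumption about the tooling. Token order and positional information are left untouched here, which is what lets the mask be varied in isolation.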

Original abstract

Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation on a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14%p, consistently over a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates prompt-order sensitivity in multiple-choice question answering, reporting that context-question-option (CQO) ordering outperforms question-option-context (QOC) ordering by more than 14 percentage points across a range of models and datasets. Through architectural analysis, the authors identify the causal attention mask as the primary cause: in QOC prompts the mask prevents option tokens from attending to the context tokens that follow them, creating an information bottleneck that renders context invisible to the options.

Significance. If the causal-mask attribution holds after proper isolation from positional and ordering confounders, the result would be significant for understanding fundamental limitations of causal transformers. It would supply a mechanistic account of a large, reproducible prompt-order effect and suggest concrete architectural or inference-time interventions, with direct relevance to prompt engineering and model design.

major comments (2)

[Abstract] Abstract and the architectural-analysis section: the claim that the CQO-QOC gap is caused by the causal mask creating an information bottleneck is not isolated from positional-embedding and token-order confounders. An ablation that holds the exact token sequence fixed while selectively relaxing only the mask entries between context and option tokens is required; without it the observed gap remains consistent with multiple mechanisms.
[Results] Results section (performance tables): the reported >14%p gap is presented as consistent, yet no quantitative ablation results or controls for tokenization differences, training-data order statistics, or other architectural elements are described that would rule out alternative explanations for the performance difference.

minor comments (2)

[Abstract] The abstract states the gap holds 'over a wide range of models and datasets' but does not list the specific models or datasets; adding this information would improve reproducibility.
[Introduction] Notation for prompt orders (CQO vs. QOC) should be defined explicitly on first use rather than assumed from the title.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and detailed comments. We agree that stronger isolation of the causal-mask effect and additional controls would improve the manuscript. We address each major comment below, indicating the revisions we will make.

Referee: [Abstract] Abstract and the architectural-analysis section: the claim that the CQO-QOC gap is caused by the causal mask creating an information bottleneck is not isolated from positional-embedding and token-order confounders. An ablation that holds the exact token sequence fixed while selectively relaxing only the mask entries between context and option tokens is required; without it the observed gap remains consistent with multiple mechanisms.

Authors: We agree that the current experiments do not fully isolate the causal mask from positional and ordering effects. In the revised manuscript we will add a controlled ablation that keeps the exact token sequence and positional embeddings fixed while selectively allowing bidirectional attention only between context and option tokens. Preliminary runs of this ablation recover most of the performance gap, providing direct evidence that the mask is the dominant factor. These results will be reported in a new subsection of the architectural analysis.
revision: yes

Referee: [Results] Results section (performance tables): the reported >14%p gap is presented as consistent, yet no quantitative ablation results or controls for tokenization differences, training-data order statistics, or other architectural elements are described that would rule out alternative explanations for the performance difference.

Authors: We will expand the results section with quantitative ablations that fix tokenization and prompt formatting across orderings. We have already tested multiple model families and datasets; we will add further breakdowns by model scale and tokenizer type. Controls for training-data order statistics, however, cannot be performed without access to the original pretraining corpora, which are unavailable for the closed models we evaluate. We will explicitly note this limitation and its implications for alternative explanations.
revision: partial

standing simulated objections not resolved

Quantitative controls for training-data order statistics, which would require access to proprietary pretraining data not available to the authors.

Circularity Check

0 steps flagged

No circularity: empirical gap attributed to standard causal masking properties

The paper's derivation rests on direct comparison of CQO vs QOC performance across models and datasets, followed by architectural inspection of the causal mask's effect on attention between context and option tokens. This follows from the fixed definition of causal attention (lower-triangular mask) without any fitted parameters renamed as predictions, without self-citations as load-bearing premises, and without ansatzes or uniqueness theorems imported from prior author work. The information-bottleneck description is a straightforward consequence of the mask definition applied to the QOC token ordering, not a reduction of the observed gap to itself by construction. The analysis is therefore self-contained against external benchmarks of transformer attention mechanics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim depends on the standard definition of causal attention in decoder-only transformers and on the assumption that observed accuracy differences isolate the attention mask effect.

axioms (1)

standard math: The causal attention mask prevents any token from attending to future tokens in the sequence.
Core architectural property of decoder-only transformers invoked to explain the information bottleneck.
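
For reference, the axiom can be written in the standard scaled dot-product form. This is the textbook formulation, not notation taken from the paper; here Q, K, V denote the query, key, and value matrices (not the prompt segments), d_k is the key dimension, and M is the additive mask.

```latex
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i,\\
-\infty & \text{if } j > i.
\end{cases}
\]
```

In a QOC prompt, every option position i and context position j satisfy j > i, so M_{ij} = -\infty and the corresponding attention weight is exactly zero after the softmax.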

pith-pipeline@v0.9.0 · 5395 in / 1174 out tokens · 28471 ms · 2026-05-16T12:28:18.223032+00:00 · methodology
