pith. machine review for the scientific record.

arxiv: 2604.27249 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI

Recognition: unknown

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 09:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords adversarial instructions · positional bias · LLM evaluation · sandbagging · MMLU-Pro · instruction complexity · content engagement

The pith

Complex adversarial instructions cause LLMs to collapse onto positional shortcuts instead of engaging with question content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models under adversarial instructions actually process question content or fall back on positional patterns. It applies a six-condition gradient of instructions to Llama-3-8B and Llama-3.1-8B models on 2,000 MMLU-Pro items and measures responses with two independent screens: positional entropy and difficulty-accuracy correlation. Vague instructions reduce accuracy while models still track difficulty. Standard sandbagging instructions produce partial positional collapse with some content engagement. The most complex two-step instruction triggers near-total collapse onto one position with zero content sensitivity. The attractor position matches each model's null-prompt default and appears across both models and multiple domains.
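Neither screen is pinned down at the code level in the pith above; a minimal sketch of how the two criteria could be computed, assuming Shannon entropy over chosen option positions and a Pearson correlation against a per-item difficulty estimate (both are assumptions, not the paper's stated estimators):

    import numpy as np
    from scipy.stats import pearsonr

    def position_entropy(responses, n_options=10):
        # Shannon entropy (bits) of the response-position distribution.
        # `responses` holds chosen option indices (0..n_options-1); MMLU-Pro
        # items carry up to ten options. Entropy near log2(n_options) means
        # responses spread across positions; near 0 means collapse onto one.
        counts = np.bincount(np.asarray(responses), minlength=n_options)
        p = counts / counts.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def difficulty_accuracy_correlation(difficulty, correct):
        # Correlation between per-item difficulty (e.g. a baseline error
        # rate) and 0/1 correctness under the adversarial instruction. A
        # content-engaged model should track difficulty; a position-locked
        # responder should show no systematic relationship.
        return pearsonr(difficulty, correct)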

Core claim

A gradient of adversarial instructions on instruction-tuned LLMs reveals three regimes rather than a smooth transition. Vague instructions reduce accuracy while preserving content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. The collapsed position matches the models' content-absent null-prompt defaults, the pattern replicates across models and domains, and the two screening criteria show only 50% concordance.

What carries the argument

The six-condition adversarial instruction-specificity gradient, jointly screened by response-position entropy to detect distributional collapse and by difficulty-accuracy correlation to detect content engagement.
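How the two screens could combine into the three regimes is sketched below; the thresholds are placeholders for illustration, not cutoffs taken from the paper:

    def classify_regime(entropy_bits, diff_corr,
                        entropy_floor=1.0, corr_floor=-0.05):
        # Illustrative joint screen. A collapsed position distribution has
        # low entropy; content engagement shows up as a negative
        # difficulty-accuracy correlation (harder items answered less often).
        collapsed = entropy_bits < entropy_floor
        engaged = diff_corr < corr_floor
        if not collapsed and engaged:
            return "content-engaged (vague instructions)"
        if collapsed and engaged:
            return "partial engagement (standard sandbagging)"
        if collapsed and not engaged:
            return "content-blind positional collapse (two-step)"
        return "unscreened / mixed"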

Load-bearing premise

That the difficulty-accuracy correlation measures content engagement independently of positional bias and that observed collapse is caused by instruction complexity.

What would settle it

A positive difficulty-accuracy correlation under the two-step answer-aware instruction, or substantial content sensitivity remaining after the positional collapse occurs.

Figures

Figures reproduced from arXiv: 2604.27249 by Jon-Paul Cacioli.

Figure 1: Instruction-specificity gradient across six measures for both models. The S5 collapse …
Figure 2: Response-position distributions for selected conditions. Top: Llama-3-8B. Bottom: Llama-3.1-8B.
Figure 3: Accuracy by correct-answer position. Top: Llama-3-8B. Bottom: Llama-3.1-8B.
original abstract

When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse, with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. This was the only multi-step instruction tested, and it produced the most extreme shortcut. The attractor position matches each model's content-absent null-prompt default. The effect replicates across both models and four academic domains. Distributional collapse and content engagement can co-occur (50% concordance between screening criteria), indicating that entropy-based screening and difficulty-based content assessment capture partially independent dimensions of response validity. Results suggest that instruction complexity can determine whether adversarial compliance uses content-aware or content-blind mechanisms in small instruction-tuned LLMs under greedy decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that adversarial instructions directing LLMs to underperform on multiple-choice questions (MMLU-Pro) produce three distinct regimes depending on instruction specificity: vague instructions yield moderate accuracy drops with preserved content engagement; standard sandbagging and capability-imitation instructions cause positional entropy collapse alongside partial content engagement; and a two-step answer-aware avoidance instruction triggers extreme positional collapse (99.9% and 87.4% concentration on one position) with zero content sensitivity. Tested on Llama-3-8B and Llama-3.1-8B across 2,000 items and four domains under greedy decoding, the work uses response-position entropy and difficulty-accuracy correlation as joint criteria, finds 50% concordance between them, notes the attractor matches null-prompt defaults, and concludes that instruction complexity determines whether compliance is content-aware or content-blind.

Significance. If the empirical patterns hold, the work demonstrates that complex adversarial instructions can induce content-blind positional shortcuts in small instruction-tuned LLMs, with direct implications for the validity of capability evaluations and the design of robust adversarial testing protocols. The replication across two models and four domains, plus the finding that entropy-based and difficulty-based criteria capture partially independent dimensions, provides a useful empirical map of response regimes and underscores the need for multi-criterion screening in LLM assessments.

major comments (3)
  1. [experimental design and results on the six-condition gradient] The central causal claim that 'instruction complexity induces positional collapse' rests on a six-condition gradient in which only one condition is multi-step (the two-step answer-aware avoidance prompt); this condition alone produces the reported 99.9%/87.4% collapse and zero difficulty-accuracy correlation. Because step count is not varied independently of answer-awareness, avoidance framing, or prompt length, the design cannot isolate complexity as the inducing variable. See the condition specifications and regime characterization in the main experimental section.
  2. [results reporting of percentages and correlations] No statistical tests, confidence intervals, or error bars are reported for the key percentages (99.9%, 87.4%), the difficulty-accuracy correlations, or the 50% concordance rate between screening criteria. This makes it impossible to assess whether the three-regime distinction is statistically reliable or sensitive to sampling variability in the 2,000-item set. See the results paragraphs reporting these figures.
  3. [methods and results on content-engagement criterion] The difficulty-accuracy correlation is presented as an 'independent' content-engagement criterion, yet the paper provides no ablation or control showing that this metric remains valid once positional bias is present (e.g., when responses concentrate on one option regardless of item difficulty). The observed zero correlation in the extreme condition could therefore be an artifact of the collapse rather than evidence of absent content sensitivity. See the definition and application of the content-engagement criterion.
minor comments (2)
  1. [methods] Exact prompt wording for all six conditions, including the null-prompt baseline, should be provided in an appendix or table to allow replication and to verify controls for length and lexical overlap.
  2. [results] The manuscript would benefit from a table summarizing the six conditions, their key features (step count, answer-awareness, etc.), and the measured entropy and correlation values side-by-side.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments in detail below, providing clarifications and indicating where revisions have been or will be made to the manuscript.

point-by-point responses
  1. Referee: [experimental design and results on the six-condition gradient] The central causal claim that 'instruction complexity induces positional collapse' rests on a six-condition gradient in which only one condition is multi-step (the two-step answer-aware avoidance prompt); this condition alone produces the reported 99.9%/87.4% collapse and zero difficulty-accuracy correlation. Because step count is not varied independently of answer-awareness, avoidance framing, or prompt length, the design cannot isolate complexity as the inducing variable. See the condition specifications and regime characterization in the main experimental section.

    Authors: The six-condition gradient was designed to systematically increase the specificity and adversarial sophistication of the instructions, culminating in the two-step answer-aware avoidance prompt as the most complex. We note in the manuscript that this was the only multi-step instruction tested and produced the most extreme shortcut. While the design does not orthogonally vary step count from other factors, the progressive nature of the conditions and the replication across models and domains provide evidence for the role of complexity. We will revise the discussion to more explicitly acknowledge the confounding and to propose future experiments that isolate individual factors such as step count. revision: partial

  2. Referee: [results reporting of percentages and correlations] No statistical tests, confidence intervals, or error bars are reported for the key percentages (99.9%, 87.4%), the difficulty-accuracy correlations, or the 50% concordance rate between screening criteria. This makes it impossible to assess whether the three-regime distinction is statistically reliable or sensitive to sampling variability in the 2,000-item set. See the results paragraphs reporting these figures.

    Authors: We agree that including statistical measures would enhance the rigor of the results. In the revised version, we will add bootstrap-derived 95% confidence intervals for the positional concentration percentages, the difficulty-accuracy correlations, and the concordance rate (a sketch of such a bootstrap appears after these responses). This will help demonstrate the reliability of the observed regimes given the sample size of 2,000 items. revision: yes

  3. Referee: [methods and results on content-engagement criterion] The difficulty-accuracy correlation is presented as an 'independent' content-engagement criterion, yet the paper provides no ablation or control showing that this metric remains valid once positional bias is present (e.g., when responses concentrate on one option regardless of item difficulty). The observed zero correlation in the extreme condition could therefore be an artifact of the collapse rather than evidence of absent content sensitivity. See the definition and application of the content-engagement criterion.

    Authors: We will clarify in the methods that the difficulty-accuracy correlation serves as a measure of whether the model's performance under the instruction varies with the inherent difficulty of the items, which would be expected if content is engaged. In the extreme collapse case, the fixed positional choice leads to uniform accuracy independent of difficulty, resulting in zero correlation. This indicates content-blind behavior because a content-aware model, even with bias, would likely show some variation. To address the potential artifact concern, we have added a supplementary note explaining why the metric is still informative and will include a brief simulation in the revision showing that pure positional bias without content sensitivity produces zero correlation, consistent with our findings (a minimal version of such a simulation is sketched below). revision: partial
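The bootstrap promised in response 2 is not specified in the paper; a minimal percentile-bootstrap sketch for the positional-concentration share, assuming item-level resampling (the function name and defaults are illustrative):

    import numpy as np

    def concentration_ci(responses, attractor, n_boot=10_000, alpha=0.05,
                         seed=0):
        # Percentile bootstrap CI for the share of responses landing on the
        # attractor position. `responses` is the vector of chosen option
        # indices over the item set; items are resampled with replacement.
        rng = np.random.default_rng(seed)
        responses = np.asarray(responses)
        shares = np.empty(n_boot)
        for b in range(n_boot):
            sample = rng.choice(responses, size=responses.size, replace=True)
            shares[b] = np.mean(sample == attractor)
        lo, hi = np.quantile(shares, [alpha / 2, 1 - alpha / 2])
        return lo, hi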
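And a minimal version of the simulation promised in response 3, under assumed specifics (2,000 items, ten options, uniformly placed correct answers, a fixed attractor position): a position-locked responder's correctness depends only on where the correct answer happens to sit, so its difficulty-accuracy correlation should be near zero by construction.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    n_items, n_options, attractor = 2000, 10, 0

    # Correct answers placed uniformly at random; difficulty drawn at random.
    correct_pos = rng.integers(0, n_options, n_items)
    difficulty = rng.uniform(0, 1, n_items)

    # A purely position-locked responder always picks the attractor, so it is
    # correct exactly when the correct answer lands there.
    correct = (correct_pos == attractor).astype(float)

    r, p = pearsonr(difficulty, correct)
    print(f"difficulty-accuracy correlation: r={r:+.3f}, p={p:.3f}")
    # Expected: r near 0 and non-significant, i.e. pure positional bias
    # without content sensitivity produces zero correlation.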

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with external baselines

full rationale

The paper reports experimental results from administering a six-condition adversarial instruction gradient to two LLMs on 2,000 MMLU-Pro items, using response-position entropy and difficulty-accuracy correlation as independent screening criteria. No equations, fitted parameters, derivations, or self-citations are load-bearing; the central observations (regimes of collapse, 99.9%/87.4% positional concentration under the two-step condition, and replication across models/domains) are direct data outputs. The null-prompt baseline is an external reference, not a self-definition. The analysis is grounded in the experimental data, with no claims that reduce by construction to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on two measurement assumptions: that response-position entropy detects shortcut use and that difficulty-accuracy correlation detects content engagement. No free parameters are introduced. No new entities are postulated.

axioms (1)
  • domain assumption Response-position entropy and difficulty-accuracy correlation jointly and partially independently characterize content engagement versus positional shortcuts.
    Invoked to interpret the three regimes and the 50% concordance result.

pith-pipeline@v0.9.0 · 5552 in / 1280 out tokens · 31925 ms · 2026-05-07T09:03:13.247966+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 10 canonical work pages · 4 internal anchors

  [1] Baxi, A. (2025). Separating constraint compliance from semantic accuracy. arXiv preprint arXiv:2512.17920.

  [2] Ben-Porath, Y. S. and Tellegen, A. (2020). Minnesota Multiphasic Personality Inventory-3 (MMPI-3): Manual for administration, scoring, and interpretation. University of Minnesota Press.

  [3] Cacioli, J.-P. (2026a). Below-chance blindness: Positional collapse under sandbagging instruction in small LLMs. arXiv preprint arXiv:2604.25249.

  [4] Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.

  [5] Li, S., Phuong, M., and Siegel, M. (2025). LLMs can covertly sandbag on capability evaluations against chain-of-thought monitoring. arXiv preprint arXiv:2508.00943.

  [6] Lim, D. et al. (2025). The atomic instruction gap. arXiv preprint arXiv:2510.17388.

  [7] Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., and Hobbhahn, M. (2024). Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984.

  [8] Morey, L. C. (2007). Personality Assessment Inventory (PAI): Professional manual (2nd ed.). Psychological Assessment Resources.

  [9] Nguyen, T. et al. (2025). Probing for sandbagging: Detecting strategic underperformance in language model activations. arXiv preprint.

  [10] Pezeshkpour, P. and Hruschka, E. (2024). Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.

  [11] Taylor, J. et al. (2025). Black-box detection of language model sandbagging. arXiv preprint arXiv:2502.00414.
       van der Weij, W. et al. (2025). AI sandbagging: Language models can strategically underperform on evaluations. In ICLR. arXiv:2406.07358.

  [12] Wang, Y. et al. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.

  [13] Yang, J. et al. (2025). What prompts don't say: Understanding and managing underspecification in LLM prompts. arXiv preprint arXiv:2505.13360.

  [14] Zheng, C. et al. (2024). Large language models are not robust multiple choice selectors. In ICLR. arXiv:2309.03882.