Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation
Pith reviewed 2026-05-09 17:20 UTC · model grok-4.3
The pith
LLMs encode predictive signals of the correct answer position in the hidden states of question stems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hidden representations in the question stem encode predictive signals of the correct answer position, suggesting that answer position may be implicitly planned during generation. Activation steering applied to these representations can partially control positional preferences and substantially shift answer position distributions.
What carries the argument
Hidden representations in the question stem that encode predictive signals of answer position, which can be read out by probes and altered via activation steering.
Load-bearing premise
Signals detected in the hidden states of the question stem reflect the model's planning of answer position rather than mere correlations from training data or surface features.
What would settle it
An experiment in which the predictive accuracy of stem probes falls to chance after controlling for question length, vocabulary, or other surface statistics, or in which activation steering produces no reliable change in final answer position distributions.
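The surface-statistics control can be sketched as a probe comparison: train one classifier on stem hidden states and another on surface features alone, and check whether the latter approaches chance. Everything below is synthetic and illustrative; the arrays and the planted signal are assumptions, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 400, 64

# Synthetic stand-ins: correct-answer positions (0-3) and stem hidden states
# with a planted position signal in the first four dimensions.
positions = rng.integers(0, 4, size=n)
hidden = rng.normal(size=(n, d))
hidden[np.arange(n), positions] += 2.0

# Surface-feature baseline (e.g., stem length, type-token ratio); here
# independent of position by construction.
surface = rng.normal(size=(n, 3))

probe_acc = cross_val_score(
    LogisticRegression(max_iter=1000), hidden, positions, cv=5).mean()
surface_acc = cross_val_score(
    LogisticRegression(max_iter=1000), surface, positions, cv=5).mean()

# A probe that beats the surface baseline by a wide margin indicates the
# positional signal is not reducible to these surface statistics.
print(f"probe={probe_acc:.2f}  surface={surface_acc:.2f}  chance=0.25")
```

If probe accuracy collapses to the surface baseline once real surface features are controlled, the planning interpretation loses its support.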
Original abstract
Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs), where correct answers should ideally be uniformly distributed across options. However, we observe that LLMs exhibit systematic position biases during generation. Through extensive experiments with 10 LLMs and 5 vision-language models (VLMs) on three MCQ generation tasks, we show that these biases are structured, with similar patterns emerging within model families. To investigate the underlying mechanisms, we conduct probing experiments and find that hidden representations in the question stem encode predictive signals of the correct answer position, suggesting that answer position may be implicitly planned during generation. Building on this insight, we apply activation steering to manipulate internal representations and influence answer position. Our results show that steering can partially control positional preferences and substantially shift answer position distributions. Our findings provide a practical framework for studying implicit positional planning in LLMs and highlight the importance of controllable generation for reliable MCQ construction and evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit systematic position biases when generating MCQs across 10 LLMs and 5 VLMs on three tasks, that hidden representations in the question stem encode predictive signals of answer position (suggesting implicit planning during generation), and that activation steering on these representations can partially control and shift positional preferences. The work combines observational bias analysis, linear probing, and causal-style interventions via steering.
Significance. If the results hold, the work is significant for providing an empirical framework to study and mitigate positional biases in LLM-generated MCQs, which has direct implications for reliable evaluation and data construction. Strengths include the scale of model coverage (10 LLMs + 5 VLMs), the use of activation steering as an intervention (a positive for moving beyond pure correlation), and the observation of family-level patterns in biases. These elements make the findings potentially actionable for controllable generation.
major comments (3)
- [Probing experiments] The central claim that stem hidden states encode signals of 'implicit planning' is load-bearing but rests on correlational linear probes. The observed signal is equally consistent with learned training-data associations between stem patterns and position rather than a dedicated planning mechanism; the manuscript should add controls (e.g., ablation of surface features or content-matched baselines) or stronger causal tests to distinguish these alternatives.
- [Activation steering] While steering shifts answer-position distributions, the intervention does not isolate a planning-specific mechanism from co-varying content or surface features. Additional ablations (e.g., steering along unrelated dimensions or random vectors) are needed to support the interpretation that the effect targets implicit positional planning.
- [Methods] The abstract and results describe extensive experiments but omit sample sizes per task/model, the statistical tests used to establish 'systematic' biases, and explicit controls for confounding factors. These omissions make it difficult to evaluate the robustness of the structured-bias and predictive-signal claims.
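The random-vector ablation requested above can be sketched in a few lines: perturb a hidden state along a probe-derived direction and along a norm-matched random direction, then compare the displacement along the probe axis. This is a generic activation-addition sketch with made-up vectors, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical hidden-state width

# Hypothetical probe-derived direction for "correct answer at position A",
# normalized to unit length.
steer = rng.normal(size=d)
steer /= np.linalg.norm(steer)

def apply_steering(hidden_state, direction, alpha):
    """Add a scaled steering vector to a layer activation
    (generic activation addition; not the paper's exact procedure)."""
    return hidden_state + alpha * direction

h = rng.normal(size=d)
h_steered = apply_steering(h, steer, alpha=4.0)

# Matched-norm random control: same perturbation magnitude, unrelated direction.
rand = rng.normal(size=d)
rand /= np.linalg.norm(rand)
h_control = apply_steering(h, rand, alpha=4.0)

# Displacement along the probe direction: exactly alpha for the steered state,
# near zero in expectation for the random control.
shift_probe = float(steer @ (h_steered - h))
shift_ctrl = float(abs(steer @ (h_control - h)))
print(f"probe-direction shift: {shift_probe:.2f}, random control: {shift_ctrl:.2f}")
```

If the matched-norm random control moved answer positions as much as the probe-derived direction, the effect could not be attributed to a position-specific representation.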
minor comments (2)
- [Abstract] The abstract and introduction could more precisely separate the observational bias findings from the stronger 'implicit planning' interpretation to avoid overstatement.
- [Figures/Tables] Figure and table captions should include exact definitions of the position-bias metric and probe accuracy computation for reproducibility.
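One plausible instantiation of such a position-bias metric is the total-variation distance between the empirical answer-position distribution and the uniform distribution; the paper's exact definition may differ.

```python
import numpy as np

def position_bias(counts):
    """Total-variation distance between the empirical answer-position
    distribution and uniform: 0 means unbiased; it approaches (k-1)/k
    as one of the k options dominates."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 0.5 * np.abs(p - 1.0 / len(p)).sum()

print(position_bias([25, 25, 25, 25]))  # 0.0 (perfectly uniform)
print(position_bias([70, 10, 10, 10]))  # ~0.45 (strong preference for one option)
```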
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing our responses and indicating planned revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: Probing experiments: the central claim that stem hidden states encode signals of 'implicit planning' is load-bearing but rests on correlational linear probes. This is consistent with learned training-data associations between stem patterns and position rather than a dedicated planning mechanism; the manuscript should add controls (e.g., ablation of surface features or content-matched baselines) or stronger causal tests to distinguish these alternatives.
Authors: We agree that linear probes yield correlational evidence and cannot alone establish a dedicated planning mechanism over learned associations. The probes operate on stem representations before option generation, and predictive signals are consistent across model families and tasks, which reduces the likelihood of purely superficial confounds. Nevertheless, we will add the suggested controls in revision: content-matched baselines (pairing identical stems with varied positions) and surface-feature ablations (e.g., lexical shuffling or masking). We will also report probe performance on these controls to better isolate positional signals. revision: yes
-
Referee: Activation steering section: while steering shifts answer-position distributions, the intervention does not isolate a planning-specific mechanism from co-varying content or surface features. Additional ablations (e.g., steering on unrelated dimensions or random vectors) are needed to support the interpretation that the effect targets implicit positional planning.
Authors: We acknowledge that the current steering results demonstrate a causal influence on position distributions but do not fully rule out effects from co-varying features. The steered directions were derived from position-predictive probe weights, providing some specificity. To address this, we will include additional ablations: steering with random vectors of matched norm and with directions from unrelated dimensions (e.g., unrelated task probes). These will be reported to quantify the degree of specificity to positional planning. revision: yes
-
Referee: Methods and experimental details: the abstract and results describe extensive experiments but omit sample sizes per task/model, the statistical tests used to establish 'systematic' biases, and explicit controls for confounding factors. These omissions undermine evaluation of the robustness of the structured-bias and predictive-signal claims.
Authors: We appreciate this observation. The revised manuscript will explicitly report sample sizes for every task-model combination, detail the statistical tests (chi-squared tests for uniformity of position distributions and permutation tests for probe significance), and describe controls already used (prompt randomization, content balancing across positions, and multiple prompt templates). These additions will improve transparency and allow readers to assess robustness directly. revision: yes
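The chi-squared uniformity test the authors name can be sketched as follows, with made-up counts standing in for one model's answer-position tallies.

```python
import numpy as np
from scipy.stats import chisquare

# Made-up tallies of correct-answer positions (A-D) across 400 generated MCQs.
observed = np.array([160, 90, 80, 70])

# Goodness-of-fit against the uniform expectation (100 per option).
stat, p = chisquare(observed)
print(f"chi2={stat:.1f}, p={p:.2g}")  # chi2=50.0; p is far below 0.05

# A small p-value rejects uniformity, i.e. evidences a systematic bias.
assert p < 0.05
```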
Circularity Check
No circularity: empirical probing and steering rest on independent measurements
Full rationale
The paper reports observational position-bias statistics across models, trains linear probes on stem hidden states to detect position-predictive information (standard supervised probe evaluation on held-out data), and performs activation steering interventions that causally alter output distributions. None of these steps reduces a claimed result to a fitted parameter or a self-referential definition. The interpretive phrase 'implicitly planned' is presented as a hypothesis suggested by the probe and steering results, not as a derived equality. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. Each claim is therefore checked against measurements independent of the claim itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hidden states in transformer layers encode semantically meaningful information about future generation choices.
Reference graph
Works this paper leans on
- [1] Loula, João; LeBrun, Benjamin; Du, Li; Lipkin, Ben; Pasti, Clemente; Grand, Gabriel; Liu, Tianyu; Emara, Yahya; Freedman, Marjorie; Eisner, Jason; Cotterell, Ryan; Mansinghka, Vikash; Lew, Alexander K.; Vieira, Tim; O'Donnell, Timothy J. Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo.
- [2] Sun, Hao; Peng, Huailiang; Dai, Qiong; Bai, Xu; Cao, Yanan. LayerNavigator: Finding Promising Intervention Layers for Efficient Activation Steering in Large Language Models.
- [3] Wei, Sheng-Lun; Wu, Cheng-Kuang; Huang, Hen-Hsen; Chen, Hsin-Hsi. Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models. arXiv:2406.03009.
- [4] Guda, Blessed; Francis, Lawrence; Ashungafac, Gabrial Zencha; Joe-Wong, Carlee; Busogi, Moise. Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach. arXiv:2511.21709.
- [5] Choi, Hyeong Kyu; Xu, Weijie; Xue, Chi; Eckman, Stephanie; Reddy, Chandan K. Mitigating Selection Bias with Node Pruning and Auxiliary Options. arXiv:2409.18857.
- [6] Li, Kenneth; Patel, Oam; Viégas, Fernanda; Pfister, Hanspeter; Wattenberg, Martin. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv:2306.03341.
- [7] Rimsky, Nina; Gabrieli, Nick; Schulz, Julian; Tong, Meg; Hubinger, Evan; Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [8] Han, Pengrui; Xu, Xueqiang; Xuan, Keyang; Song, Peiyang; Ouyang, Siru; Tian, Runchu; Jiang, Yuqing; Qian, Cheng; Jiang, Pengcheng; Sun, Jiashuo; Cui, Junxia; Zhong, Ming; Liu, Ge; Han, Jiawei; You, Jiaxuan. Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs.
- [9] Stolfo, Alessandro; Balachandran, Vidhisha; Yousefi, Safoora; Horvitz, Eric; Nushi, Besmira. Improving Instruction-Following in Language Models through Activation Steering. arXiv:2410.12877.
- [10] Subramani, Nishant; Suresh, Nivedita; Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of the Association for Computational Linguistics: ACL 2022.
- [11] Steering Language Models With Activation Engineering. arXiv, August 2023.
- [12] Men, Tianyi; Cao, Pengfei; Jin, Zhuoran; Chen, Yubo; Liu, Kang; Zhao, Jun. Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- [13] Pochinkov, Nicholas; Benoit, Angelo; Agarwal, Lovkush; Majid, Zainab Ali; Ter-Minassian, Lucile. Extracting Paragraphs from LLM Token Activations. arXiv:2409.06328.
- [14] Li, Wangyue; Li, Liangzhi; Xiang, Tong; Liu, Xiao; Deng, Wei; Garcia, Noa. Can Multiple-Choice Questions Really Be Useful in Detecting the Abilities of LLMs? arXiv:2403.17752.
- [15] Dominguez-Olmedo, Ricardo; Hardt, Moritz; Mendler-Dünner, Celestine. Questioning the Survey Responses of Large Language Models. arXiv:2306.07951.
- [16] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P.; Zhang, Hao; Gonzalez, Joseph E.; Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.
- [17] Wu, Wilson; Morris, John X.; Levine, Lionel. Do Language Models Plan Ahead for Future Tokens? arXiv:2404.00859.
- [18] Dong, Zhichen; Zhou, Zhanhui; Liu, Zhixuan; Yang, Chao; Lu, Chaochao. Emergent Response Planning in LLMs. arXiv:2502.06258.
- [19] Urailertprasert, Norawit; Limkonchotiwat, Peerat; Suwajanakorn, Supasorn; Nutanong, Sarana. SEA-VQA: Southeast Asian Cultural Context Dataset for Visual Question Answering. Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024.
- [20] Pezeshkpour, Pouya; Hruschka, Estevam. Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions. Findings of the Association for Computational Linguistics: NAACL 2024.
- [21] Mucciaccia, Sérgio Silva; Meireles Paixão, Thiago; Wall Mutz, Filipe; Santos Badue, Claudine; Ferreira de Souza, Alberto; Oliveira-Santos, Thiago. Automatic Multiple-Choice Question Generation and Evaluation Systems Based on LLM: A Study Case With University Resolutions. Proceedings of the 31st International Conference on Computational Linguistics, 2025.
- [22] Kalpakchi, Dmytro; Boye, Johan. Quasi: A Synthetic Question-Answering Dataset in Swedish Using GPT-3 and Zero-Shot Learning. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023.
- [23] Loginova, Olga; Bezrukov, Oleksandr; Shekhar, Ravi; Kravets, Alexey. Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [24] Maar, Jim; Paperno, Denis; McDougall, Callum Stuart; Nanda, Neel. What's the Plan? Metrics for Implicit Planning in LLMs and Their Application to Rhyme Generation and Question Answering. arXiv:2601.20164.
- [25] Zheng, Chujie; Zhou, Hao; Meng, Fandong; Zhou, Jie; Huang, Minlie. Large Language Models Are Not Robust Multiple Choice Selectors. arXiv:2309.03882.
- [26] Reif, Yuval; Schwartz, Roy. Beyond Performance: Quantifying and Mitigating Label Bias in LLMs. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.