Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling
Pith reviewed 2026-05-22 14:16 UTC · model grok-4.3
The pith
Prepending a natural-language prefix like 'The correct option is:' makes first-token probability more reliable for LLM multiple-choice answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By prepending a structured natural-language prefix such as 'The correct option is:' to the model's output, the first-token probability method can be steered to select a valid answer option rather than suffering from misalignment or misinterpretation, yielding higher accuracy, better calibration, and more consistent outputs while remaining computationally lighter than open-ended generation approaches.
What carries the argument
The prefilling attack: a fixed natural-language prefix prepended to the model output that forces the first generated token to function as an explicit answer choice.
If this is right
- Accuracy rises substantially on multiple-choice benchmarks for a wide range of LLMs.
- Probability estimates become better calibrated to actual correctness.
- Answer selections grow more consistent across repeated runs.
- Performance reaches levels close to full open-ended generation plus classification while using far less computation.
Where Pith is reading between the lines
- The same prefix idea could be tested on other constrained-output tasks where the initial token must carry specific meaning.
- Replacing the hand-written prefix with one discovered through search might yield further gains on particular model families.
- The approach may reduce reliance on post-hoc answer extraction in automated evaluation pipelines.
Load-bearing premise
A fixed natural-language prefix will reliably steer the model to treat its first token as an explicit answer choice instead of generic preamble text without creating new misalignments.
What would settle it
Measuring first-token accuracy on a held-out MCQA benchmark after adding the prefix and finding no gain or a drop relative to the no-prefix baseline on models or domains outside those tested.
Figures
read the original abstract
Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes repurposing the 'prefilling attack'—prepending a fixed natural-language prefix such as 'The correct option is:' to the model output—as a way to improve first-token probability (FTP) evaluation on multiple-choice question answering (MCQA) tasks. It claims this steers LLMs to emit a valid answer choice as the very first token, thereby reducing misalignment and misinterpretation, and reports substantial gains in accuracy, calibration, and output consistency across LLMs and benchmarks while remaining more efficient than full open-ended generation.
Significance. If the empirical results hold under rigorous controls, the work offers a simple, parameter-free, and computationally cheap intervention that could meaningfully improve the reliability of symbolic FTP-based MCQA evaluation without retraining or external classifiers. The reuse of a technique from AI safety for evaluation purposes is a modest but practical contribution to benchmarking methodology.
major comments (2)
- [§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).
- [§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.
minor comments (2)
- [§2] The notation for first-token probability (FTP) is introduced in the abstract but would benefit from an explicit equation or pseudocode definition in §2 or §3 for readers unfamiliar with the baseline.
- Figure captions and table headers should explicitly state the exact prefix string used in each condition to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and describe the revisions made to improve the rigor and clarity of the work.
read point-by-point responses
-
Referee: [§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).
Authors: We agree that systematic robustness tests strengthen the central claim. In the revised manuscript we have expanded Section 3 with a new robustness subsection that evaluates the prefilling prefix across additional model families (including Llama-3, Mistral, and Gemma variants), models with differing post-training regimes (base, instruction-tuned, and those emphasizing explanatory or chain-of-thought outputs), and alternative option-labeling schemes (letters, numbers, and free-form). These experiments confirm that the prefix reliably produces a valid first token in the large majority of cases while also documenting the remaining edge cases where explanatory continuations still occur. revision: yes
-
Referee: [§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.
Authors: We acknowledge that these reporting details are necessary for verification and generalization. The revised manuscript now contains an explicit table listing each benchmark together with its size and the number of examples evaluated. We have added statistical significance tests (paired t-tests), error bars derived from multiple random seeds, and a prompt-sensitivity ablation that varies the wording of the prefilling prefix. These additions allow readers to assess the reliability and scope of the reported gains. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is an empirical study proposing the prefilling attack as a simple natural-language prefix to improve first-token probability evaluation on MCQA tasks. It reports accuracy, calibration, and consistency gains via direct experiments across models and benchmarks, without any mathematical derivation chain, equations, fitted parameters, or predictions that reduce to inputs by construction. The method is defined explicitly as a repurposed technique from AI safety literature (no author-overlapping self-citation load-bearing the central claim), and results are presented as comparative measurements rather than self-referential outputs. This is a standard empirical contribution with independent content against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs respond coherently to natural-language prefixes in their output stream
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H.: MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. arXiv preprint arXiv:1905.13319 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[3]
Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking Leading Safety- Aligned LLMs with Simple Adaptive Attacks. In: ICLR (2025)
work page 2025
-
[4]
Balepur, N., Rudinger, R., Boyd-Graber, J.L.: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. In: ACL (2025)
work page 2025
-
[5]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan- tan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language Models are Few-Shot Learners. In: NeurIPS (2020)
work page 2020
-
[6]
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A Survey on Evaluation of Large Language Models. ACM TIST (2024)
work page 2024
-
[7]
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: PaLM: Scaling Language Modeling with Pathways. JMLR (2023) 14 S. Cappelletti et al
work page 2023
-
[8]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[9]
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Hwang, Maxwell Forbes, and Yejin Choi
Emelin, D., Bras, R.L., Hwang, J.D., Forbes, M., Choi, Y.: Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences. arXiv preprint arXiv:2012.15738 (2020)
-
[11]
Glenn, W.B., et al.: Verification of forecasts expressed in terms of probability. Monthly Weather Review (1950)
work page 1950
-
[12]
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning (2009)
work page 2009
-
[13]
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring Massive Multitask Language Understanding. In: ICLR (2021)
work page 2021
-
[14]
Holtzman, A., West, P., Zettlemoyer, L., Choi, Y.: Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. In: EMNLP (2021)
work page 2021
-
[15]
Huang, J., Chang, K.C.C.: Towards Reasoning in Large Language Models: A Survey. In: ACL Findings (2023)
work page 2023
-
[16]
Kamalloo, E., Dziri, N., Clarke, C.L., Rafiei, D.: Evaluating Open-Domain Question Answering in the Era of Large Language Models. In: ACL (2023)
work page 2023
-
[17]
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large Language Models are Zero-Shot Reasoners. In: NeurIPS (2022)
work page 2022
-
[18]
Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: RACE: Large-scale ReAding Compre- hension Dataset From Examinations. In: EMNLP (2017)
work page 2017
- [19]
-
[20]
Li, W., Li, L., Xiang, T., Liu, X., Deng, W., Garcia, N.: Can multiple-choice questions really be useful in detecting the abilities of LLMs? In: LREC-COLING (2024)
work page 2024
-
[21]
Holistic Evaluation of Language Models
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., Zhang, Y.: LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. In: IJCAI (2020)
work page 2020
-
[23]
arXiv preprint arXiv:2402.02392 (2024)
Liu, O., Fu, D., Yogatama, D., Neiswanger, W.: DeLLMa: Decision Making Under Uncertainty with Large Language Models. arXiv preprint arXiv:2402.02392 (2024)
-
[24]
arXiv preprint arXiv:2504.04141 (2025)
Lyu, Y., Ren, S., Feng, Y., Wang, Z., Chen, Z., Ren, Z., de Rijke, M.: Cognitive Debi- asing Large Language Models for Decision-Making. arXiv preprint arXiv:2504.04141 (2025)
-
[25]
Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In: EMNLP (2018)
work page 2018
-
[26]
Min, S., Lewis, M., Hajishirzi, H., Zettlemoyer, L.: Rethinking the Role of Demon- strations: What Makes In-Context Learning Work? In: EMNLP (2022)
work page 2022
-
[27]
arXiv preprint arXiv:2503.14996 (2025)
Molfese, F.M., Moroni, L., Gioffré, L., Scirè, A., Conia, S., Navigli, R.: Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering. arXiv preprint arXiv:2503.14996 (2025)
-
[28]
Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. In: NAACL (2016)
work page 2016
-
[29]
In: AAAI (2015) LLM First-Token Predictions via Output Prefilling 15
Naeini, M.P., Cooper, G., Hauskrecht, M.: Obtaining Well Calibrated Probabilities Using Bayesian Binning. In: AAAI (2015) LLM First-Token Predictions via Output Prefilling 15
work page 2015
-
[30]
A Comprehensive Overview of Large Language Models
Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., Mian, A.: A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring Calibration in Deep Learning. In: CVPR Workshops (2019)
work page 2019
-
[32]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)
work page 2022
-
[33]
Reasoning with large language models, a survey, 2024
Plaat, A., Wong, A., Verberne, S., Broekens, J., van Stein, N., Back, T.: Reasoning with Large Language Models, a Survey. arXiv preprint arXiv:2407.11511 (2024)
-
[34]
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N.A., Lewis, M.: Measuring and Narrowing the Compositionality Gap in Language Models. In: EMNLP (2023)
work page 2023
-
[35]
Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., Hashimoto, T.: Whose Opinions Do Language Models Reflect? In: ICML (2023)
work page 2023
-
[36]
Sap, M., Rashkin, H., Chen, D., LeBras, R., Choi, Y.: SocialIQA: Commonsense Reasoning about Social Interactions. In: EMNLP (2019)
work page 2019
-
[37]
Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: NAACL (2019)
work page 2019
-
[38]
Tang, L.: A Trivial Jailbreak Against Llama 3.https://github.com/haizelabs/ llama3-jailbreak(2024)
work page 2024
-
[39]
Wang, X., Hu, C., Ma, B., Röttger, P., Plank, B.: Look at the Text: Instruction- Tuned Language Models are More Robust Multiple Choice Selectors than You Think. In: COLM (2024)
work page 2024
-
[40]
Wang, X., Ma, B., Hu, C., Weber-Genzel, L., Röttger, P., Kreuter, F., Hovy, D., Plank, B.: “My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models. In: ACL (2024)
work page 2024
-
[41]
Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How Does LLM Safety Training Fail? In: NeurIPS (2023)
work page 2023
-
[42]
Emergent Abilities of Large Language Models
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.: Emergent Abilities of Large Language Models. arXiv preprint arXiv:2206.07682 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., et al.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In: NeurIPS (2022)
work page 2022
-
[44]
Crowdsourcing Multiple Choice Science Questions
Welbl, J., Liu, N.F., Gardner, M.: Crowdsourcing Multiple Choice Science Questions. arXiv preprint arXiv:1707.06209 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[45]
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., et al.: Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In: NeurIPS (2023)
work page 2023
-
[46]
Yu, Q., Zheng, Z., Song, S., Li, Z., Xiong, F., et al.: xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation. In: ICLR (2025)
work page 2025
-
[47]
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a Machine Really Finish Your Sentence? In: ACL (2019)
work page 2019
-
[48]
OPT: Open Pre-trained Transformer Language Models
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Lin, X.V., et al.: OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate Before Use: Improving Few-Shot Performance of Language Models. In: ICML (2021)
work page 2021
-
[50]
Zhou, B., Khashabi, D., Ning, Q., Roth, D.: "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding. In: EMNLP (2019)
work page 2019
-
[51]
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043 (2023) Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling Supplementary Material Silvia Cappelletti1⋆ , Tobia Poppi1,2⋆ , Samuele P...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.