
arxiv: 2505.15323 · v2 · submitted 2025-05-21 · 💻 cs.CL

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

Pith reviewed 2026-05-22 14:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · multiple-choice question answering · first-token probability · output prefilling · model steering · evaluation reliability

The pith

Prepending a natural-language prefix like 'The correct option is:' makes first-token probability more reliable for LLM multiple-choice answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models evaluated on multiple-choice questions via first-token probability often fail because they assign high likelihood to unrelated tokens or treat a valid token as part of a generic preamble. The paper shows that prepending a short structured prefix to the output steers the model toward emitting a clean answer choice as the first token. This change improves accuracy, calibration of the probabilities, and consistency of results across many models and benchmarks. The method requires no parameter updates and runs faster than full answer generation followed by external classification.

Core claim

By prepending a structured natural-language prefix such as 'The correct option is:' to the model's output, the first-token probability method can be steered to select a valid answer option rather than suffering from misalignment or misinterpretation, yielding higher accuracy, better calibration, and more consistent outputs while remaining computationally lighter than open-ended generation approaches.

What carries the argument

The prefilling attack: a fixed natural-language prefix prepended to the model output that forces the first generated token to function as an explicit answer choice.
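
The mechanism can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `User:`/`Assistant:` prompt layout, the helper names `build_prompt` and `first_token_choice`, and the toy logits are all assumptions made for the example. With a real model, the logits would come from a single forward pass over the prompt.

```python
import math

PREFIX = "The correct option is:"  # example prefix quoted in the paper


def build_prompt(question, options, prefill=None):
    """Format an MCQA prompt; optionally start the model's turn with a prefix.

    The "User:/Assistant:" layout is an assumption for illustration only.
    """
    lines = [f"User: {question}"]
    lines += [f"{label}. {text}" for label, text in options.items()]
    lines.append("Assistant:" + (f" {prefill}" if prefill else ""))
    return "\n".join(lines)


def softmax(logits):
    """Turn a token -> logit mapping into a token -> probability mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def first_token_choice(next_token_logits, labels):
    """Return (chosen label, probability mass on the labels, top token overall)."""
    probs = softmax(next_token_logits)
    chosen = max(labels, key=lambda lab: probs.get(lab, 0.0))
    label_mass = sum(probs.get(lab, 0.0) for lab in labels)
    top_token = max(probs, key=probs.get)
    return chosen, label_mass, top_token


options = {"A": "Copper", "B": "Rubber", "C": "Glass"}
prompt = build_prompt("Which material conducts electricity?", options, PREFIX)

# Toy next-token logits (made up, not model output).
# Without a prefix, most mass sits on preamble tokens such as "Sure".
no_prefix_logits = {"Sure": 3.0, "The": 2.5, "A": 1.0, "B": 0.2, "C": 0.1}
# With the prefix, the next token is expected to be an option label.
with_prefix_logits = {"A": 4.0, "B": 1.0, "C": 0.5, "Sure": -2.0, "The": -1.0}

print(first_token_choice(no_prefix_logits, list(options)))
print(first_token_choice(with_prefix_logits, list(options)))
```

In the toy no-prefix case the argmax over labels still returns an option, but only a small share of the probability mass lies on valid labels, which is the misalignment failure the paper describes. The label-mass value is a convenient diagnostic for that situation; it is an addition of this sketch, not a quantity the paper is known to report.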

If this is right

  • Accuracy rises substantially on multiple-choice benchmarks for a wide range of LLMs.
  • Probability estimates become better calibrated to actual correctness.
  • Answer selections grow more consistent across repeated runs.
  • Performance reaches levels close to full open-ended generation plus classification while using far less computation.
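
To make the calibration point concrete, here is a minimal sketch of expected calibration error (ECE) with equal-width bins. ECE is one common calibration measure; whether it matches the exact metric used in the paper is an assumption, and the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: four answers given with 0.85 confidence, three correct.
example_ece = expected_calibration_error([0.85] * 4, [True, True, True, False])
print(round(example_ece, 3))
```

Here `confidences` would be the first-token probability of the selected option for each question, and `correct` whether that option matched the gold answer. A lower value means the probabilities track actual correctness more closely.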

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prefix idea could be tested on other constrained-output tasks where the initial token must carry specific meaning.
  • Replacing the hand-written prefix with one discovered through search might yield further gains on particular model families.
  • The approach may reduce reliance on post-hoc answer extraction in automated evaluation pipelines.

Load-bearing premise

A fixed natural-language prefix will reliably steer the model to treat its first token as an explicit answer choice instead of generic preamble text without creating new misalignments.

What would settle it

Measuring first-token accuracy on a held-out MCQA benchmark after adding the prefix and finding no gain or a drop relative to the no-prefix baseline on models or domains outside those tested.
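
A sketch of how such a check could be tabulated, assuming per-question predictions are already available for both conditions. The function name and the "fixed"/"broken" breakdown are illustrative choices for this example, not something the paper specifies.

```python
def paired_accuracy_report(gold, baseline_preds, prefill_preds):
    """Compare no-prefix and with-prefix predictions on the same questions."""
    n = len(gold)
    if n == 0 or not (n == len(baseline_preds) == len(prefill_preds)):
        raise ValueError("inputs must be non-empty and of equal length")
    base_hits = [pred == ans for pred, ans in zip(baseline_preds, gold)]
    pre_hits = [pred == ans for pred, ans in zip(prefill_preds, gold)]
    # Discordant pairs: questions where exactly one condition is correct.
    fixed = sum(1 for b, p in zip(base_hits, pre_hits) if p and not b)
    broken = sum(1 for b, p in zip(base_hits, pre_hits) if b and not p)
    return {
        "baseline_acc": sum(base_hits) / n,
        "prefill_acc": sum(pre_hits) / n,
        "delta": (sum(pre_hits) - sum(base_hits)) / n,
        "fixed": fixed,    # wrong without the prefix, right with it
        "broken": broken,  # right without the prefix, wrong with it
    }


# Hypothetical predictions on four held-out questions.
report = paired_accuracy_report(
    gold=["A", "B", "C", "D"],
    baseline_preds=["A", "C", "C", "A"],
    prefill_preds=["A", "B", "C", "A"],
)
print(report)
```

A `delta` at or below zero on held-out models or domains, or a `broken` count that rivals `fixed`, would be the kind of evidence the paragraph above describes.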

Figures

Figures reproduced from arXiv: 2505.15323 by Diego Garcia-Olano, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, Samuele Poppi, Silvia Cappelletti, Tobia Poppi, Zheng-Xin Yong.

Figure 1. We show that a simple output prefilling template, which directs an LLM’s …
Figure 2. Visual examples of our prefilling strategy. The prefilling template is added …
Figure 3. Comparison of model accuracy on OpenBookQA, Social IQa, and SciQ using standard FTP, FTP with prefilling, and open-ended generation with different classifiers. FTP with prefilling consistently outperforms standard FTP and often surpasses more expensive open-ended approaches. To further contextualize our results, we compare our approach against an open-ended generation setting, which we consider a gold-st…
Figure 4. Calibration curves comparing standard FTP and FTP with prefilling.
Figure 1. Accuracy comparison on all the benchmarks using standard FTP, FTP …
Figure 2. Accuracy comparison on all the benchmarks using standard FTP, FTP with …
Original abstract

Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes repurposing the 'prefilling attack'—prepending a fixed natural-language prefix such as 'The correct option is:' to the model output—as a way to improve first-token probability (FTP) evaluation on multiple-choice question answering (MCQA) tasks. It claims this steers LLMs to emit a valid answer choice as the very first token, thereby reducing misalignment and misinterpretation, and reports substantial gains in accuracy, calibration, and output consistency across LLMs and benchmarks while remaining more efficient than full open-ended generation.

Significance. If the empirical results hold under rigorous controls, the work offers a simple, parameter-free, and computationally cheap intervention that could meaningfully improve the reliability of symbolic FTP-based MCQA evaluation without retraining or external classifiers. The reuse of a technique from AI safety for evaluation purposes is a modest but practical contribution to benchmarking methodology.

major comments (2)
  1. [§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).
  2. [§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.
minor comments (2)
  1. [§2] The notation for first-token probability (FTP) is introduced in the abstract but would benefit from an explicit equation or pseudocode definition in §2 or §3 for readers unfamiliar with the baseline.
  2. Figure captions and table headers should explicitly state the exact prefix string used in each condition to allow direct replication.
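
For readers who want the baseline spelled out, the first minor comment can be met with a definition along the following lines. The notation is illustrative and not taken from the paper: $x$ is the prompt (question plus options), $\mathcal{O}$ the set of option labels, $t_o$ the first token of label $o$, $s$ the prefilled prefix, and $p_\theta$ the model's next-token distribution.

```latex
% Illustrative notation only; the symbols are assumptions, not the paper's.
% Standard FTP conditions on the prompt x; FTP with prefilling also
% conditions on the prefix s placed at the start of the model output.
\hat{o}_{\mathrm{FTP}} = \arg\max_{o \in \mathcal{O}} \; p_\theta\!\left(t_o \mid x\right),
\qquad
\hat{o}_{\mathrm{prefill}} = \arg\max_{o \in \mathcal{O}} \; p_\theta\!\left(t_o \mid x, s\right)
```

The only difference between the two estimators is the conditioning context, which is why the method needs no parameter updates and only one forward pass per question.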

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and describe the revisions made to improve the rigor and clarity of the work.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).

    Authors: We agree that systematic robustness tests strengthen the central claim. In the revised manuscript we have expanded Section 3 with a new robustness subsection that evaluates the prefilling prefix across additional model families (including Llama-3, Mistral, and Gemma variants), models with differing post-training regimes (base, instruction-tuned, and those emphasizing explanatory or chain-of-thought outputs), and alternative option-labeling schemes (letters, numbers, and free-form). These experiments confirm that the prefix reliably produces a valid first token in the large majority of cases while also documenting the remaining edge cases where explanatory continuations still occur. revision: yes

  2. Referee: [§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.

    Authors: We acknowledge that these reporting details are necessary for verification and generalization. The revised manuscript now contains an explicit table listing each benchmark together with its size and the number of examples evaluated. We have added statistical significance tests (paired t-tests), error bars derived from multiple random seeds, and a prompt-sensitivity ablation that varies the wording of the prefilling prefix. These additions allow readers to assess the reliability and scope of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper is an empirical study proposing the prefilling attack as a simple natural-language prefix to improve first-token probability evaluation on MCQA tasks. It reports accuracy, calibration, and consistency gains via direct experiments across models and benchmarks, without any mathematical derivation chain, equations, fitted parameters, or predictions that reduce to inputs by construction. The method is defined explicitly as a technique repurposed from the AI safety literature, and the central claim does not rest on a self-citation by overlapping authors. Results are presented as comparative measurements rather than self-referential outputs. This is a standard empirical contribution whose content is checked independently against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that a fixed prefix steers token probabilities without side effects. No free parameters, invented entities, or non-standard axioms are stated in the abstract.

axioms (1)
  • domain assumption: LLMs respond coherently to natural-language prefixes in their output stream
    Invoked when the prefilling attack is presented as a reliable steering method.

pith-pipeline@v0.9.0 · 5779 in / 1219 out tokens · 23885 ms · 2026-05-22T14:16:39.442991+00:00 · methodology

