Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

Diego Garcia-Olano; Lorenzo Baraldi; Marcella Cornia; Rita Cucchiara; Samuele Poppi; Silvia Cappelletti; Tobia Poppi; Zheng-Xin Yong

arXiv: 2505.15323 · v2 · submitted 2025-05-21 · cs.CL

Pith reviewed 2026-05-22 14:16 UTC · model grok-4.3

classification: cs.CL

keywords: large language models · multiple-choice question answering · first-token probability · output prefilling · model steering · evaluation reliability

The pith

Prepending a natural-language prefix like 'The correct option is:' makes first-token probability more reliable for LLM multiple-choice answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models evaluated on multiple-choice questions via first-token probability often fail because they assign high likelihood to unrelated tokens or treat a valid token as part of a generic preamble. The paper shows that prepending a short structured prefix to the output steers the model toward emitting a clean answer choice as the first token. This change improves accuracy, calibration of the probabilities, and consistency of results across many models and benchmarks. The method requires no parameter updates and runs faster than full answer generation followed by external classification.

Core claim

By prepending a structured natural-language prefix such as 'The correct option is:' to the model's output, the first-token probability method can be steered to select a valid answer option rather than suffering from misalignment or misinterpretation, yielding higher accuracy, better calibration, and more consistent outputs while remaining computationally lighter than open-ended generation approaches.
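To make the selection rule concrete, here is a minimal sketch of first-token probability scoring, assuming the model's next-token distribution is already available as a token-to-probability mapping. The function name `ftp_select`, the option labels, and the toy numbers are illustrative assumptions, not the authors' code or measurements.

```python
from typing import Dict, List, Optional, Tuple


def ftp_select(next_token_probs: Dict[str, float],
               options: List[str]) -> Tuple[Optional[str], float]:
    """Pick the option label with the highest first-token probability.

    Returns (label, valid_mass), where valid_mass is the total probability
    placed on option labels. A low valid_mass is a sign of misalignment:
    the model prefers to start with something other than an answer label.
    """
    scores = {opt: next_token_probs.get(opt, 0.0) for opt in options}
    valid_mass = sum(scores.values())
    if valid_mass == 0.0:
        return None, 0.0  # no option label received any probability
    best = max(scores, key=lambda opt: scores[opt])
    return best, valid_mass


# Toy distributions with made-up numbers, for illustration only.
# Without a prefix, the model mostly wants to open with a preamble.
no_prefix = {"The": 0.55, "Sure": 0.20, "A": 0.10, "B": 0.08, "C": 0.04, "D": 0.03}
# After a prefix such as "The correct option is:", mass moves to the labels.
with_prefix = {"B": 0.81, "A": 0.09, "C": 0.05, "D": 0.03, "The": 0.02}

options = ["A", "B", "C", "D"]
choice_plain, mass_plain = ftp_select(no_prefix, options)
choice_prefill, mass_prefill = ftp_select(with_prefix, options)
```

In this toy setup the unprefixed decision rests on roughly a quarter of the probability mass, while the prefixed decision rests on nearly all of it, which is the kind of shift the core claim describes.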

What carries the argument

The prefilling attack: a fixed natural-language prefix prepended to the model output that forces the first generated token to function as an explicit answer choice.
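A rough sketch of how such a prefix could be attached to the start of the model's turn is given below. The role markers `<user>` and `<assistant>` are hypothetical placeholders (real models each define their own chat template), and the helper names are assumptions made for illustration.

```python
from typing import Dict

PREFIX = "The correct option is:"  # example prefix quoted in the paper summary


def format_question(question: str, choices: Dict[str, str]) -> str:
    """Render the question followed by one labeled line per answer choice."""
    lines = [question] + [f"{label}. {text}" for label, text in choices.items()]
    return "\n".join(lines)


def build_prompt(question: str, choices: Dict[str, str], prefill: bool) -> str:
    """Build the text the model continues from.

    "<user>" and "<assistant>" are placeholder role markers (hypothetical),
    standing in for whatever chat template a given model expects.
    With prefill=True the assistant turn already starts with PREFIX, so the
    next token the model emits is expected to be the answer label itself.
    """
    prompt = f"<user>\n{format_question(question, choices)}\n<assistant>\n"
    if prefill:
        prompt += PREFIX
    return prompt


example = build_prompt(
    "Which gas do plants absorb during photosynthesis?",
    {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen"},
    prefill=True,
)
```

Whether the prefix should end with a trailing space depends on how the tokenizer represents option labels, so that detail is left open here.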

If this is right

Accuracy rises substantially on multiple-choice benchmarks for a wide range of LLMs.
Probability estimates become better calibrated to actual correctness.
Answer selections grow more consistent across repeated runs.
Performance reaches levels close to full open-ended generation plus classification while using far less computation.
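One common way to quantify the calibration point above is expected calibration error (ECE), which compares average confidence with observed accuracy inside confidence bins. The sketch below is a generic equal-width-bin ECE; the summary here does not say which calibration metric or binning the paper reports, so treat the function and its defaults as assumptions.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin by its confidence; 1.0 goes in the last bin.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
```

Here the confidence would be the probability of the selected option's first token; lower ECE means the probabilities track actual correctness more closely.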

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prefix idea could be tested on other constrained-output tasks where the initial token must carry specific meaning.
Replacing the hand-written prefix with one discovered through search might yield further gains on particular model families.
The approach may reduce reliance on post-hoc answer extraction in automated evaluation pipelines.

Load-bearing premise

A fixed natural-language prefix will reliably steer the model to treat its first token as an explicit answer choice instead of generic preamble text without creating new misalignments.

What would settle it

Measuring first-token accuracy on a held-out MCQA benchmark after adding the prefix and finding no gain or a drop relative to the no-prefix baseline on models or domains outside those tested.

Figures

Figures reproduced from arXiv: 2505.15323 by Diego Garcia-Olano, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, Samuele Poppi, Silvia Cappelletti, Tobia Poppi, Zheng-Xin Yong.

**Figure 2.** Visual examples of our prefilling strategy. The prefilling template is added…

**Figure 3.** Comparison of model accuracy on OpenBookQA, Social IQa, and SciQ using standard FTP, FTP with prefilling, and open-ended generation with different classifiers. FTP with prefilling consistently outperforms standard FTP and often surpasses more expensive open-ended approaches.

**Figure 4.** Calibration curves comparing standard FTP and FTP with prefilling.

**Figure 1.** Accuracy comparison on all the benchmarks using standard FTP, FTP…

**Figure 2.** Accuracy comparison on all the benchmarks using standard FTP, FTP with…

Original abstract

Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prefilling a simple prefix like 'The correct option is:' looks like a workable low-cost fix for first-token misalignment in MCQA, but the gains rest on an assumption about prefix reliability that needs more checks.

The main takeaway is that this paper takes the prefilling technique from AI safety and applies it to make first-token probability a steadier signal for multiple-choice evaluation in LLMs. By prepending a short natural-language phrase, the model is steered to treat its first output token as an explicit answer choice rather than part of a preamble or unrelated high-probability token. The reported results show gains in accuracy, calibration, and consistency over plain FTP, sometimes approaching the performance of full open-ended generation while staying much cheaper.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes repurposing the 'prefilling attack'—prepending a fixed natural-language prefix such as 'The correct option is:' to the model output—as a way to improve first-token probability (FTP) evaluation on multiple-choice question answering (MCQA) tasks. It claims this steers LLMs to emit a valid answer choice as the very first token, thereby reducing misalignment and misinterpretation, and reports substantial gains in accuracy, calibration, and output consistency across LLMs and benchmarks while remaining more efficient than full open-ended generation.

Significance. If the empirical results hold under rigorous controls, the work offers a simple, parameter-free, and computationally cheap intervention that could meaningfully improve the reliability of symbolic FTP-based MCQA evaluation without retraining or external classifiers. The reuse of a technique from AI safety for evaluation purposes is a modest but practical contribution to benchmarking methodology.

major comments (2)

[§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).
[§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.

minor comments (2)

[§2] The notation for first-token probability (FTP) is introduced in the abstract but would benefit from an explicit equation or pseudocode definition in §2 or §3 for readers unfamiliar with the baseline.
Figure captions and table headers should explicitly state the exact prefix string used in each condition to allow direct replication.
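As an illustration of the kind of explicit definition the first minor comment asks for, one possible formalization is sketched below. The notation is an assumption made here, not taken from the manuscript: $x$ is the prompt, $\mathcal{O}$ the set of option labels, $s$ the prefilled prefix, $t_1$ the first token generated after the conditioning context, and $p_\theta$ the model's next-token distribution.

```latex
% Standard first-token probability (FTP): choose the option label whose
% token is most likely as the first generated token given the prompt x.
\hat{y}_{\mathrm{FTP}} = \arg\max_{o \in \mathcal{O}} \; p_\theta\left(t_1 = o \mid x\right)

% FTP with prefilling: the same rule, with the prefix s appended to the
% conditioning context before the first token is scored.
\hat{y}_{\mathrm{prefill}} = \arg\max_{o \in \mathcal{O}} \; p_\theta\left(t_1 = o \mid x, s\right)
```

Under this reading the two conditions differ only in the conditioning context, which is why the method needs no parameter updates.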

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and describe the revisions made to improve the rigor and clarity of the work.

Referee: [§3] §3 (Method): The central claim that a single fixed prefix reliably forces the first generated token to be a valid option letter (rather than a preamble or unrelated token) is load-bearing for attributing all reported gains to the prefilling mechanism, yet the manuscript provides no systematic robustness tests across model families, post-training regimes that encourage explanatory continuations, or option-labeling conventions (A/B/C vs. 1/2/3 vs. free-form).

Authors: We agree that systematic robustness tests strengthen the central claim. In the revised manuscript we have expanded Section 3 with a new robustness subsection that evaluates the prefilling prefix across additional model families (including Llama-3, Mistral, and Gemma variants), models with differing post-training regimes (base, instruction-tuned, and those emphasizing explanatory or chain-of-thought outputs), and alternative option-labeling schemes (letters, numbers, and free-form). These experiments confirm that the prefix reliably produces a valid first token in the large majority of cases while also documenting the remaining edge cases where explanatory continuations still occur.

Revision: yes

Referee: [§5] §5 (Experiments and Results): The abstract and results claim 'substantial' improvements in accuracy, calibration, and consistency, but the manuscript supplies no dataset sizes, number of examples per benchmark, statistical significance tests, error bars, or controls for prompt sensitivity; without these, the headline empirical claims cannot be verified or generalized.

Authors: We acknowledge that these reporting details are necessary for verification and generalization. The revised manuscript now contains an explicit table listing each benchmark together with its size and the number of examples evaluated. We have added statistical significance tests (paired t-tests), error bars derived from multiple random seeds, and a prompt-sensitivity ablation that varies the wording of the prefilling prefix. These additions allow readers to assess the reliability and scope of the reported gains.

Revision: yes
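For readers unfamiliar with the paired t-test mentioned in this response, a minimal sketch of the test statistic follows. It assumes each pair holds the accuracy of the two conditions on the same seed or benchmark; that pairing scheme, the function name, and the toy numbers are assumptions for illustration, not details from the rebuttal.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Toy accuracies (made-up): FTP with prefilling vs. standard FTP, one pair per seed.
with_prefill = [0.82, 0.79, 0.85, 0.80]
standard_ftp = [0.70, 0.71, 0.72, 0.69]
t_value = paired_t_statistic(with_prefill, standard_ftp)
```

The statistic has `len(a) - 1` degrees of freedom; turning it into a p-value requires a Student t distribution, which is left out to keep the sketch dependency-free.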

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical study proposing the prefilling attack as a simple natural-language prefix to improve first-token probability evaluation on MCQA tasks. It reports accuracy, calibration, and consistency gains via direct experiments across models and benchmarks, without any mathematical derivation chain, equations, fitted parameters, or predictions that reduce to inputs by construction. The method is defined explicitly as a repurposed technique from AI safety literature (no author-overlapping self-citation load-bearing the central claim), and results are presented as comparative measurements rather than self-referential outputs. This is a standard empirical contribution with independent content against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that a fixed prefix steers token probabilities without side effects. No free parameters, invented entities, or non-standard axioms are stated in the abstract.

axioms (1)

Domain assumption: LLMs respond coherently to natural-language prefixes in their output stream.
Invoked when the prefilling attack is presented as a reliable steering method.


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H.: MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. arXiv preprint arXiv:1905.13319 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1905

[3] [3]

In: ICLR (2025)

Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking Leading Safety- Aligned LLMs with Simple Adaptive Attacks. In: ICLR (2025)

work page 2025

[4] [4]

In: ACL (2025)

Balepur, N., Rudinger, R., Boyd-Graber, J.L.: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above. In: ACL (2025)

work page 2025

[5] [5]

In: NeurIPS (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan- tan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language Models are Few-Shot Learners. In: NeurIPS (2020)

work page 2020

[6] [6]

ACM TIST (2024)

Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A Survey on Evaluation of Large Language Models. ACM TIST (2024)

work page 2024

[7] [7]

JMLR (2023) 14 S

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: PaLM: Scaling Language Modeling with Pathways. JMLR (2023) 14 S. Cappelletti et al

work page 2023

[8] [8]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Hwang, Maxwell Forbes, and Yejin Choi

Emelin, D., Bras, R.L., Hwang, J.D., Forbes, M., Choi, Y.: Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences. arXiv preprint arXiv:2012.15738 (2020)

work page arXiv 2012

[11] [11]

Monthly Weather Review (1950)

Glenn, W.B., et al.: Verification of forecasts expressed in terms of probability. Monthly Weather Review (1950)

work page 1950

[12] [12]

Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning (2009)

work page 2009

[13] [13]

In: ICLR (2021)

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring Massive Multitask Language Understanding. In: ICLR (2021)

work page 2021

[14] [14]

In: EMNLP (2021)

Holtzman, A., West, P., Zettlemoyer, L., Choi, Y.: Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. In: EMNLP (2021)

work page 2021

[15] [15]

In: ACL Findings (2023)

Huang, J., Chang, K.C.C.: Towards Reasoning in Large Language Models: A Survey. In: ACL Findings (2023)

work page 2023

[16] [16]

In: ACL (2023)

Kamalloo, E., Dziri, N., Clarke, C.L., Rafiei, D.: Evaluating Open-Domain Question Answering in the Era of Large Language Models. In: ACL (2023)

work page 2023

[17] [17]

In: NeurIPS (2022)

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large Language Models are Zero-Shot Reasoners. In: NeurIPS (2022)

work page 2022

[18] [18]

In: EMNLP (2017)

Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: RACE: Large-scale ReAding Compre- hension Dataset From Examinations. In: EMNLP (2017)

work page 2017

[19] [19]

Nature521(2015)

LeCun, Y., Bengio, Y., Hinton, G.: Deep Learning. Nature521(2015)

work page 2015

[20] [20]

Li, W., Li, L., Xiang, T., Liu, X., Deng, W., Garcia, N.: Can multiple-choice questions really be useful in detecting the abilities of LLMs? In: LREC-COLING (2024)

work page 2024

[21] [21]

Holistic Evaluation of Language Models

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

In: IJCAI (2020)

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., Zhang, Y.: LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. In: IJCAI (2020)

work page 2020

[23] [23]

arXiv preprint arXiv:2402.02392 (2024)

Liu, O., Fu, D., Yogatama, D., Neiswanger, W.: DeLLMa: Decision Making Under Uncertainty with Large Language Models. arXiv preprint arXiv:2402.02392 (2024)

work page arXiv 2024

[24] [24]

arXiv preprint arXiv:2504.04141 (2025)

Lyu, Y., Ren, S., Feng, Y., Wang, Z., Chen, Z., Ren, Z., de Rijke, M.: Cognitive Debi- asing Large Language Models for Decision-Making. arXiv preprint arXiv:2504.04141 (2025)

work page arXiv 2025

[25] [25]

In: EMNLP (2018)

Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In: EMNLP (2018)

work page 2018

[26] [26]

Min, S., Lewis, M., Hajishirzi, H., Zettlemoyer, L.: Rethinking the Role of Demon- strations: What Makes In-Context Learning Work? In: EMNLP (2022)

work page 2022

[27] [27]

arXiv preprint arXiv:2503.14996 (2025)

Molfese, F.M., Moroni, L., Gioffré, L., Scirè, A., Conia, S., Navigli, R.: Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering. arXiv preprint arXiv:2503.14996 (2025)

work page arXiv 2025

[28] [28]

In: NAACL (2016)

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. In: NAACL (2016)

work page 2016

[29] [29]

In: AAAI (2015) LLM First-Token Predictions via Output Prefilling 15

Naeini, M.P., Cooper, G., Hauskrecht, M.: Obtaining Well Calibrated Probabilities Using Bayesian Binning. In: AAAI (2015) LLM First-Token Predictions via Output Prefilling 15

work page 2015

[30] [30]

A Comprehensive Overview of Large Language Models

Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., Mian, A.: A Comprehensive Overview of Large Language Models. arXiv preprint arXiv:2307.06435 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

In: CVPR Workshops (2019)

Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring Calibration in Deep Learning. In: CVPR Workshops (2019)

work page 2019

[32] [32]

In: NeurIPS (2022)

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)

work page 2022

[33] [33]

Reasoning with large language models, a survey, 2024

Plaat, A., Wong, A., Verberne, S., Broekens, J., van Stein, N., Back, T.: Reasoning with Large Language Models, a Survey. arXiv preprint arXiv:2407.11511 (2024)

[34]

Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N.A., Lewis, M.: Measuring and Narrowing the Compositionality Gap in Language Models. In: EMNLP (2023)

[35]

Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., Hashimoto, T.: Whose Opinions Do Language Models Reflect? In: ICML (2023)

[36]

Sap, M., Rashkin, H., Chen, D., LeBras, R., Choi, Y.: SocialIQA: Commonsense Reasoning about Social Interactions. In: EMNLP (2019)

[37]

Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: NAACL (2019)

[38]

Tang, L.: A Trivial Jailbreak Against Llama 3. https://github.com/haizelabs/llama3-jailbreak (2024)

[39]

Wang, X., Hu, C., Ma, B., Röttger, P., Plank, B.: Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think. In: COLM (2024)

[40]

Wang, X., Ma, B., Hu, C., Weber-Genzel, L., Röttger, P., Kreuter, F., Hovy, D., Plank, B.: “My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models. In: ACL (2024)

[41]

Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How Does LLM Safety Training Fail? In: NeurIPS (2023)

[42]

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.: Emergent Abilities of Large Language Models. arXiv preprint arXiv:2206.07682 (2022)

[43]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., et al.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In: NeurIPS (2022)

[44]

Welbl, J., Liu, N.F., Gardner, M.: Crowdsourcing Multiple Choice Science Questions. arXiv preprint arXiv:1707.06209 (2017)

[45]

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., et al.: Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In: NeurIPS (2023)

[46]

Yu, Q., Zheng, Z., Song, S., Li, Z., Xiong, F., et al.: xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation. In: ICLR (2025)

[47]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a Machine Really Finish Your Sentence? In: ACL (2019)

[48]

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Lin, X.V., et al.: OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068 (2022)

[49]

Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate Before Use: Improving Few-Shot Performance of Language Models. In: ICML (2021)

[50]

Zhou, B., Khashabi, D., Ning, Q., Roth, D.: "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding. In: EMNLP (2019)

[51]

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043 (2023)

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling Supplementary Material

Silvia Cappelletti1⋆, Tobia Poppi1,2⋆, Samuele P...
