Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Daehui Kim; Deokhyung Kang; Gary Geunbae Lee; Hyounghun Kim; Seonjeong Hwang

arXiv: 2510.27269 · v3 · submitted 2025-10-31 · cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-18 03:12 UTC · model grok-4.3


keywords: multilingual reasoning · reasoning language models · language understanding failures · selective translation · multilingual gaps · input translation


The pith

Multilingual reasoning gaps in language models stem mainly from failures to translate inputs into English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning language models perform better on complex tasks in high-resource languages than in low-resource ones. The paper argues that this gap arises because models often fail to understand non-English inputs and convert them into English, the language used for their reasoning traces. Detection methods can identify these understanding failures to a useful degree. The authors introduce Selective Translation, which adds an English version of the input only when a failure is spotted. Experiments show this closes nearly the entire gap while translating only about 20 percent of inputs.

Core claim

The multilingual reasoning gap primarily stems from failures in language understanding—specifically, the model's inability to translate multilingual inputs into the language dominating its reasoning traces, typically English. Understanding failures are detectable to a meaningful extent, with supervised approaches working best. Selective Translation bridges the gap by incorporating an English translation into the initial reasoning trace only when an understanding failure is detected, achieving near full-translation performance while translating only about 20% of inputs.

What carries the argument

Selective Translation, a strategy that detects understanding failures and adds an English translation to the reasoning trace only for affected inputs.
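
The control flow of Selective Translation can be sketched as below. This is a minimal illustration, not the authors' implementation: the detector (`failure_score`), the translator (`to_english`), the 0.5 threshold, and the wording of the trace prefix are all hypothetical stand-ins introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreparedInput:
    question: str       # original (possibly non-English) question
    trace_prefix: str   # text seeded at the start of the reasoning trace
    translated: bool    # whether an English translation was added


def selective_translation(
    question: str,
    failure_score: Callable[[str], float],  # hypothetical detector: higher = more likely misunderstood
    to_english: Callable[[str], str],       # hypothetical translator
    threshold: float = 0.5,                 # illustrative cut-off, not taken from the paper
) -> PreparedInput:
    """Add an English translation to the trace only when a failure is detected."""
    if failure_score(question) >= threshold:
        english = to_english(question)
        # Assumed prefix wording; the paper's exact prefix may differ.
        prefix = f"The question in English: {english}\n"
        return PreparedInput(question, prefix, True)
    # No detected failure: the model reasons from the original input alone.
    return PreparedInput(question, "", False)


# Toy stand-ins so the sketch runs without a model or a translation service.
toy_scores = {"Swali la Kiswahili": 0.9, "Pregunta en español": 0.1}
toy_translations = {"Swali la Kiswahili": "A question in Swahili"}

results = [
    selective_translation(q, toy_scores.__getitem__, toy_translations.__getitem__)
    for q in toy_scores
]
translation_rate = sum(r.translated for r in results) / len(results)
```

In this toy run only the high-scoring input is translated, so the translator is never called for the other one. Raising or lowering `threshold` trades translation volume against the number of missed failures.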

If this is right

Focusing mitigation on input understanding allows most of the multilingual gap to close without translating every case.
Detection of understanding failures supports efficient, targeted fixes instead of blanket translation.
Supervised detection outperforms other methods for spotting when selective translation is needed.
The approach preserves nearly all benefits of full translation at much lower translation volume.
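
Detection quality in the paper's figures is reported as F1. A minimal sketch of that metric follows, assuming "not understood" is treated as the positive class; the label vectors are made-up toy data, not results from the paper.

```python
def f1_not_understood(y_true, y_pred):
    """F1 with 1 = 'not understood' (positive class) and 0 = 'understood'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: three real failures, of which the detector catches two,
# plus one false alarm.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_not_understood(y_true, y_pred)  # precision = recall = 2/3
```

Because understanding failures are the minority class in most languages, F1 on the positive class is more informative here than plain accuracy.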

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar selective detection could address other multilingual issues such as uneven cultural knowledge across languages.
Strengthening core language understanding in the base model might reduce the frequency of needed translations over time.
Checking whether the low translation rate holds on larger models would test how the method scales.

Load-bearing premise

That performance differences are driven by language understanding failures rather than deficits in reasoning capability once the input is understood or by gaps in cultural or domain knowledge.

What would settle it

If performance in low-resource languages stays lower even after all inputs receive accurate English translations before reasoning begins, the claim that understanding failures are the main cause would be weakened.
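
One way to make this test concrete is to measure how much of the gap full translation closes. The sketch below is an illustration only: the function name and the accuracy values are hypothetical, not figures from the paper.

```python
def share_of_gap_closed(acc_high: float, acc_low: float, acc_low_translated: float) -> float:
    """Fraction of the high-vs-low-resource accuracy gap removed by translation."""
    gap = acc_high - acc_low
    if gap <= 0:
        raise ValueError("no gap to close: low-resource accuracy is not below high-resource")
    return (acc_low_translated - acc_low) / gap


# Hypothetical accuracies: English 0.80, low-resource 0.60,
# low-resource with accurate English translations 0.78.
closed = share_of_gap_closed(0.80, 0.60, 0.78)
residual = 1.0 - closed
```

A share near 1 is consistent with understanding failures being the main cause. A large residual would point to reasoning or knowledge deficits that translation does not fix.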

Figures

Figures reproduced from arXiv: 2510.27269 by Daehui Kim, Deokhyung Kang, Gary Geunbae Lee, Hyounghun Kim, Seonjeong Hwang.

**Figure 1.** Understanding failure in Qwen3-4B: the model shows confusion when interpreting the Swahili input (e.g., “This is confusing...”) and ignores the “1 bad orange” condition, leading to an incorrect answer.

**Figure 2.** Weighted shares of Understanding, Reasoning, and Generation in the input language to the overall multilingual reasoning gap. Across models and datasets, failures in Understanding generally dominate the gap.

| Dataset | Qwen3-4B, Base | Qwen3-4B, w/ U | gpt-oss-20b, Base | gpt-oss-20b, w/ U |
|---|---|---|---|---|
| Low | 0.82±0.21 | 0.95±0.03 | 0.91±0.05 | 0.94±0.03 |
| Medium | 0.89±0.11 | 0.96±0.04 | 0.97±0.04 | 0.99±0.02 |
| High | 0.85±0.14 | 0.95±0.02 | 0.92±0.05 | 0.98±0.03 |

**Figure 3.** Scatter plot of Reasoning Performance Ratio…

**Figure 4.** F1 scores for understanding failure detection…

**Figure 5.** F1 score of understanding failure detection on…

**Figure 6.** Weighted shares of Understanding, Reasoning, and Generation in the input language to the overall multilingual reasoning gap on Qwen3-4B. Across different prefix variants, failures in Understanding dominate the gap.

| Method (Polymath-Low) | de | es | ar | ja | ko | th | bn | sw | te |
|---|---|---|---|---|---|---|---|---|---|
| Random baseline | 16.7 ± 28.9 | 0.0 ± 0.0 | 5.6 ± 9.6 | 3.2 ± 5.5 | 0.0 ± 0.0 | 9.4 ± 9.1 | 11.3 ± 10.5 | 68.0 ± 2.4 | 15.5 ± 5.6 |
| Avg confidence | 32.2 ± 20.7 | 16.5 ± 5… | … | … | … | … | … | … | … |

**Figure 7.** Distributions of three token-probability–based…

**Figure 8.** Language distributions of reasoning traces and final responses for Qwen3-4B across Polymath (low/medi…

**Figure 9.** Language distributions of reasoning traces and final responses for gpt-oss-20b across Polymath (low/medi…

**Figure 10.** Language distributions of reasoning traces and final responses for Qwen3-1.7B across Polymath…

**Figure 11.** Language distributions of reasoning traces and final responses for Qwen3-8B across Polymath (low/medi…

**Figure 12.** Language distributions of reasoning traces and final responses for Qwen3-14B across Polymath…

**Figure 13.** Language-specific Stage-wise Attribution Analysis for Qwen3-1.7B.

**Figure 14.** Language-specific Stage-wise Attribution Analysis for Qwen3-4B.

**Figure 15.** Language-specific Stage-wise Attribution Analysis for Qwen3-8B.

**Figure 16.** Language-specific Stage-wise Attribution Analysis for Qwen3-14B.

**Figure 17.** Language-specific Stage-wise Attribution Analysis for gpt-oss-20b.

Original abstract

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that this gap primarily stems from failures in language understanding, specifically the model's inability to translate multilingual inputs into the language dominating its reasoning traces (typically English). As identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace only when an understanding failure is detected. Experimental results using Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying its origin and suggesting a path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pins multilingual reasoning gaps on understanding failures rather than reasoning deficits and shows selective translation closes most of the gap at low cost on Qwen3-4B.


The main takeaway is that the multilingual reasoning gap comes mostly from the model failing to understand low-resource inputs and map them into its English reasoning process. The authors test detection methods and find supervised approaches work best, then show that translating only the detected failure cases gets near full-translation performance while touching only about 20% of inputs on Qwen3-4B. Releasing the code and data is useful for checking the numbers.

What they do well is give a practical, low-overhead mitigation tied to a clear causal story instead of another broad scaling experiment.

The soft spots are around whether understanding failures are cleanly separated from other issues. The experiments may not fully rule out that low-resource languages carry missing domain or cultural knowledge that external translation does not supply, or that detection is partly picking up overall task difficulty. The abstract reports the mitigation results on a single model and gives limited detail on baselines and statistical controls, so the attribution to understanding alone needs tighter checks in the full runs.

This paper is for researchers and engineers working on multilingual deployment of reasoning models who want targeted fixes rather than general improvements. A reader focused on efficiency and language equity would get concrete value from the selective strategy. It deserves peer review because the idea is testable and the results show a measurable effect, even if revisions would strengthen the isolation of causes.

Referee Report

2 major / 1 minor

Summary. The paper investigates the multilingual reasoning gap in reasoning language models (RLMs), claiming it primarily stems from failures in language understanding—specifically, the inability to translate non-English inputs into the model's dominant reasoning language (typically English). The authors evaluate detection methods for these failures (supervised approaches perform best), and introduce Selective Translation, which adds an English translation to the reasoning trace only upon detected failure. Experiments on Qwen3-4B show this closes most of the gap while translating only ~20% of inputs, with code and data released publicly.

Significance. If the results hold, the work provides a mechanistic explanation for multilingual gaps in RLMs and a practical, low-overhead mitigation strategy. It suggests understanding and reasoning can be partially decoupled in these models and offers a path to more equitable performance. The public code and data release is a clear strength for reproducibility.

major comments (2)

Abstract and experimental results: The central claim that the gap 'primarily stems from failures in language understanding' and that selective translation reaches 'near full-translation performance' requires explicit evidence that translated low-resource inputs yield reasoning performance equivalent to high-resource cases. Without controls for domain/cultural knowledge gaps or weaker internal representations that persist after surface translation, the attribution to understanding failures alone is not fully isolated.
Detection methods section: Supervised detection is reported as best, but the manuscript must clarify how 'understanding failure' labels are constructed (e.g., whether they derive from downstream performance on the same tasks). If labels correlate with overall difficulty rather than language-specific comprehension, this risks circularity when using detection to explain performance gaps.

minor comments (1)

The abstract and methods would benefit from specifying the exact languages tested, dataset sizes, and statistical tests (with confidence intervals) supporting the 'substantially bridges' and 'near full-translation' claims.
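
As an illustration of the kind of statistical support this comment asks for, the sketch below compares two sets of per-run accuracies with Welch's unequal-variance t-test. This is one common choice, not necessarily the test the authors use. The function name and sample values are hypothetical, and the interval uses a normal approximation because the standard library has no t-distribution quantile.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(a, b, level=0.95):
    """Welch's t statistic, Welch-Satterthwaite degrees of freedom,
    and a normal-approximation confidence interval for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    diff = mean(a) - mean(b)
    se = sqrt(va + vb)
    t_stat = diff / se
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # Normal approximation; for small df a t quantile gives a wider interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return t_stat, df, (diff - z * se, diff + z * se)


# Made-up samples with equal variance so the arithmetic is easy to check by hand.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
selective = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, df, ci = welch_test(baseline, selective)
```

With only a handful of runs per condition, the interval should be read as approximate. Substituting a proper t quantile would make it wider.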

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment below in a point-by-point manner, indicating where we will make revisions to strengthen the paper.


Referee: Abstract and experimental results: The central claim that the gap 'primarily stems from failures in language understanding' and that selective translation reaches 'near full-translation performance' requires explicit evidence that translated low-resource inputs yield reasoning performance equivalent to high-resource cases. Without controls for domain/cultural knowledge gaps or weaker internal representations that persist after surface translation, the attribution to understanding failures alone is not fully isolated.

Authors: We appreciate the referee's emphasis on isolating the causal factors. Our experiments demonstrate that full translation of low-resource inputs to English yields performance levels approaching those observed on high-resource languages for the same tasks, as reflected in the near full-translation results. To further support the attribution to understanding failures, our evaluations focus on reasoning benchmarks such as mathematical and logical problems, which minimize cultural and domain-specific knowledge dependencies. We acknowledge that surface translation may not address all potential internal representation issues. In the revised manuscript, we will add explicit comparisons of translated low-resource performance against high-resource baselines, along with a discussion of these potential confounds and any supporting analyses.

Revision: partial

Referee: Detection methods section: Supervised detection is reported as best, but the manuscript must clarify how 'understanding failure' labels are constructed (e.g., whether they derive from downstream performance on the same tasks). If labels correlate with overall difficulty rather than language-specific comprehension, this risks circularity when using detection to explain performance gaps.

Authors: We agree that explicit clarification of the labeling process is essential to address concerns about circularity. The understanding failure labels are derived from a proxy comparison of model behavior on the same queries presented in English (where comprehension is assumed successful due to the model's dominant reasoning language) versus the original language, using a held-out set of examples distinct from the primary evaluation tasks. This construction targets language-specific comprehension rather than overall task difficulty. We will revise the detection methods section to include a detailed description of the labeling procedure, including steps taken to ensure independence from downstream performance metrics and to mitigate correlation with general difficulty.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on new experiments


The paper derives its central claim—that multilingual reasoning gaps primarily arise from language understanding failures (inability to internally translate inputs to English)—through direct empirical evaluations of detection methods and the Selective Translation intervention on Qwen3-4B. These results are obtained from observable performance metrics and mitigation outcomes rather than any self-definitional equations, fitted parameters presented as predictions, or load-bearing self-citations. The derivation chain is self-contained against external benchmarks, with the proposed strategy's effectiveness measured independently via translation ratios and gap closure, without reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the domain assumption that reasoning traces are dominated by English and on the ability to isolate understanding failures from other performance factors; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Reasoning traces in RLMs are dominated by English.
Invoked when describing the target language for translation of multilingual inputs.



