Multilingual Reasoning Cascades Need More Context

Arnav Mazumder; Dengjia Zhang; Niyati Bafna; Shuyue Stella Li; Yulia Tsvetkov

arxiv: 2606.27306 · v1 · pith:2SPNRU67new · submitted 2026-06-25 · 💻 cs.CL


Pith reviewed 2026-06-26 04:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual reasoning · translation cascades · context preservation · machine translation · open-ended generation · error propagation · low-resource languages


The pith

Adding the original question to the final translation step improves multilingual reasoning cascades.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Translation cascades translate a query into English, reason in English, and translate the answer back to the source language, but each step can discard cues needed later, such as cultural context or disambiguation. The paper tests a training-free fix that feeds the original question, its English version, and the reasoning trace into the final translation module. In an evaluation spanning nine benchmarks, three models, and 285 languages from high- to low-resource settings, this change produces strong gains on open-ended generation. Most of the benefit traces to keeping the original-language question available until the end. The work therefore argues that cascades should be redesigned to preserve source information rather than letting it drop at each stage.

Core claim

A context-aware translation cascade that supplies the original question, its English translation, and the reasoning trace to the final translation module produces strong performance gains on open-ended generation tasks across models and language resource levels, with the original question supplying the majority of the useful context.

What carries the argument

The context-aware translation cascade, which augments the input to the final translation module with the original question and prior reasoning trace to reduce information loss.
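The difference between the two cascades can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Model` and `Reasoner` callables, the `Trace` container, and the prompt wording are all hypothetical stand-ins, and only the set of fields passed to the final translation step follows the description above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces: any function with these shapes can be plugged in.
Model = Callable[[str], str]                   # prompt -> reply
Reasoner = Callable[[str], Tuple[str, str]]    # English question -> (trace, answer)


@dataclass
class Trace:
    q_t: str  # original question in the user's language
    q_e: str  # English translation produced by the first translation step
    r_e: str  # English reasoning trace
    a_e: str  # English answer


def front_half(q_t: str, mt1: Model, reasoner: Reasoner) -> Trace:
    """Stages shared by both cascades: translate to English, then reason."""
    q_e = mt1(f"Translate the following question to English: {q_t}")
    r_e, a_e = reasoner(q_e)
    return Trace(q_t=q_t, q_e=q_e, r_e=r_e, a_e=a_e)


def mt2_prompt_standard(trace: Trace, language: str) -> str:
    """Standard cascade: the final step sees only the English answer."""
    return f"Translate the sentence to {language}.\nInput: {trace.a_e}"


def mt2_prompt_context(trace: Trace, language: str) -> str:
    """Context-aware cascade: the final step also sees q_t, q_e and r_e."""
    return (
        f"Answer the question in {language}.\n"
        f"Input: {trace.q_t}\n"
        f"English Question: {trace.q_e}\n"
        f"English Thinking Process: {trace.r_e}\n"
        f"English Answer: {trace.a_e}"
    )


def run_cascade(q_t: str, language: str, mt1: Model, reasoner: Reasoner,
                mt2: Model, context_aware: bool) -> str:
    """Run either cascade; only the prompt given to mt2 differs."""
    trace = front_half(q_t, mt1, reasoner)
    build = mt2_prompt_context if context_aware else mt2_prompt_standard
    return mt2(build(trace, language))
```

Because the first two stages are shared, any difference in output between `context_aware=True` and `context_aware=False` comes from what the final step is shown, which is the comparison the paper draws.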

If this is right

Gains appear consistently for open-ended generation across the tested models and resource regimes.
The original-language question accounts for most of the observed benefit.
Preserving the source question through the full pipeline offers a simple default strategy for reducing error propagation in cascades.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same preservation tactic could be tested in other multi-stage translation or reasoning pipelines outside the nine benchmarks.
If the original question is the dominant signal, future cascade designs might prioritize early-stage retention over later-stage additions.
The approach might lower the need for model-specific fine-tuning when moving across languages.

Load-bearing premise

That the measured gains come from the added context itself rather than from longer prompts or other uncontrolled variables, and that results on the nine benchmarks will hold for broader real-world multilingual reasoning.

What would settle it

A controlled experiment that matches prompt length exactly while adding the original question and finds no remaining performance difference on the same benchmarks.
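One way such a length-matched control could be set up is sketched below. It is an assumption-laden illustration rather than a protocol from the paper: `count_tokens` uses whitespace splitting as a stand-in for the backbone model's tokenizer, the filler word is arbitrary, and `build_mt2_input` is a hypothetical prompt layout.

```python
def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model's own tokenizer."""
    return len(text.split())


def neutral_padding(n_tokens: int, filler: str = "the") -> str:
    """Content-free filler with exactly n_tokens whitespace tokens."""
    return " ".join([filler] * n_tokens)


def build_mt2_input(extra_context: str, english_answer: str) -> str:
    # Hypothetical prompt layout, kept identical across both conditions.
    return f"Context: {extra_context}\nEnglish Answer: {english_answer}"


def control_pair(original_question: str, english_answer: str):
    """Return (treated, control) inputs that differ in content, not length."""
    treated = build_mt2_input(original_question, english_answer)
    padding = neutral_padding(count_tokens(original_question))
    control = build_mt2_input(padding, english_answer)
    return treated, control
```

Whitespace counting is meaningless for languages written without spaces, so a real run would need to replace `count_tokens` with the tokenizer of the model under test and pad to that count instead.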

Figures

Figures reproduced from arXiv: 2606.27306 by Arnav Mazumder, Dengjia Zhang, Niyati Bafna, Shuyue Stella Li, Yulia Tsvetkov.

**Figure 1.** Standard cascade (Cstd, orange) vs. context-aware cascade (Cctx, orange+blue). In Cstd, MT2 translates only the English answer ae. In Cctx, MT2 additionally receives the original target-language question qt, the English question qe, and the English reasoning trace re, allowing it to output more grounded responses as well as perform error recovery based on context discarded by the standard cascade.

**Figure 2.** Gains of Cctx over Cstd across resource levels on Global-PIQA-OE.

**Figure 3.** E2E prompt for Open-ended QA.

**Figure 4.** MT1 prompt for Open-ended QA.

**Figure 5.** LLM prompt for Open-ended QA.

**Figure 6.** MT2 Cstd prompt for Open-ended QA.

**Figure 7.** MT2 Cctx prompt for Open-ended QA.

**Figure 8.** E2E prompt for Multiple-choice.

**Figure 9.** MT1 prompt for Multiple-choice.

**Figure 10.** LLM prompt for Multiple-choice.

**Figure 11.** MT2 Cctx prompt for Multiple-choice.

**Figure 12.** E2E prompt for Math.

**Figure 13.** MT1 prompts for Math.

**Figure 14.** LLM prompts for Math.

**Figure 15.** MT2 Cctx prompts for Math.

**Figure 16.** E2E prompt for analyzing translation error.

**Figure 17.** Gains of the context-aware cascade (Cctx) over the standard cascade (Cstd) across resource levels on Global-MMLU and Belebele for Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3.

**Figure 18.** Improvements over Cstd on open-ended datasets for Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. Bars show the change in chrF when MT2 is given full context (Cctx) or ablated context variants: qt + ae, qe + ae, and re + ae.

G.1 Llama-3.1-8B-Instruct, open-ended generation (chrF):

| Dataset | Cstd | Cctx | qt + ae | qe + ae | re + ae |
| --- | --- | --- | --- | --- | --- |
| Aya | 9.56 | 14.26 | 13.48 | 11.08 | 11.73 |
| BLEnD | 9.47 | 12.14 | 11.88 | 10.59 | 8.76 |
| Global-PIQA-OE | … | | | | |
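The Llama-3.1-8B-Instruct chrF values quoted with Figure 18 allow a quick check of how much of the full-context gain the original question alone recovers. The sketch below only does arithmetic on those quoted numbers. It assumes the values are listed in the column order Cstd, Cctx, qt + ae, qe + ae, re + ae, and the function names are illustrative.

```python
# chrF scores as quoted for Llama-3.1-8B-Instruct; the column order is an
# assumption read off the extracted header (Cstd, Cctx, qt+ae, qe+ae, re+ae).
SCORES = {
    "Aya":   {"Cstd": 9.56, "Cctx": 14.26, "qt+ae": 13.48, "qe+ae": 11.08, "re+ae": 11.73},
    "BLEnD": {"Cstd": 9.47, "Cctx": 12.14, "qt+ae": 11.88, "qe+ae": 10.59, "re+ae": 8.76},
}


def gains_over_standard(row):
    """chrF change of every variant relative to the standard cascade."""
    base = row["Cstd"]
    return {name: score - base for name, score in row.items() if name != "Cstd"}


def share_of_full_gain(row, variant):
    """Fraction of the full-context (Cctx) gain recovered by one variant."""
    gains = gains_over_standard(row)
    return gains[variant] / gains["Cctx"]


for dataset, row in SCORES.items():
    share = share_of_full_gain(row, "qt+ae")
    print(f"{dataset}: qt+ae recovers {share:.0%} of the Cctx gain")
```

Under that column reading, qt + ae recovers 3.92 of the 4.70 chrF points gained by full Cctx on Aya (about 83%) and 2.41 of 2.67 on BLEnD (about 90%), while re + ae falls 0.71 points below Cstd on BLEnD. Two rows from one model are not enough to generalize, but they are consistent with the claim that the original question carries most of the benefit.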

Original abstract

Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but structurally lossy, since each stage discards information later stages may need, including cues for cultural grounding, register, and disambiguation. We examine the benefits of a simple and training-free intervention: a context-aware translation cascade, which additionally provides the original question, the English translated question, and the reasoning trace to the context of the final translation module. We evaluate gains across nine multilingual benchmarks including various task types, three backbone models, and 285 high-, mid-, and low-resource languages, and demonstrate strong gains for open-ended generation across models and resource regimes. We show that the original language question carries most of the beneficial context. Our study emphasizes the need to better design information flow in machine translation cascades for mitigating error propagation, and provides a simple and actionable default strategy: preserve the original user question until the end of the pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that routing the original-language question to the final translation step improves multilingual reasoning cascades, but the gains could come from longer prompts rather than the added context.


The main thing to know is that adding the original question, its English version, and the reasoning trace to the last translation module in a cascade produces clear gains on open-ended tasks, and the original-language question accounts for most of it.

The work is a direct, training-free tweak on existing cascades. Standard pipelines lose cultural and disambiguation cues when they translate to English for reasoning and back out. This version keeps more of that information until the end. The evaluation covers nine benchmarks, three models, and 285 languages across resource levels, which is a solid range. They also break out which piece of the added context drives the result.

The soft spot is the absence of a length-matched control. The intervention lengthens the prompt to the final module, and the paper credits the semantic content of the original question. Without a baseline that pads with neutral tokens or repeated text to the same length, the improvement might reflect changes in attention or generation behavior from extra tokens instead. The abstract does not describe such a check.

This is for teams already running multilingual reasoning pipelines who want a cheap default adjustment. The scale of the test and the practical framing make it worth a referee's time, even if the causal story needs tightening on length effects.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard translation cascades for multilingual reasoning are structurally lossy because each stage discards information (e.g., cultural cues, register, disambiguation) needed later. It proposes a simple, training-free context-aware cascade that appends the original-language question, the English translated question, and the reasoning trace to the final translation module. Evaluation across nine multilingual benchmarks, three backbone models, and 285 high/mid/low-resource languages shows strong gains for open-ended generation tasks, with the original-language question supplying most of the benefit; the work recommends preserving the original question through the pipeline.

Significance. If the gains are robustly due to semantic context rather than prompt length or other factors, the work supplies a practical default strategy for reducing error propagation in multilingual MT cascades. The scale of the evaluation (285 languages, multiple models and task types) is a clear strength that would make the result broadly relevant if the attribution holds.

major comments (2)

[§4 (Experiments)] The context-aware cascade necessarily lengthens the prompt to the final translation module, yet no length-matched control (padding with neutral tokens, repeated text, or random strings of equal token count) is described. This directly undermines the claim that 'the original language question carries most of the beneficial context,' because observed gains on open-ended generation could arise from changes in input length or attention distribution rather than informational content.
[§5 (Results)] The paper reports 'strong gains' across models and resource regimes but supplies limited statistical detail (variance, significance tests, or exact effect sizes) and no explicit controls for prompt-length confounds. This weakens support for the generalization claim over 285 languages and makes the attribution to context load-bearing for the central recommendation.

minor comments (1)

[Abstract] The abstract lists 'nine multilingual benchmarks' without naming them or the task types; adding this would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on potential confounds and statistical reporting. We address the two major comments point by point below, acknowledging where controls were missing and outlining revisions.


Referee: [§4 (Experiments)] The context-aware cascade necessarily lengthens the prompt to the final translation module, yet no length-matched control (padding with neutral tokens, repeated text, or random strings of equal token count) is described. This directly undermines the claim that 'the original language question carries most of the beneficial context,' because observed gains on open-ended generation could arise from changes in input length or attention distribution rather than informational content.

Authors: We agree this is a valid concern and a limitation of the current experiments: no length-matched controls were performed, so length or attention effects cannot be fully ruled out as alternative explanations for gains on open-ended tasks. Our existing ablations (comparing original-question context against English-question or reasoning-trace context) provide some evidence that content matters, as the original-question variant outperformed the others despite comparable or shorter added length in many languages. To strengthen attribution, we will add length-matched controls (padding with repeated neutral text of equal token count) in the revised manuscript and report the results.

revision: yes

Referee: [§5 (Results)] The paper reports 'strong gains' across models and resource regimes but supplies limited statistical detail (variance, significance tests, or exact effect sizes) and no explicit controls for prompt-length confounds. This weakens support for the generalization claim over 285 languages and makes the attribution to context load-bearing for the central recommendation.

Authors: We acknowledge the limited statistical detail in the current version. In revision we will add per-model variance, statistical significance tests (where sample sizes permit), and effect sizes, along with the length-matched controls described in response to the first comment. These changes will better support the generalization across 285 languages and the recommendation to preserve the original question.

revision: yes
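As an illustration of the kind of statistical detail requested, a paired bootstrap over per-item scores gives an effect size and a confidence interval for the Cctx minus Cstd difference. The sketch below is one common choice, not a procedure the paper or the rebuttal specifies. The function name, the resample count, and the percentile interval are assumptions, and the inputs are expected to be per-item metric scores for the same items under both cascades.

```python
import random
from statistics import mean


def paired_bootstrap(std_scores, ctx_scores, n_resamples=2000, seed=0):
    """Mean paired difference (ctx - std) with a 95% percentile interval."""
    if len(std_scores) != len(ctx_scores) or not std_scores:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [c - s for s, c in zip(std_scores, ctx_scores)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(mean(sample))
    resampled_means.sort()
    low = resampled_means[int(0.025 * n_resamples)]
    high = resampled_means[int(0.975 * n_resamples) - 1]
    return mean(diffs), (low, high)
```

If the interval excludes zero for a given model and language group, the gain is unlikely to be resampling noise. Pairing by item matters here because both cascades answer the same questions, so item difficulty cancels out of the difference.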

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations


The paper describes an empirical intervention (context-aware translation cascade) and reports benchmark results across models, languages, and tasks. No equations, fitted parameters, predictions, or uniqueness theorems are present that could reduce to inputs by construction. Claims rest on direct comparisons to baselines, with no self-citation load-bearing steps or ansatzes smuggled in. This is the standard case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study of an intervention in existing cascade pipelines; introduces no new mathematical parameters, axioms, or postulated entities.

