arxiv: 2602.10732 · v2 · submitted 2026-02-11 · 💻 cs.CL

Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Alaa Elsetohy, Sama Hadhoud, Haryo Akbarianto Wibowo, Chenxi Whitehouse, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual benchmark · multicultural reasoning · template filling · large language models · cross-lingual evaluation · reasoning types · cultural aspects

The pith

Reasoning-mode LLMs achieve 80.8% overall accuracy and near parity between English and local languages on a new multicultural reasoning benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Macaron, a benchmark that uses 100 language-agnostic templates to separate reasoning types from cultural contexts. Native annotators fill these templates to generate multiple-choice and true/false questions in 20 languages from 20 cultural contexts. This design allows testing whether models can reason the same way when the cultural premises change. Evaluations show that reasoning-mode models perform best and maintain similar scores in English and local languages, while open-weight models drop in local languages and often approach chance on true/false items. Mathematical and counting questions are the most difficult for all models.

Core claim

By creating 100 language-agnostic templates that cover seven reasoning types and 22 cultural aspects, and having native annotators produce scenario-aligned questions in English and local languages, Macaron provides a controlled way to measure multilingual and multicultural reasoning without the biases of direct translation or uncontrolled cultural datasets. The resulting 11,862 instances show that reasoning-mode models reach 80.8 percent overall accuracy with near parity between languages.

What carries the argument

The set of 100 language-agnostic templates that factorize reasoning type from cultural aspect, which native annotators fill to produce aligned questions across languages.
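The factorization described above can be pictured with a small sketch. It assumes, purely for illustration, that a template is a question skeleton with named slots and two labels (reasoning type and cultural aspect); the `Template` class, its field names, and the example content are all hypothetical and are not taken from the paper or its dataset. In the benchmark itself the local-language question is written by a native annotator, so string substitution here only shows how the labels carry over to each filled instance.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class Template:
    """Hypothetical template record; field names are illustrative only."""
    template_id: str
    reasoning_type: str   # one of the benchmark's reasoning types
    cultural_aspect: str  # one of the benchmark's cultural aspects
    pattern: str          # question skeleton with named slots

    def slots(self) -> set:
        # Named placeholders an annotator has to supply.
        return {name for _, name, _, _ in Formatter().parse(self.pattern) if name}

    def fill(self, language: str, **values: str) -> dict:
        # Refuse to produce an instance while any slot is still empty.
        missing = self.slots() - set(values)
        if missing:
            raise ValueError(f"unfilled slots: {sorted(missing)}")
        return {
            "template_id": self.template_id,
            "reasoning_type": self.reasoning_type,
            "cultural_aspect": self.cultural_aspect,
            "language": language,
            "question": self.pattern.format(**values),
        }


# Invented example content, not drawn from the benchmark.
template = Template(
    template_id="T001",
    reasoning_type="counting",
    cultural_aspect="food",
    pattern="{person} buys {total} {item} and gives away {given}. "
            "How many {item} are left?",
)
english = template.fill("en", person="Amina", total="5", item="samosas", given="2")
print(english["question"])
```

Because every filled instance keeps its `template_id`, results can later be grouped by template, reasoning type, or cultural aspect regardless of the question language.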

If this is right

Reasoning-mode models demonstrate greater robustness to language and cultural variation than open-weight models.
Culture-grounded mathematical and counting tasks remain challenging even for top models.
Systematically derived true/false questions expose guessing behavior in weaker models.
Template-filling enables scalable creation of balanced multilingual test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending this template method to additional low-resource languages could further test generalization.
The performance gap suggests that pretraining data imbalances affect reasoning over non-English cultural content.
Future work could apply similar controls to other reasoning benchmarks to isolate cultural effects.

Load-bearing premise

Native annotators can fill the templates to produce questions that test exactly the same reasoning type and difficulty across cultures without introducing uncontrolled differences in phrasing or interpretation.

What would settle it

If native speakers rate the local-language questions as having different difficulty or cultural fit compared to English versions for the same template, despite matching reasoning type, the control over equivalence would be shown to fail.

Original abstract

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types, 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions, and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages and dialects (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance (80.8% overall) and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed here https://huggingface.co/datasets/AlaaAhmed2444/Macaron.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Macaron's template-first construction that separates reasoning type from cultural content via language-agnostic templates is a clear step forward from translated or culture-first benchmarks, and the model evaluations show consistent patterns, but the absence of any checks on whether filled templates preserve equivalent difficulty leaves the performance claims only moderately supported.

Macaron uses 100 language-agnostic templates across seven reasoning types and 22 cultural aspects. Native annotators fill them to produce matched English and local-language multiple-choice questions plus systematically derived true/false items, yielding 11,862 instances over 20 languages and 20 countries, including low-resource languages such as Amharic and Yoruba. The zero-shot results on 21 models indicate that reasoning-mode models reach 80.8% overall with near parity between English and local languages, while open-weight models drop sharply on local versions and often approach chance on true/false tasks; math and counting templates are hardest throughout.

This factorization approach is new relative to the translated and culture-first benchmarks described in the abstract, and the scale plus public release of the data are practical strengths.

The main gap is the lack of any reported validation that the filled templates actually hold reasoning demand and difficulty constant across languages. No inter-annotator agreement, difficulty ratings, or pilot calibration is described, so cultural reinterpretation of the same template could alter effective hardness without detection, especially on counting or math items. That makes the headline performance differences harder to interpret cleanly.

The work is aimed at researchers building or testing multilingual LLMs who need more controlled cultural reasoning data. It deserves peer review because the construction method is systematic and the reported patterns are worth testing further, even if the controls need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces Macaron, a template-first benchmark for multilingual and multicultural reasoning that uses 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects. Native annotators fill these to create scenario-aligned English and local-language multiple-choice questions plus derived true/false items, yielding 11,862 instances across 20 languages, 10 scripts, and 20 countries. Zero-shot evaluation of 21 LLMs shows reasoning-mode models reaching 80.8% overall with near English-local parity, while open-weight models degrade on local languages and approach chance on T/F tasks; culture-grounded math and counting templates are hardest.
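The two quantities the summary leans on, per-group accuracy and the English versus local-language gap, can be computed as in the sketch below. The record layout (`version`, `gold`, `prediction`) and the function names are assumptions made for this example, not the paper's evaluation code, and the toy records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by the given field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["prediction"] == record["gold"])
    return {group: correct[group] / totals[group] for group in totals}


def parity_gap(records):
    """English accuracy minus local-language accuracy (0.0 means parity)."""
    acc = accuracy_by_group(records, "version")
    return acc["english"] - acc["local"]


# Invented toy records: two English items and two local-language items.
toy_records = [
    {"version": "english", "gold": "A", "prediction": "A"},
    {"version": "english", "gold": "B", "prediction": "B"},
    {"version": "local", "gold": "A", "prediction": "A"},
    {"version": "local", "gold": "B", "prediction": "C"},
]
print(accuracy_by_group(toy_records, "version"))
print(parity_gap(toy_records))
```

The same `accuracy_by_group` helper can be pointed at a reasoning-type or cultural-aspect field to reproduce the kind of breakdown in which math and counting templates come out hardest.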

Significance. If the template-filling process demonstrably preserves equivalent reasoning demand across languages, the benchmark would be a valuable controlled resource for the field, addressing the gap between translated English-centric datasets and uncontrolled culture-first collections. The public Hugging Face release and explicit factorization of reasoning type from cultural aspect support reproducible evaluation of multicultural LLM capabilities.

major comments (2)

[Abstract / Construction Process] Abstract and construction description: the headline claims (80.8% for reasoning-mode models, near-parity, and substantial degradation for open-weight models on local languages and T/F) rest on the unverified assumption that the 100 templates, once filled by natives, produce questions of matched reasoning type and difficulty. No inter-annotator agreement, pilot calibration, difficulty ratings, or quantitative checks on reasoning fidelity are reported, leaving open the possibility that cultural reinterpretation (especially in counting/math templates) alters effective difficulty without detection.
[Evaluation] Evaluation section: the reported performance gaps between model classes and between English vs. local languages cannot be confidently interpreted without evidence that template fillings maintain equivalent reasoning demand. The systematic T/F derivation is described but not validated for consistency across the 22 cultural aspects.

minor comments (2)

[Methods] The manuscript would benefit from including one or two concrete template examples in the main text to illustrate the language-agnostic design.
[Results] Table or figure captions could more explicitly state the number of instances per language and per reasoning type to aid quick assessment of balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of validating reasoning equivalence across languages and cultural contexts. We address the major comments point by point below, clarifying our approach and committing to revisions that strengthen the evidence for template fidelity without altering the core claims.

Referee: [Abstract / Construction Process] Abstract and construction description: the headline claims (80.8% for reasoning-mode models, near-parity, and substantial degradation for open-weight models on local languages and T/F) rest on the unverified assumption that the 100 templates, once filled by natives, produce questions of matched reasoning type and difficulty. No inter-annotator agreement, pilot calibration, difficulty ratings, or quantitative checks on reasoning fidelity are reported, leaving open the possibility that cultural reinterpretation (especially in counting/math templates) alters effective difficulty without detection.

Authors: We acknowledge that the current manuscript does not include quantitative inter-annotator agreement or pilot calibration statistics. The templates were iteratively designed by the authors to isolate reasoning types from cultural aspects, with explicit instructions to native annotators to preserve logical structure while localizing only the scenario details. In revision, we will add an expanded construction section that includes the full annotation guidelines, a post-hoc consistency analysis on a stratified sample of 300 instances (with two independent native speakers per language verifying reasoning type preservation), and annotator-provided difficulty ratings. This will directly address the concern for counting and math templates.

revision: yes

Referee: [Evaluation] Evaluation section: the reported performance gaps between model classes and between English vs. local languages cannot be confidently interpreted without evidence that template fillings maintain equivalent reasoning demand. The systematic T/F derivation is described but not validated for consistency across the 22 cultural aspects.

Authors: The evaluation interprets gaps relative to the shared template origin, which by design holds reasoning demand constant between English and local-language versions. True/False items are derived deterministically from the multiple-choice correct answer by converting it to a declarative statement and generating false variants through minimal factual alterations that retain cultural grounding. We agree that explicit validation examples would improve interpretability. In the revised manuscript we will add an appendix subsection with representative T/F derivations across at least five distinct cultural aspects, plus a table summarizing annotator confirmation that reasoning demand was preserved.

revision: yes
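The derivation described in the second response can be sketched as follows. This is a minimal illustration under stated assumptions: the declarative statement is modeled as a frame with an `{answer}` slot, and the "minimal factual alterations" are approximated by substituting the multiple-choice distractors, which the simulated rebuttal does not confirm. The function name, the frame, and the example item are hypothetical.

```python
def derive_true_false(statement_frame, options, answer_index):
    """Turn one multiple-choice item into labeled true/false statements.

    statement_frame: declarative sentence with an {answer} slot (assumed format).
    options: the multiple-choice options, in order.
    answer_index: position of the correct option in `options`.
    """
    items = []
    for index, option in enumerate(options):
        items.append({
            "statement": statement_frame.format(answer=option),
            # Only the statement built from the correct option is true.
            "label": index == answer_index,
        })
    return items


# Invented example item, not drawn from the benchmark.
tf_items = derive_true_false(
    statement_frame="After giving away 2 of her 5 samosas, Amina has {answer} left.",
    options=["2", "3", "5", "7"],
    answer_index=1,
)
for item in tf_items:
    print(item["label"], item["statement"])
```

The procedure is deterministic: the same multiple-choice item always yields the same statements and labels, which is the property the response relies on. How many false variants are kept per item, and therefore the label balance, is left open here.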

Circularity Check

0 steps flagged

No circularity: benchmark construction with direct empirical evaluation

This is a benchmark paper that defines 100 language-agnostic templates, describes native-annotator filling to create 11,862 instances, and reports zero-shot model accuracies on the resulting dataset. No equations, fitted parameters, or predictions are present. Results (e.g., 80.8% for reasoning-mode models) are direct measurements on the constructed data, not quantities that reduce to internal definitions or self-citations by construction. The central claim rests on the external validity of the template-filling process, which is not a mathematical derivation and therefore cannot exhibit circularity of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that native annotators can reliably produce culturally accurate and reasoning-equivalent questions from the templates; no free parameters or new invented entities are introduced.

axioms (1)

Domain assumption: Native annotators can create scenario-aligned English and local-language questions that preserve the intended reasoning type and cultural aspect without introducing uncontrolled bias or difficulty shifts.
This assumption underpins the entire data creation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1203 out tokens · 49107 ms · 2026-05-16T05:50:12.909809+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry lists the Lean source file, the theorem name, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Using 100 language-agnostic templates that cover 7 reasoning types, 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Culture-grounded mathematical and counting templates are consistently the hardest"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.