pith. machine review for the scientific record.

arxiv: 2604.04720 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

Recognition: no theorem link

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual reasoning · reasoning traces · large reasoning models · logistic regression · sparse autoencoders · test-time steering · mathematical reasoning · language variation

The pith

In large reasoning models, trace features that aid accuracy in one language can reduce it in another.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks what actually makes reasoning traces effective when large reasoning models work in languages other than English. It defines a set of measurable features that track multilingual alignment, the quality of individual reasoning steps, and the overall flow of the trace. Logistic regression then measures how strongly each feature predicts whether the final answer is correct. The results across two math benchmarks, four models, and ten languages show that most features link positively to accuracy, yet the size of the link changes sharply by language and can turn negative in some cases. Sparse autoencoders are used to find additional hidden concepts in the traces, and the features are tested as simple selection rules at inference time.

Core claim

Effective multilingual reasoning does not reduce to making every language copy English reasoning patterns. A suite of measurable features covering multilingual alignment, reasoning step properties, and reasoning flow shows positive but highly variable associations with final-answer accuracy when logistic regression is applied to traces from two mathematical benchmarks, four large reasoning models, and ten languages. Some associations even reverse sign. Sparse autoencoders trained on the same traces surface latent reasoning concepts that either match or extend the hand-defined features. Treating the features as test-time selection policies demonstrates their practical use for steering models.

What carries the argument

The suite of measurable reasoning features (multilingual alignment, reasoning step, and reasoning flow) together with logistic regression that quantifies each feature's association with accuracy and sparse autoencoders that extract latent concepts from the traces.
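The feature-analysis step can be sketched as a per-language logistic regression of answer correctness on a trace feature. Everything below (feature values, languages, effect sizes) is synthetic and illustrative, not the paper's data or code; it only shows how a feature's association with accuracy can reverse sign across languages.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(n, effect):
    """Synthetic traces: one feature whose true effect on correctness varies."""
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-effect * x))
    y = (rng.random(n) < p).astype(int)
    return x.reshape(-1, 1), y

# Hypothetical scenario: the same feature helps in English but hurts in Korean.
langs = {"en": simulate(4000, +2.0), "ko": simulate(4000, -2.0)}

coefs = {}
for lang, (X, y) in langs.items():
    model = LogisticRegression().fit(X, y)
    coefs[lang] = float(model.coef_[0, 0])
    print(f"{lang}: coefficient = {coefs[lang]:+.2f}")
```

Here the fitted coefficient is positive for "en" and negative for "ko", the kind of sign reversal the paper reports.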

If this is right

  • Reward designs for training large reasoning models should move from English-centric templates to adaptive objectives that respect language-specific patterns.
  • Multilingual benchmarks should score reasoning traces according to the features that actually predict success in each language rather than a single universal standard.
  • Test-time selection policies based on these features can improve accuracy without retraining by preferring stronger traces per language.
  • Latent concepts discovered by sparse autoencoders supply additional interpretable dimensions that go beyond the hand-defined feature list.
  • Persistent performance gaps between English and other languages may arise from mismatched reasoning styles rather than data volume alone.
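The test-time selection idea in the third bullet can be sketched as best-of-n trace selection: sample several traces per problem and keep the one scoring highest on a feature that predicts success in that language. The scoring model and data here are invented placeholders, not the paper's pipeline.

```python
import random

random.seed(0)

def sample_traces(n_candidates):
    """Hypothetical candidates: each trace has a feature score and a hidden
    correctness flag; higher-scoring traces are more often correct."""
    traces = []
    for _ in range(n_candidates):
        score = random.random()
        correct = random.random() < 0.2 + 0.6 * score  # score predicts success
        traces.append({"score": score, "correct": correct})
    return traces

def select_by_feature(traces):
    """Selection policy: keep the candidate with the highest feature score."""
    return max(traces, key=lambda t: t["score"])

problems = [sample_traces(8) for _ in range(500)]

baseline = sum(p[0]["correct"] for p in problems) / len(problems)  # first sample
selected = sum(select_by_feature(p)["correct"] for p in problems) / len(problems)
print(f"pass@1 first-sample: {baseline:.2f}, feature-selected: {selected:.2f}")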

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reversals in feature-accuracy links imply that directly optimizing models to produce English-like traces could actively lower performance in certain languages.
  • The same feature-analysis pipeline could be applied to non-mathematical tasks to check whether language-specific reasoning patterns appear outside math.
  • Models could be trained to detect the input language and then generate traces that match the high-value feature profile for that language.
  • Language-aware routers that pick reasoning strategies according to detected language and feature scores become a natural next engineering step.

Load-bearing premise

The defined set of measurable features captures the main drivers of successful reasoning without leaving out major hidden factors, and the statistical links found can be turned into useful selection rules.

What would settle it

Selecting traces that score highest on the measured features at test time fails to raise accuracy in any language or model tested.

Figures

Figures reproduced from arXiv: 2604.04720 by Dayeon Ki, Kevin Duh, Marine Carpuat.

Figure 1. Overview of our method. We define 16 measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow. We estimate each feature's effect on accuracy y via regression (Feature Analysis), validate and discover additional features using sparse autoencoders (SAE Analysis), and use these features to select reasoning traces at inference (Test-Time Selection). This raises a centra…

Figure 2. English versus non-English feature analysis results.

Figure 3. Per-language feature analysis results. Top: MGSM-Rev2; bottom: AIME. Raw feature values and accuracies for each language are provided in Appendix D.4. MGSM-Rev2 queries are easy enough in English that LRMs frequently solve problems via latent reasoning with minimal reliance on explicit trace behaviors. Indeed, this is consistent with prior evidence that models can compute answers directly in their latent r…

Figure 4. Pass@1 per model using each feature as test-time selection policy.

Figure 5. Prompt templates used for sampling generations for each language.

Figure 6. Per-language feature analysis results with multivariate logistic regression.

Figure 7. Screenshots of task instructions provided to human annotators.
original abstract

Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that effective multilingual reasoning in large reasoning models (LRMs) cannot be assumed to resemble English reasoning. It defines a suite of measurable features spanning multilingual alignment, reasoning step counts, and flow patterns in model-generated traces. Logistic regression quantifies each feature's association with final-answer accuracy; sparse autoencoders (SAEs) are trained to discover latent reasoning concepts; and the features are deployed as test-time selection policies. Experiments across two mathematical reasoning benchmarks, four LRMs, and 10 languages show that most features are positively associated with accuracy, yet the strength of these associations varies substantially across languages and can reverse in some cases, challenging English-centric reward and benchmark designs.

Significance. If the reported associations prove robust to language-specific confounds, the work would be significant for multilingual NLP and reasoning research. It supplies concrete, measurable features and an SAE-based discovery pipeline that can be reused for analyzing non-English traces, while providing direct evidence that uniform English-derived metrics are insufficient. The test-time selection results offer a practical starting point for adaptive objectives, with clear implications for multilingual benchmark construction and reward modeling.

major comments (2)
  1. [Methods / logistic regression analysis] The logistic regressions that underpin the central claim (varying and reversing feature-accuracy associations) do not report language fixed effects, language-by-feature interaction terms, or explicit normalization for tokenization and generation statistics that differ systematically by language (e.g., longer sequences in agglutinative languages). Without these controls, the observed cross-language variation may be driven by surface-level artifacts rather than reasoning quality, directly threatening the headline finding.
  2. [Feature definition and experimental setup] Exact operational definitions of the core measurable features (multilingual alignment, reasoning step counts, flow patterns), including preprocessing, data-selection criteria, and any post-hoc choices, are insufficiently specified. The absence of error bars, statistical significance tests, or robustness checks for the regressions further reduces verifiability of the reported associations and reversals.
minor comments (2)
  1. [Abstract / Methods] The abstract and methods would benefit from an explicit enumeration of the full feature set and the precise benchmarks/models used, to improve reproducibility.
  2. [SAE results] Figure captions and SAE analysis sections could clarify how discovered latent concepts map back to the hand-defined features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where we agree that revisions are needed to strengthen the analysis and where we provide additional clarification or defense of our approach.

point-by-point responses
  1. Referee: [Methods / logistic regression analysis] The logistic regressions that underpin the central claim (varying and reversing feature-accuracy associations) do not report language fixed effects, language-by-feature interaction terms, or explicit normalization for tokenization and generation statistics that differ systematically by language (e.g., longer sequences in agglutinative languages). Without these controls, the observed cross-language variation may be driven by surface-level artifacts rather than reasoning quality, directly threatening the headline finding.

    Authors: We agree that language fixed effects and language-by-feature interactions would provide a more rigorous test of whether the reported associations hold after accounting for language-specific baselines. In the revised manuscript we will re-estimate all logistic regressions with language fixed effects included and with explicit interaction terms between each feature and language indicator variables. We will also add controls for sequence length (token count) and generation statistics (e.g., average tokens per step) to normalize for systematic differences across languages. These additions will be reported in a new table and discussed in the methods and results sections. We believe the core finding of varying and sometimes reversing associations will remain, but the controls will make the evidence substantially stronger. revision: yes
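The controls promised above can be sketched with a hand-built design matrix: the feature, a language dummy (fixed effect), their interaction, and a length covariate. All numbers are synthetic and the specification is one plausible choice, not the paper's revised analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 6000

# Synthetic traces from two languages with different baselines and effects.
is_ko = rng.integers(0, 2, n)          # language fixed effect (dummy)
feat = rng.normal(size=n)              # a reasoning-trace feature
length = rng.normal(size=n)            # token-count control (standardized)

# True model: feature helps in en (+1.5), hurts in ko (interaction -3.0).
logit = 0.3 - 0.5 * is_ko + 1.5 * feat - 3.0 * is_ko * feat + 0.2 * length
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Design matrix: [feature, ko dummy, feature x ko interaction, length].
X = np.column_stack([feat, is_ko, feat * is_ko, length])
model = LogisticRegression(C=10.0).fit(X, y)
b_feat, b_ko, b_inter, b_len = model.coef_[0]

print(f"feature effect (en): {b_feat:+.2f}")
print(f"feature x ko interaction: {b_inter:+.2f}")
print(f"implied effect in ko: {b_feat + b_inter:+.2f}")
```

A reversal survives the controls only if the feature's implied per-language effect (main effect plus interaction) still changes sign after the length covariate and fixed effect are in the model.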

  2. Referee: [Feature definition and experimental setup] Exact operational definitions of the core measurable features (multilingual alignment, reasoning step counts, flow patterns), including preprocessing, data-selection criteria, and any post-hoc choices, are insufficiently specified. The absence of error bars, statistical significance tests, or robustness checks for the regressions further reduces verifiability of the reported associations and reversals.

    Authors: We acknowledge that the current manuscript provides only high-level descriptions of the feature extraction pipelines. In the revision we will expand the methods section and add a dedicated appendix that gives precise operational definitions for each feature, including tokenization and preprocessing steps, exact data-selection filters, and any post-hoc decisions (e.g., threshold choices for alignment scores). We will also attach error bars (standard errors) to all logistic-regression coefficients, report p-values from Wald tests, and include a set of robustness checks (alternative specifications, subsample analyses, and permutation tests). These changes will be accompanied by updated figures and tables. revision: yes
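One of the promised robustness checks, a permutation test on a regression coefficient, can be sketched as follows. The data and effect size are synthetic; the null distribution comes from refitting after shuffling the correctness labels, which breaks any feature-accuracy link.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000

x = rng.normal(size=n).reshape(-1, 1)
p = 1 / (1 + np.exp(-1.5 * x[:, 0]))   # strong true effect for illustration
y = (rng.random(n) < p).astype(int)

def abs_coef(X, y):
    return abs(LogisticRegression().fit(X, y).coef_[0, 0])

observed = abs_coef(x, y)

# Null distribution: shuffled labels remove the true association.
null = np.array([abs_coef(x, rng.permutation(y)) for _ in range(200)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"observed |coef| = {observed:.2f}, permutation p = {p_value:.3f}")
```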

Circularity Check

0 steps flagged

No circularity: empirical feature-accuracy associations are measured, not derived by construction

full rationale

The paper defines a fixed suite of reasoning features (multilingual alignment, step counts, flow patterns) upfront, extracts them from model-generated traces, and applies logistic regression to measure associations with accuracy across languages and models. This is standard empirical analysis; the regression coefficients are outputs of the data, not forced by the feature definitions themselves. Test-time selection policies are a downstream application of the observed associations rather than a claim that reduces to the inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central finding (varying association strengths, including reversals) is falsifiable and does not equate to its own measurement procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's core contribution is the construction of new measurable features rather than fitting free parameters or postulating new entities. It relies on standard assumptions of logistic regression and autoencoder training.

axioms (2)
  • domain assumption Logistic regression can quantify associations between defined reasoning features and final answer accuracy
    Invoked when using regression to measure how each feature associates with accuracy.
  • domain assumption Sparse autoencoders can discover latent reasoning concepts that instantiate or extend the defined features
    Invoked when training SAEs over multilingual traces.
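The second assumption can be made concrete with a minimal sparse autoencoder: a one-hidden-layer ReLU autoencoder with an L1 penalty on the latent activations, trained here on synthetic data drawn from a sparse dictionary. This is an illustrative toy, not the paper's architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 8, 16, 512          # input dim, overcomplete latent dim, samples

# Synthetic "activations": sparse nonnegative codes times a random dictionary.
D = rng.normal(scale=0.5, size=(m, d))
codes = rng.random((n, m)) * (rng.random((n, m)) < 0.15)   # ~15% active
X = codes @ D

We = 0.1 * rng.normal(size=(d, m)); be = np.zeros(m)       # encoder
Wd = 0.1 * rng.normal(size=(m, d)); bd = np.zeros(d)       # decoder
lam, lr = 1e-3, 0.05

losses = []
for step in range(300):
    pre = X @ We + be
    h = np.maximum(pre, 0.0)               # sparse latent "concepts"
    Xhat = h @ Wd + bd
    err = Xhat - X
    loss = (err ** 2).sum(1).mean() + lam * np.abs(h).sum(1).mean()
    losses.append(loss)

    # Gradients derived by hand for the loss above.
    g_xhat = 2 * err / n
    g_Wd = h.T @ g_xhat
    g_bd = g_xhat.sum(0)
    g_h = g_xhat @ Wd.T + lam * np.sign(h) / n
    g_pre = g_h * (pre > 0)
    g_We = X.T @ g_pre
    g_be = g_pre.sum(0)
    for param, grad in [(We, g_We), (be, g_be), (Wd, g_Wd), (bd, g_bd)]:
        param -= lr * grad

print(f"reconstruction+sparsity loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each latent unit that stays active on a coherent subset of traces is a candidate "concept"; interpreting those units against the hand-defined features is the step the paper's SAE analysis performs.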

pith-pipeline@v0.9.0 · 5503 in / 1297 out tokens · 44211 ms · 2026-05-10T19:13:54.145738+00:00 · methodology

discussion (0)



    unknown: Use only if the sentence does not fit any of the above tags or is purely stylistic or semantic. Dependencies: For each sentence, include a list of earlier sentence indices that the reasoning in this sentence uses. For example: - If sentence 9 performs a computation based on a plan in sentence 4 and a recalled rule in sentence 5, then depends on: ...