AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought

Ai Ti Aw; Bowei Zou; Nancy F. Chen; Roy Ka-Wei Lee; Tarun Kumar Vangani; Weihua Zheng; Xin Huang; Xiyan Tao; Yuhao Wu; Zhengyuan Liu

arxiv: 2501.16154 · v4 · submitted 2025-01-27 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-23 04:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multilingual reasoning · chain-of-thought · cross-lingual consistency · low-resource languages · adaptive routing · factual reasoning · language models

The pith

AdaMCoT improves cross-lingual factual reasoning by routing thoughts through selected intermediary languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AdaMCoT as a way to handle uneven multilingual performance in large language models caused by imbalanced training data. It dynamically routes chain-of-thought reasoning through intermediary thinking languages chosen by a reward-based mechanism before producing the final answer in the target language. This uses the model's existing language-agnostic core and requires no extra pretraining or large-scale translation. A sympathetic reader would care because it promises better factual accuracy and consistency in low-resource languages through a lightweight adaptation rather than costly retraining.

Core claim

AdaMCoT enhances multilingual factual reasoning by dynamically routing thought processes in intermediary thinking languages before generating target-language responses. It leverages a language-agnostic core and incorporates an adaptive reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model's hidden states and semantic space further elucidates the underlying mechanism.

What carries the argument

The adaptive reward-based mechanism that selects optimal intermediary reasoning languages for chain-of-thought before target response generation.
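
As a rough illustration of the selection step, the sketch below scores one candidate reasoning trace per thinking language and keeps the highest-scoring one. Everything in it is an assumption made for illustration: the names `Candidate`, `select_pathway`, and `toy_reward` are hypothetical, and the paper's actual reward signal and training procedure are not described in the text reviewed here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One reasoning trace produced in a given thinking language (hypothetical type)."""
    thinking_language: str
    reasoning: str
    answer: str


def select_pathway(
    candidates: Sequence[Candidate],
    reward: Callable[[Candidate], float],
) -> Candidate:
    """Return the candidate with the highest reward; ties go to the earliest one."""
    if not candidates:
        raise ValueError("at least one candidate pathway is required")
    return max(candidates, key=reward)


def toy_reward(candidate: Candidate) -> float:
    """Stand-in reward: 1.0 if the answer matches an assumed gold label, else 0.0."""
    gold = "B"  # assumed gold label for the toy question
    return 1.0 if candidate.answer == gold else 0.0


# Toy candidates, one per thinking language (placeholder reasoning text).
candidates = [
    Candidate("ms", "(reasoning written in Malay)", "A"),
    Candidate("en", "(reasoning written in English)", "B"),
    Candidate("zh", "(reasoning written in Chinese)", "B"),
]
best = select_pathway(candidates, toy_reward)
print(best.thinking_language)  # prints: en
```

In this toy setup the reward is a plain correctness check, so it only makes sense where gold labels exist; a learned reward model would take the place of `toy_reward` at inference time.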

If this is right

Factual reasoning quality rises across benchmarks while preserving linguistic nuances.
Cross-lingual consistency improves without additional pretraining or translation steps.
Performance gaps narrow between high-resource and low-resource languages.
Hidden-state analysis reveals how the language-agnostic core supports the selected pathways.
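
One simple way to make "cross-lingual consistency" concrete is to count how often the same question receives the same answer in two different languages. The sketch below does this over all language pairs. It is an assumed metric for illustration, not necessarily the one the paper reports, and the function name `pairwise_consistency` and the toy answers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_consistency(answers_by_language: Dict[str, List[str]]) -> float:
    """Fraction of (question, language-pair) comparisons with identical answers."""
    languages = sorted(answers_by_language)
    if len(languages) < 2:
        raise ValueError("need answers in at least two languages")
    n_questions = len(answers_by_language[languages[0]])
    if n_questions == 0:
        raise ValueError("need at least one question")
    if any(len(answers_by_language[lang]) != n_questions for lang in languages):
        raise ValueError("every language must answer the same questions")

    agreements = 0
    comparisons = 0
    for lang_a, lang_b in combinations(languages, 2):
        for ans_a, ans_b in zip(answers_by_language[lang_a], answers_by_language[lang_b]):
            agreements += int(ans_a == ans_b)
            comparisons += 1
    return agreements / comparisons


# Toy multiple-choice answers to the same three questions in three languages.
answers = {
    "en": ["A", "B", "C"],
    "ms": ["A", "B", "D"],
    "ta": ["A", "C", "D"],
}
score = pairwise_consistency(answers)
print(round(score, 3))  # prints: 0.556
```

Note that this measure rewards agreement even when every language agrees on a wrong answer, which is why it is usually read alongside accuracy rather than instead of it.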

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same routing approach could extend to non-factual tasks such as mathematical problem solving.
Explicit translation pipelines may become less necessary for cross-lingual reasoning transfer.
Joint optimization of the reward model with the base LLM could further strengthen selection accuracy.
Similar adaptive selection might reduce the need for language-specific fine-tuning in other multilingual applications.

Load-bearing premise

A reward-based adaptive mechanism can reliably identify and utilize optimal intermediary reasoning languages to bridge performance gaps based solely on the language-agnostic core of the model.

What would settle it

A controlled test showing that forcing the model to reason in its native language or a fixed high-resource language yields equal or higher factual accuracy than the adaptive selection process.
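
The comparison described above can be sketched as a small harness that scores each routing condition against the same gold answers. The condition names and the predictions are invented placeholders, not results from the paper; a real test would also need a significance check over many more questions.

```python
from typing import Dict, Mapping, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Share of predictions that exactly match the gold answers."""
    if not gold:
        raise ValueError("gold answers must not be empty")
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def compare_conditions(
    predictions_by_condition: Mapping[str, Sequence[str]],
    gold: Sequence[str],
) -> Dict[str, float]:
    """Accuracy for every routing condition on the same question set."""
    return {name: accuracy(preds, gold) for name, preds in predictions_by_condition.items()}


# Placeholder data only: four questions, three routing conditions.
gold = ["A", "C", "B", "D"]
runs = {
    "native": ["A", "B", "B", "A"],         # reason in the input language
    "fixed_english": ["A", "C", "B", "A"],  # always reason in English
    "adaptive": ["A", "C", "B", "D"],       # reason in the selected language
}
scores = compare_conditions(runs, gold)
print(scores)  # prints: {'native': 0.5, 'fixed_english': 0.75, 'adaptive': 1.0}

# The claim is undermined if any fixed condition matches or beats adaptive routing.
adaptive_strictly_best = all(
    scores["adaptive"] > value for name, value in scores.items() if name != "adaptive"
)
print(adaptive_strictly_best)  # prints: True
```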

Figures

Figures reproduced from arXiv: 2501.16154 by Ai Ti Aw, Bowei Zou, Nancy F. Chen, Roy Ka-Wei Lee, Tarun Kumar Vangani, Weihua Zheng, Xin Huang, Xiyan Tao, Yuhao Wu, Zhengyuan Liu.

**Figure 1.** Overview of the AdaMCOT framework. The input in the example shown in the figure is a question in Malay. … that the final answer maintains linguistic consistency with the input instruction while preserving the diverse knowledge embedded in the intermediate reasoning processes, we introduce an integration model (GPT-4o) to transform the intermediate reasoning into a final answer in the target language. Forma…

**Figure 2.** Distribution of Reasoning Pathway Selections on the mTruth Dataset: LLaMA3.1-8B-…

**Figure 3.** Logit lens maps for Llama-3.1-8B-AdaMCOT (a and b) and answers of Llama-3.1-8B-Instruction and Llama-3.1-8B-AdaMCOT (c), when prompted with “Which countries have longer working hours than Mexico? A.Mexicans work long hours, but Cambodians might work even longer ... F: Mexicans work fewer hours than the Chinese.” in Chinese. (a) and (b) highlight key changes in the model’s intermediate layers with green box…

**Figure 4.** Distribution of Reasoning Pathway Selections on the mTruth Dataset: Qwen2.5-7B-AdaMCOT vs. Base

**Figure 5.** Distribution of Reasoning Pathway Selections on the CrossMMLU & CrossLogiQA Dataset: Qwen2.5-…

**Figure 6.** Logit lens maps for Llama-3.1-8B-Instruction, when prompted with question …

**Figure 7.** Logit lens maps for Qwen2.5-7B-AdaMCOT, when prompted with question Which animals can add up numbers just as well as humans? A: No animals can add up numbers just as well as humans ... K: Parrots can add up numbers just as well as humans

**Figure 8.** UMAP Visualization of Semantic Spaces of Llama-3.1-8B-Instruction and Llama-3.1-8B-AdaMCOT

**Figure 9.** UMAP Visualization of Semantic Spaces of Qwen-2.5-7B-Instruction and Qwen-2.5-7B-AdaMCOT
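
For readers unfamiliar with the logit lens maps in Figures 3, 6, and 7, the idea is to project each layer's hidden state through the model's output (unembedding) matrix and read off the most likely token at that depth. The sketch below shows the core projection on toy numbers in plain Python. The vocabulary, matrix, and hidden states are made up, the function name `logit_lens` is hypothetical, and the final layer normalization that a real model applies before unembedding is omitted.

```python
from typing import List, Sequence


def logit_lens(
    hidden_states: Sequence[Sequence[float]],
    unembedding: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> List[str]:
    """For each layer's hidden vector, return the vocabulary token with the largest logit."""
    decoded = []
    for hidden in hidden_states:
        # One logit per vocabulary row: dot product of that row with the hidden vector.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
        best_index = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best_index])
    return decoded


# Toy setup: 3-token vocabulary, 2-dimensional hidden states, 3 layers.
vocab = ["zh_token", "en_token", "other"]
unembedding = [
    [1.0, 0.0],    # row for "zh_token"
    [0.0, 1.0],    # row for "en_token"
    [-1.0, -1.0],  # row for "other"
]
hidden_states = [
    [0.9, 0.1],    # early layer
    [0.2, 0.9],    # middle layer
    [-0.5, -0.6],  # late layer
]
print(logit_lens(hidden_states, unembedding, vocab))
# prints: ['zh_token', 'en_token', 'other']
```

A per-layer readout like this is what lets a map show the language of the top token shifting across depth.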

Original abstract

Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. Although these models show strong reasoning abilities, their performance varies significantly between languages due to the imbalanced distribution of training data. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaMCOT (Adaptive Multilingual Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary "thinking languages" before generating target-language responses. AdaMCOT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model's hidden states and semantic space further elucidates the underlying mechanism of our method. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high and low-resource languages while maintaining cultural and linguistic nuances.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaMCoT's adaptive intermediary-language routing for CoT is a practical idea but the abstract supplies zero numbers or baselines to judge whether it delivers.

The main takeaway is that this paper proposes routing chain-of-thought through dynamically chosen intermediary languages via a reward-based selector on a language-agnostic core, without any additional pretraining, to close gaps in low-resource factual reasoning. If the experiments actually show this works, it would be a useful engineering step rather than a big conceptual leap.

The framing of the problem with existing translation and tuning methods is clear and the hidden-state analysis angle is a reasonable way to probe the mechanism. The focus on maintaining nuances while bridging resource gaps is also on target.

The central weakness is that the abstract asserts substantial gains in quality and consistency but gives no benchmarks, baselines, error breakdowns, or even rough effect sizes. Without those, the reward selector's ability to reliably pick optimal paths remains an untested assumption rather than demonstrated evidence. The full paper may contain the missing results, but nothing in the provided text lets a reader assess whether the gains exceed simpler prompting or fine-tuning alternatives.

This is aimed at researchers working on multilingual LLM reasoning and low-resource settings. Someone already following CoT and routing work could extract the method description, but the lack of data limits immediate value. I would send it for peer review because the problem is real and the approach is grounded enough to deserve referee scrutiny if the experiments are solid.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces AdaMCoT, a framework for adaptive multilingual chain-of-thought reasoning in LLMs. It dynamically routes thought processes through intermediary thinking languages using a reward-based adaptive mechanism on a language-agnostic core, without additional pretraining. The paper claims this leads to substantial improvements in factual reasoning quality and cross-lingual consistency, especially for low-resource languages, as shown in comprehensive evaluations on multiple benchmarks, with additional analysis of hidden states and semantic space.

Significance. If the empirical results hold, the method offers a potentially scalable alternative to resource-intensive multilingual pretraining by leveraging adaptive routing within existing model capabilities. The focus on low-resource language gains and the inclusion of hidden-state analysis for mechanistic insight are notable strengths that could inform future work on cross-lingual consistency.

major comments (1)

[Abstract] The central claim of 'substantial improvements in both factual reasoning quality and cross-lingual consistency' with 'particularly strong performance gains in low-resource language settings' is asserted without any quantitative results, specific benchmarks, baselines, error analysis, or tables. This absence makes the primary empirical contribution unevaluable from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address it point-by-point below.

Referee: [Abstract] The central claim of 'substantial improvements in both factual reasoning quality and cross-lingual consistency' with 'particularly strong performance gains in low-resource language settings' is asserted without any quantitative results, specific benchmarks, baselines, error analysis, or tables. This absence makes the primary empirical contribution unevaluable from the provided text.

Authors: We agree that the abstract would benefit from including concrete quantitative highlights to make the empirical claims immediately evaluable. The full manuscript reports detailed results across benchmarks (including XNLI, MMLU, and others), with comparisons to baselines and analysis of low-resource gains, but these are not summarized numerically in the abstract. We will revise the abstract to incorporate key performance metrics, such as average accuracy improvements and specific gains for low-resource languages, while keeping the abstract concise.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The abstract and high-level description contain no equations, derivations, fitted parameters presented as predictions, or self-citation chains. The framework is introduced conceptually as leveraging existing language-agnostic core capabilities with an adaptive reward mechanism, without any reduction of outputs to inputs by construction. Evaluation is described as empirical across benchmarks, with no load-bearing mathematical steps that could exhibit circularity. This matches the default expectation for papers lacking derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5766 in / 1004 out tokens · 25869 ms · 2026-05-23T04:52:40.571285+00:00 · methodology
