Enhancing Multilingual Reasoning via Steerable Model Merging
Pith reviewed 2026-06-26 20:35 UTC · model grok-4.3
The pith
Gated cross-attention in model merging lets the system adaptively weight multilingual and reasoning models per input to resolve conflicts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that inserting a gated cross-attention mechanism into the merging pipeline allows the merged model to modulate the contribution of each source model in an adaptive, input-dependent manner, thereby resolving conflicts that arise under one-size-fits-all merging and producing higher accuracy on multilingual reasoning tasks across many languages.
What carries the argument
The gated cross-attention mechanism that weights or filters the attended representations from the two source models according to the current input.
If this is right
- ST-Merge produces higher scores than multiple strong baselines on four multilingual reasoning benchmarks.
- The gains hold across 21 languages.
- The method requires no task-specific fine-tuning beyond the initial merging step itself.
- The same steerable merging step can be applied to other pairs of models whose capabilities conflict on different inputs.
Where Pith is reading between the lines
- The gating could be generalized to merge more than two source models by adding additional cross-attention heads.
- If the gating proves stable, it might reduce reliance on very large fine-tuned models by allowing dynamic composition at inference time.
- Similar input-dependent routing ideas could be tested in other settings where model capabilities must be traded off, such as capability versus safety alignment.
Load-bearing premise
The gated cross-attention can reliably decide which source model to favor for every possible input without creating new failure modes or requiring further task-specific tuning.
What would settle it
A set of test inputs on which the ST-Merge model scores lower than either the unmerged multilingual model, the unmerged reasoning model, or a standard non-steerable merge.
Figures
read the original abstract
Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Steerable Model Merging (ST-Merge), a framework that augments standard model merging of a multilingual model and a reasoning model by inserting a gated cross-attention mechanism. This gate is intended to adaptively weight or filter the contributions of the two source models on a per-input basis, thereby resolving conflicts that arise under a one-size-fits-all merge. The central empirical claim is that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks spanning 21 languages.
Significance. If the reported gains are reproducible and generalize beyond the four benchmarks, the approach would supply a lightweight, post-hoc steering method for combining heterogeneous models without task-specific fine-tuning. The gated cross-attention idea directly targets a known limitation of static merging and could be relevant to other multi-capability composition settings.
major comments (2)
- [Abstract] Abstract: the claim that ST-Merge 'consistently outperforms multiple strong baselines' is presented without naming the baselines, the four benchmarks, the evaluation metrics, error bars, or any statistical tests. Because the central empirical claim cannot be assessed from the given text, the soundness of the result remains unevaluable.
- [Abstract] Abstract (method description): the gated cross-attention is asserted to 'weight or filter the two attended source models in an adaptive manner,' yet no information is supplied on the gate's training distribution, no failure-case analysis is described (e.g., inputs where the gate selects the inferior source model), and no comparison is reported showing that the steered model never falls below the better of the two source models alone. These omissions directly affect the weakest assumption that the mechanism reliably resolves conflicts on arbitrary inputs.
Simulated Author's Rebuttal
We thank the referee for the detailed comments on the abstract. The points raised are valid regarding the level of detail provided for assessing the central claims. We will revise the abstract to incorporate the requested specifics while maintaining conciseness. Below we respond to each major comment.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that ST-Merge 'consistently outperforms multiple strong baselines' is presented without naming the baselines, the four benchmarks, the evaluation metrics, error bars, or any statistical tests. Because the central empirical claim cannot be assessed from the given text, the soundness of the result remains unevaluable.
Authors: We agree that the abstract would benefit from greater specificity to allow immediate evaluation of the empirical claim. In the revised version we will name the baselines (standard model merging, task arithmetic, and TIES), specify the four benchmarks, note the primary metric (accuracy), and reference the error bars and statistical tests reported in the experimental sections. revision: yes
-
Referee: [Abstract] Abstract (method description): the gated cross-attention is asserted to 'weight or filter the two attended source models in an adaptive manner,' yet no information is supplied on the gate's training distribution, no failure-case analysis is described (e.g., inputs where the gate selects the inferior source model), and no comparison is reported showing that the steered model never falls below the better of the two source models alone. These omissions directly affect the weakest assumption that the mechanism reliably resolves conflicts on arbitrary inputs.
Authors: The abstract summarizes the method at a high level; full details on the gate's training (a held-out mixture of multilingual reasoning examples) appear in Section 3.2. We will add a brief clause to the abstract describing the training distribution. The main results already show ST-Merge outperforming both source models individually across all benchmarks, providing evidence that conflicts are resolved in the evaluated settings. An explicit per-input failure-case breakdown and a formal guarantee against falling below the stronger source are not present; we will add a short discussion of these points and, if space allows, an auxiliary experiment in the revision. revision: partial
Circularity Check
No circularity: empirical method with no derivation or fitted predictions
full rationale
The paper proposes an empirical ST-Merge framework that introduces a gated cross-attention mechanism and validates it via experiments on multilingual reasoning benchmarks. No equations, mathematical derivations, parameter-fitting steps presented as predictions, or self-citation chains that reduce the central claim to its own inputs are present. The outperformance is demonstrated through direct comparison to baselines rather than by construction, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The Eleventh International Conference on Learning Representations,
Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =
2023
-
[2]
Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman , title =. CoRR , volume =. 2021 , url =. 2110.14168 , timestamp =
Pith/arXiv arXiv 2021
-
[3]
MindMerger: Efficiently Boosting LLM Reasoning in non-English Languages , url =
Huang, Zixian and Zhu, Wenhao and Cheng, Gong and Li, Lei and Yuan, Fei , booktitle =. MindMerger: Efficiently Boosting LLM Reasoning in non-English Languages , url =
-
[4]
L ang B ridge: Multilingual Reasoning Without Multilingual Supervision
Yoon, Dongkeun and Jang, Joel and Kim, Sungdong and Kim, Seungone and Shafayat, Sheikh and Seo, Minjoon. L ang B ridge: Multilingual Reasoning Without Multilingual Supervision. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.405
-
[5]
Kwok and Zhenguo Li and Adrian Weller and Weiyang Liu , title =
Longhui Yu and Weisen Jiang and Han Shi and Jincheng Yu and Zhengying Liu and Yu Zhang and James T. Kwok and Zhenguo Li and Adrian Weller and Weiyang Liu , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
2024
-
[6]
Question Translation Training for Better Multilingual Reasoning
Zhu, Wenhao and Huang, Shujian and Yuan, Fei and She, Shuaijie and Chen, Jiajun and Birch, Alexandra. Question Translation Training for Better Multilingual Reasoning. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.498
-
[7]
Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen
Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen. LoRA: Low-Rank Adaptation of Large Language Models , booktitle =. 2022 , url =
2022
-
[8]
Nuo Chen and Zinan Zheng and Ning Wu and Ming Gong and Dongmei Zhang and Jia Li , editor =. Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations , booktitle =. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-EMNLP.411 , timestamp =
-
[9]
Lin, Bill Yuchen and Lee, Seyeon and Qiao, Xiaoyang and Ren, Xiang. Common Sense Beyond E nglish: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long...
-
[10]
XNLI : Evaluating Cross-lingual Sentence Representations
Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel and Schwenk, Holger and Stoyanov, Veselin. XNLI : Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1269
-
[11]
Are NLP Models really able to Solve Simple Math Word Problems?
Patel, Arkil and Bhattamishra, Satwik and Goyal, Navin. Are NLP Models really able to Solve Simple Math Word Problems?. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.168
work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
-
[12]
m T 5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin. m T 5: A Massively Multilingual Pre-trained Text-to-Text Transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2...
-
[13]
Manning and Stefano Ermon and Chelsea Finn , editor =
Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn , editor =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , booktitle =. 2023 , url =
2023
-
[14]
Terry , journal =
Ralph Allan Bradley and Milton E. Terry , journal =. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons , urldate =
-
[15]
MAPO : Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization
She, Shuaijie and Zou, Wei and Huang, Shujian and Zhu, Wenhao and Liu, Xiang and Geng, Xiang and Chen, Jiajun. MAPO : Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.539
-
[16]
The Thirteenth International Conference on Learning Representations,
Lucas Bandarkar and Benjamin Muller and Pritish Yuvraj and Rui Hou and Nayan Singhal and Hongjiang Lv and Bing Liu , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =
2025
-
[17]
Orca 2: Teaching Small Language Models How to Reason , journal =
Arindam Mitra and Luciano Del Corro and Shweti Mahajan and Andr. Orca 2: Teaching Small Language Models How to Reason , journal =. 2023 , url =. doi:10.48550/ARXIV.2311.11045 , eprinttype =. 2311.11045 , timestamp =
-
[18]
In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yiyang Du and Xiaochen Wang and Chi Chen and Jiabo Ye and Yiru Wang and Peng Li and Ming Yan and Ji Zhang and Fei Huang and Zhifang Sui and Maosong Sun and Yang Liu , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00879 , timestamp =
-
[19]
Model Composition for Multimodal Large Language Models , booktitle =
Chi Chen and Yiyang Du and Zheng Fang and Ziyue Wang and Fuwen Luo and Peng Li and Ming Yan and Ji Zhang and Fei Huang and Maosong Sun and Yang Liu , editor =. Model Composition for Multimodal Large Language Models , booktitle =. 2024 , url =. doi:10.18653/V1/2024.ACL-LONG.606 , timestamp =
-
[20]
An Empirical Study of Multimodal Model Merging , booktitle =
Yi. An Empirical Study of Multimodal Model Merging , booktitle =. 2023 , url =. doi:10.18653/V1/2023.FINDINGS-EMNLP.105 , timestamp =
-
[21]
John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov , title =. CoRR , volume =. 2017 , url =. 1707.06347 , timestamp =
Pith/arXiv arXiv 2017
-
[22]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton. Llama 2: Open Foundation and Fine-Tuned Chat Models , journal =. 2023 , url =. doi:10.48550/ARXIV.2307.09288 , eprinttype ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
-
[23]
Ruan, Zhiwen and Li, Yixia and Zhu, He and Wang, Longyue and Luo, Weihua and Zhang, Kaifu and Chen, Yun and Chen, Guanhua. L ay A lign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.81
-
[24]
Attention is All you Need , url =
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
-
[25]
L ego- MT : Learning Detachable Models for Massively Multilingual Machine Translation
Yuan, Fei and Lu, Yinquan and Zhu, Wenhao and Kong, Lingpeng and Li, Lei and Qiao, Yu and Xu, Jingjing. L ego- MT : Learning Detachable Models for Massively Multilingual Machine Translation. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.731
-
[26]
Gated-Attention Architectures for Task-Oriented Language Grounding , booktitle =
Devendra Singh Chaplot and Kanthashree Mysore Sathyendra and Rama Kumar Pasumarthi and Dheeraj Rajagopal and Ruslan Salakhutdinov , editor =. Gated-Attention Architectures for Task-Oriented Language Grounding , booktitle =. 2018 , url =. doi:10.1609/AAAI.V32I1.11832 , timestamp =
-
[27]
David Ortiz. CogniAlign: Word-level multimodal speech alignment with gated cross-attention for Alzheimer's detection , journal =. 2025 , url =. doi:10.1016/J.KNOSYS.2025.114264 , timestamp =
-
[28]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu and Zekun Wang and Bo Zheng and Zeyu Huang and Kaiyue Wen and Songlin Yang and Rui Men and Le Yu and Fei Huang and Suozhi Huang and Dayiheng Liu and Jingren Zhou and Junyang Lin , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.06708 , eprinttype =. 2505.06708 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.06708 2025
-
[29]
In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Boseung Jeong and Jicheol Park and Sungyeon Kim and Suha Kwak , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.02440 , timestamp =
-
[30]
An Interpretable Framework for Drug-Target Interaction with Gated Cross Attention , booktitle =
Yeachan Kim and Bonggun Shin , editor =. An Interpretable Framework for Drug-Target Interaction with Gated Cross Attention , booktitle =. 2021 , url =
2021
-
[31]
Jun. Leaky Gated Cross-Attention for Weakly Supervised Multi-Modal Temporal Action Localization , booktitle =. 2022 , url =. doi:10.1109/WACV51458.2022.00089 , timestamp =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.