Token-Level LLM Collaboration via FusionRoute
Pith reviewed 2026-05-22 12:21 UTC · model grok-4.3
The pith
A router that selects an expert LLM at each token and adds its own corrective logit expands the reachable policies beyond what pure routing can achieve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pure expert-only routing at the token level is fundamentally limited: it cannot realize the optimal decoding policy unless the experts satisfy strong global coverage conditions. Adding a trainable complementary generator, whose logits are added to the selected expert's next-token logits, expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, the paper reports higher performance than sequence-level and token-level collaboration baselines, model merging, and direct fine-tuning across multiple model families and task domains.
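A short worked identity may make the policy-class claim concrete. It is a sketch under stated assumptions, not the paper's theorem: the symbols $\pi_k$, $z_k$, $c$, $\pi^\star$, and the two set names are illustrative notation introduced here, the correction $c$ is assumed to be an unrestricted vector over the vocabulary $V$, and $\pi^\star(\cdot \mid s)$ is assumed to have full support so that its logarithm is finite.

```latex
\begin{align*}
\Pi_{\mathrm{route}}(s) &= \bigl\{\pi_1(\cdot \mid s), \dots, \pi_K(\cdot \mid s)\bigr\},
  \qquad \pi_k(\cdot \mid s) = \operatorname{softmax}\bigl(z_k(s)\bigr), \\
\Pi_{\mathrm{fuse}}(s) &= \bigl\{\operatorname{softmax}\bigl(z_k(s) + c\bigr) : k \in \{1,\dots,K\},\ c \in \mathbb{R}^{|V|}\bigr\}, \\
c^\star(v) &= \log \pi^\star(v \mid s) - \log \pi_k(v \mid s), \\
\operatorname{softmax}\bigl(z_k(s) + c^\star\bigr)(v)
  &= \frac{\pi_k(v \mid s)\,\dfrac{\pi^\star(v \mid s)}{\pi_k(v \mid s)}}
          {\sum_{u \in V} \pi_k(u \mid s)\,\dfrac{\pi^\star(u \mid s)}{\pi_k(u \mid s)}}
   = \pi^\star(v \mid s).
\end{align*}
```

Taking $c = 0$ shows that $\Pi_{\mathrm{route}}(s) \subseteq \Pi_{\mathrm{fuse}}(s)$, and the last line shows that an idealized correction reaches $\pi^\star$ from any selected expert. A learned, lightweight generator is parameterized and need not attain $c^\star$ at every state, which is the gap the paper's mild conditions are meant to cover.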
What carries the argument
The complementary generator inside the router, which produces a logit vector added directly to the selected expert's next-token logits at every decoding step.
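The mechanism can be sketched for a single decoding step as follows. This is a minimal illustration, not the authors' implementation: the function name `fusion_step`, the greedy argmax over router scores, the unscaled addition, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fusion_step(expert_logits, router_scores, complementary_logits):
    """One decoding step (hypothetical helper, not from the paper).

    expert_logits:        (num_experts, vocab) next-token logits of frozen experts
    router_scores:        (num_experts,) router preference for each expert
    complementary_logits: (vocab,) correction produced by the router
    """
    # Assumption: the router picks the highest-scoring expert greedily.
    k = int(np.argmax(router_scores))
    # Assumption: the correction is added to the chosen expert's logits unscaled.
    fused = expert_logits[k] + complementary_logits
    return k, softmax(fused)

# Toy inputs: two experts, a three-token vocabulary.
expert_logits = np.array([[2.0, 0.0, 0.0],
                          [0.0, 2.0, 0.0]])
router_scores = np.array([0.2, 0.8])
complementary_logits = np.array([0.0, 0.0, 3.0])

k, probs = fusion_step(expert_logits, router_scores, complementary_logits)
print("selected expert:", k)
print("fused next-token distribution:", probs)
```

In this toy case neither expert prefers the third token, yet the fused distribution does, which is the kind of outcome that selecting among fixed expert outputs alone cannot produce.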
If this is right
- The method outperforms both sequence-level and token-level collaboration baselines on mathematical reasoning, code generation, and instruction following.
- It also exceeds model merging and direct fine-tuning of a single model while staying competitive with specialized domain experts.
- The gains appear across both Llama-3 and Gemma-2 model families without requiring changes to the underlying expert architectures.
- The framework remains efficient because the router is lightweight and operates at the token level during decoding.
Where Pith is reading between the lines
- The same router-plus-correction pattern could be tested on mixtures that include models from different training regimes or sizes to see whether the policy-class expansion still holds.
- If the complementary logits prove especially useful on out-of-distribution prompts, the approach might reduce reliance on ever-larger single models for broad coverage.
- One could measure how much of the gain comes from the correction term versus the routing decision by ablating each component separately on held-out tasks.
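The component ablation suggested in the last point could be organized along these lines. The sketch is hypothetical throughout: the helper `ablated_nll`, the fallback to a fixed `default_expert` when routing is switched off, the use of mean negative log-likelihood as the metric, and the random toy tensors are assumptions for illustration, whereas the paper reports benchmark scores.

```python
import numpy as np

def log_softmax(z):
    """Row-wise, numerically stable log-softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ablated_nll(expert_logits, router_scores, correction, targets,
                use_routing=True, use_correction=True, default_expert=0):
    """Mean negative log-likelihood of `targets` under one ablation setting.

    expert_logits: (T, K, V) logits of K frozen experts at T positions
    router_scores: (T, K)    router preference per position
    correction:    (T, V)    complementary logits per position
    targets:       (T,)      reference token ids
    """
    T = targets.shape[0]
    if use_routing:
        chosen = router_scores.argmax(axis=1)
    else:
        # Assumption: without routing, always use one fixed expert.
        chosen = np.full(T, default_expert)
    logits = expert_logits[np.arange(T), chosen]
    if use_correction:
        logits = logits + correction
    logp = log_softmax(logits)
    return float(-logp[np.arange(T), targets].mean())

# Random toy tensors, purely to show the call pattern (not real model outputs).
rng = np.random.default_rng(0)
T, K, V = 5, 2, 4
expert_logits = rng.normal(size=(T, K, V))
router_scores = rng.normal(size=(T, K))
correction = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)

for use_routing in (False, True):
    for use_correction in (False, True):
        nll = ablated_nll(expert_logits, router_scores, correction, targets,
                          use_routing=use_routing, use_correction=use_correction)
        print(f"routing={use_routing!s:5} correction={use_correction!s:5} nll={nll:.3f}")
```

Comparing the four settings on held-out data would separate the contribution of the routing decision from that of the correction term.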
Load-bearing premise
The mild conditions that allow the complementary generator to recover optimal value functions must actually hold during training and evaluation.
What would settle it
Removing the complementary logit addition and observing no performance drop on the same benchmarks would indicate that the added generator is not necessary for the reported gains.
Original abstract
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FusionRoute, a token-level multi-LLM collaboration framework in which a lightweight router both selects the most suitable expert at each decoding step and contributes a trainable complementary logit that is added to the expert's logits to refine the next-token distribution. It provides a theoretical analysis arguing that pure expert-only routing cannot realize the optimal decoding policy unless strong global coverage assumptions hold, while augmenting with the complementary generator expands the policy class to recover optimal value functions under mild conditions. Empirically, FusionRoute is shown to outperform sequence- and token-level collaboration, model merging, and direct fine-tuning across Llama-3 and Gemma-2 families on mathematical reasoning, code generation, and instruction-following benchmarks.
Significance. If the theoretical claims hold, the work offers a principled mechanism for token-level collaboration that addresses a fundamental limitation of routing-only approaches, potentially enabling more efficient use of specialized models without full merging or fine-tuning. The empirical results across multiple model families and task domains suggest practical gains in performance while remaining competitive with domain experts. The combination of theory and broad empirical evaluation strengthens the contribution relative to purely heuristic collaboration methods.
major comments (2)
- [§3] §3 (Theoretical Analysis): The mild conditions under which the complementary generator enables recovery of optimal value functions are stated at a high level but lack an explicit characterization or proof sketch showing they are strictly weaker than the global coverage assumptions critiqued for pure routing. It remains unclear whether the lightweight router's capacity suffices to approximate arbitrary corrections over the full vocabulary at every token without additional coverage or expressivity assumptions.
- [§5] §5 (Experiments): The reported outperformance lacks accompanying statistical significance tests (e.g., p-values or confidence intervals) and details on dataset sizes, splits, or variance across runs for the mathematical reasoning, code generation, and instruction-following benchmarks. This weakens the strength of the cross-method and cross-model claims.
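One way to supply the requested uncertainty estimates is a paired bootstrap over per-example scores. The sketch below is an assumption about how such a check might be run, not something taken from the paper: the function name `paired_bootstrap_ci`, the percentile interval, and the synthetic 0/1 scores are all illustrative.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example difference (A minus B).

    scores_a, scores_b: per-example scores for two methods on the SAME examples,
    e.g. 1.0 for a correct answer and 0.0 otherwise.
    Returns (mean_difference, ci_low, ci_high).
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("scores must be 1-D arrays of equal length")
    rng = np.random.default_rng(seed)
    n = a.size
    diffs = a - b
    # Resample example indices with replacement, keeping the pairing intact.
    idx = rng.integers(0, n, size=(num_resamples, n))
    boot_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(low), float(high)

# Synthetic correctness flags, made up purely to show the call pattern.
rng = np.random.default_rng(1)
method_a = (rng.random(200) < 0.70).astype(float)
method_b = (rng.random(200) < 0.60).astype(float)

mean_diff, low, high = paired_bootstrap_ci(method_a, method_b)
print(f"mean difference: {mean_diff:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support a claimed improvement on that benchmark; repeating the procedure across seeds or runs would address the variance question separately.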
minor comments (2)
- [§2] Notation for the router and complementary generator (e.g., how the logit addition is normalized or scaled) should be introduced earlier and used consistently in the method and theory sections to improve readability.
- [Figure 2] Figure 2 (or equivalent architecture diagram) would benefit from explicit labeling of the trainable components versus frozen experts to clarify the lightweight nature of the router.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to strengthen the theoretical analysis and experimental reporting.
- Referee: [§3] §3 (Theoretical Analysis): The mild conditions under which the complementary generator enables recovery of optimal value functions are stated at a high level but lack an explicit characterization or proof sketch showing they are strictly weaker than the global coverage assumptions critiqued for pure routing. It remains unclear whether the lightweight router's capacity suffices to approximate arbitrary corrections over the full vocabulary at every token without additional coverage or expressivity assumptions.
  Authors: We agree that the presentation of the mild conditions would benefit from greater explicitness. In the revised manuscript, we will add a proof sketch that formally characterizes these conditions (local per-step correction sufficiency between the expert policy and the optimal policy) and demonstrates they are strictly weaker than the global coverage assumptions required for pure routing. We will also clarify that the complementary generator is not required to approximate arbitrary corrections; it only needs to recover the specific residual needed for optimality under the mild conditions. Regarding router capacity, we will include a brief analysis showing that the lightweight trainable network suffices for these targeted corrections without demanding full expressivity over the vocabulary at every step.
  Revision: yes
- Referee: [§5] §5 (Experiments): The reported outperformance lacks accompanying statistical significance tests (e.g., p-values or confidence intervals) and details on dataset sizes, splits, or variance across runs for the mathematical reasoning, code generation, and instruction-following benchmarks. This weakens the strength of the cross-method and cross-model claims.
  Authors: We acknowledge that additional statistical details and reporting would strengthen the empirical claims. In the revised manuscript, we will add p-values and confidence intervals for the main performance comparisons. We will also include explicit information on dataset sizes, train/validation/test splits, and variance across multiple runs for the mathematical reasoning, code generation, and instruction-following benchmarks.
  Revision: yes
Circularity Check
No significant circularity; theoretical claims and empirical results remain independent of inputs.
The paper's derivation begins with the observation that pure expert routing cannot realize optimal policies without strong global coverage assumptions, then introduces a trainable complementary generator to expand the policy class and recover optimal value functions under milder conditions. This theoretical step is presented as an independent analysis rather than a definitional equivalence or fitted input renamed as prediction. No equations reduce the claimed recovery to the router's own parameters by construction, and no self-citation chain or ansatz smuggling is invoked to justify the mild conditions. Empirical outperformance on Llama-3 and Gemma-2 benchmarks across math, code, and instruction tasks supplies external falsifiability. The overall chain is therefore self-contained against external benchmarks, with the complementary logit serving as an additive correction whose benefit is tested rather than presupposed.
Forward citations
Cited by 1 Pith paper
- Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals
  Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.