pith. sign in

arxiv: 2601.05106 · v4 · pith:IQKPYCCRnew · submitted 2026-01-08 · 💻 cs.AI · cs.CL· cs.LG

Token-Level LLM Collaboration via FusionRoute

Pith reviewed 2026-05-22 12:21 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords LLM collaborationtoken-level routingexpert selectionlogit additionmulti-model decodingpolicy recoverycomplementary generator
0
0 comments X

The pith

A router that selects an expert LLM at each token and adds its own corrective logit expands the reachable policies beyond what pure routing can achieve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a token-level collaboration method in which a lightweight router both chooses the best expert model for the next token and supplies an additional logit vector that is added to the expert's output distribution. This dual role addresses a limitation in standard routing: without strong assumptions that the experts together cover every possible optimal choice, simply switching between fixed expert outputs cannot reach the best decoding policy. By training the router to generate these corrections, the approach enlarges the set of policies that can be realized and recovers optimal value functions under milder conditions. Experiments across Llama-3 and Gemma-2 models show gains on mathematical reasoning, code generation, and instruction-following benchmarks compared with pure routing, sequence-level mixing, model merging, and direct fine-tuning.

Core claim

Pure expert-only routing at the token level is fundamentally limited because it cannot realize the optimal decoding policy unless the experts satisfy strong global coverage conditions. Adding a trainable complementary generator that contributes logits for addition to the selected expert's distribution expands the effective policy class, enabling recovery of optimal value functions under mild conditions while delivering higher empirical performance than sequence-level or token-level baselines, model merging, and fine-tuning across multiple model families and task domains.

What carries the argument

The complementary generator inside the router, which produces a logit vector added directly to the selected expert's next-token logits at every decoding step.

If this is right

  • The method outperforms both sequence-level and token-level collaboration baselines on mathematical reasoning, code generation, and instruction following.
  • It also exceeds model merging and direct fine-tuning of a single model while staying competitive with specialized domain experts.
  • The gains appear across both Llama-3 and Gemma-2 model families without requiring changes to the underlying expert architectures.
  • The framework remains efficient because the router is lightweight and operates at the token level during decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same router-plus-correction pattern could be tested on mixtures that include models from different training regimes or sizes to see whether the policy-class expansion still holds.
  • If the complementary logits prove especially useful on out-of-distribution prompts, the approach might reduce reliance on ever-larger single models for broad coverage.
  • One could measure how much of the gain comes from the correction term versus the routing decision by ablating each component separately on held-out tasks.

Load-bearing premise

The mild conditions that allow the complementary generator to recover optimal value functions must actually hold during training and evaluation.

What would settle it

Removing the complementary logit addition and observing no performance drop on the same benchmarks would indicate that the added generator is not necessary for the reported gains.

read the original abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FusionRoute, a token-level multi-LLM collaboration framework in which a lightweight router both selects the most suitable expert at each decoding step and contributes a trainable complementary logit that is added to the expert's logits to refine the next-token distribution. It provides a theoretical analysis arguing that pure expert-only routing cannot realize the optimal decoding policy unless strong global coverage assumptions hold, while augmenting with the complementary generator expands the policy class to recover optimal value functions under mild conditions. Empirically, FusionRoute is shown to outperform sequence- and token-level collaboration, model merging, and direct fine-tuning across Llama-3 and Gemma-2 families on mathematical reasoning, code generation, and instruction-following benchmarks.

Significance. If the theoretical claims hold, the work offers a principled mechanism for token-level collaboration that addresses a fundamental limitation of routing-only approaches, potentially enabling more efficient use of specialized models without full merging or fine-tuning. The empirical results across multiple model families and task domains suggest practical gains in performance while remaining competitive with domain experts. The combination of theory and broad empirical evaluation strengthens the contribution relative to purely heuristic collaboration methods.

major comments (2)
  1. [§3] §3 (Theoretical Analysis): The mild conditions under which the complementary generator enables recovery of optimal value functions are stated at a high level but lack an explicit characterization or proof sketch showing they are strictly weaker than the global coverage assumptions critiqued for pure routing. It remains unclear whether the lightweight router's capacity suffices to approximate arbitrary corrections over the full vocabulary at every token without additional coverage or expressivity assumptions.
  2. [§5] §5 (Experiments): The reported outperformance lacks accompanying statistical significance tests (e.g., p-values or confidence intervals) and details on dataset sizes, splits, or variance across runs for the mathematical reasoning, code generation, and instruction-following benchmarks. This weakens the strength of the cross-method and cross-model claims.
minor comments (2)
  1. [§2] Notation for the router and complementary generator (e.g., how the logit addition is normalized or scaled) should be introduced earlier and used consistently in the method and theory sections to improve readability.
  2. [Figure 2] Figure 2 (or equivalent architecture diagram) would benefit from explicit labeling of the trainable components versus frozen experts to clarify the lightweight nature of the router.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to strengthen the theoretical analysis and experimental reporting.

read point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis): The mild conditions under which the complementary generator enables recovery of optimal value functions are stated at a high level but lack an explicit characterization or proof sketch showing they are strictly weaker than the global coverage assumptions critiqued for pure routing. It remains unclear whether the lightweight router's capacity suffices to approximate arbitrary corrections over the full vocabulary at every token without additional coverage or expressivity assumptions.

    Authors: We agree that the presentation of the mild conditions would benefit from greater explicitness. In the revised manuscript, we will add a proof sketch that formally characterizes these conditions (local per-step correction sufficiency between the expert policy and the optimal policy) and demonstrates they are strictly weaker than the global coverage assumptions required for pure routing. We will also clarify that the complementary generator is not required to approximate arbitrary corrections; it only needs to recover the specific residual needed for optimality under the mild conditions. Regarding router capacity, we will include a brief analysis showing that the lightweight trainable network suffices for these targeted corrections without demanding full expressivity over the vocabulary at every step. revision: yes

  2. Referee: [§5] §5 (Experiments): The reported outperformance lacks accompanying statistical significance tests (e.g., p-values or confidence intervals) and details on dataset sizes, splits, or variance across runs for the mathematical reasoning, code generation, and instruction-following benchmarks. This weakens the strength of the cross-method and cross-model claims.

    Authors: We acknowledge that additional statistical details and reporting would strengthen the empirical claims. In the revised manuscript, we will add p-values and confidence intervals for the main performance comparisons. We will also include explicit information on dataset sizes, train/validation/test splits, and variance across multiple runs for the mathematical reasoning, code generation, and instruction-following benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; theoretical claims and empirical results remain independent of inputs.

full rationale

The paper's derivation begins with the observation that pure expert routing cannot realize optimal policies without strong global coverage assumptions, then introduces a trainable complementary generator to expand the policy class and recover optimal value functions under milder conditions. This theoretical step is presented as an independent analysis rather than a definitional equivalence or fitted input renamed as prediction. No equations reduce the claimed recovery to the router's own parameters by construction, and no self-citation chain or ansatz smuggling is invoked to justify the mild conditions. Empirical outperformance on Llama-3 and Gemma-2 benchmarks across math, code, and instruction tasks supplies external falsifiability. The overall chain is therefore self-contained against external benchmarks, with the complementary logit serving as an additive correction whose benefit is tested rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete free parameters, axioms, or invented entities; all such elements would need to be extracted from the full manuscript.

pith-pipeline@v0.9.0 · 5788 in / 1132 out tokens · 38625 ms · 2026-05-22T12:21:46.968438+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

    cs.SE 2026-04 unverdicted novelty 6.0

    Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    Ensemble learning for large language models in text and code generation: A survey.arXiv preprint arXiv:2503.13505,

    Mari Ashiga, Wei Jie, Fan Wu, Vardan Voskanyan, Fateme Dinmohammadi, Paul Brookes, Jingzhi Gong, and Zheng Wang. Ensemble learning for large language models in text and code generation: A survey.arXiv preprint arXiv:2503.13505,

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

  3. [3]

    Collab: Controlled decoding using mixture of agents for llm alignment.arXiv preprint arXiv:2503.21720,

    Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, and Sumitra Ganesh. Collab: Controlled decoding using mixture of agents for llm alignment.arXiv preprint arXiv:2503.21720,

  4. [4]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate.arXiv preprint arXiv:2308.07201,

  5. [5]

    Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

    Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, et al. Harnessing multiple large language models: A survey on llm ensemble.arXiv preprint arXiv:2502.18036,

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  7. [7]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941,

  8. [8]

    R2r: Efficiently navigating divergent reasoning paths with small-large model token routing.arXiv preprint arXiv:2505.21600,

    Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. R2r: Efficiently navigating divergent reasoning paths with small-large model token routing.arXiv preprint arXiv:2505.21600,

  9. [9]

    Mergebench: A benchmark for merging domain-specialized llms.arXiv preprint arXiv:2505.10833,

    Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, and Han Zhao. Mergebench: A benchmark for merging domain-specialized llms.arXiv preprint arXiv:2505.10833,

  10. [10]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

  11. [11]

    Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

    Nam Huynh and Beiyu Lin. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications.arXiv preprint arXiv:2503.01245,

  12. [12]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089,

  13. [13]

    Mixtral of Experts

    16 Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a. Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code ge...

  14. [14]

    The inductive bias of in-context learning: Rethinking pretraining example design.arXiv preprint arXiv:2110.04541,

    Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, and Amnon Shashua. The inductive bias of in-context learning: Rethinking pretraining example design.arXiv preprint arXiv:2110.04541,

  15. [15]

    Encouraging divergent thinking in large language models through multi-agent debate

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 17889–17904,

  16. [16]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

  17. [17]

    Groupdebate: Enhancing the efficiency of multi-agent debate using group discussion.arXiv preprint arXiv:2409.14051,

    Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, and Jing Li. Groupdebate: Enhancing the efficiency of multi-agent debate using group discussion.arXiv preprint arXiv:2409.14051,

  18. [18]

    Quantifying generalization complexity for large language models.arXiv preprint arXiv:2410.01769,

    Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, and James Glass. Quantifying generalization complexity for large language models.arXiv preprint arXiv:2410.01769,

  19. [19]

    Infobench: Evaluating instruction following ability in large language models.arXiv preprint arXiv:2401.03601,

    Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. Infobench: Evaluating instruction following ability in large language models.arXiv preprint arXiv:2401.03601,

  20. [20]

    Measuring inductive biases of in-context learning with underspecified demonstrations.arXiv preprint arXiv:2305.13299,

    Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, and He He. Measuring inductive biases of in-context learning with underspecified demonstrations.arXiv preprint arXiv:2305.13299,

  21. [21]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

  22. [22]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

  23. [23]

    Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023.https://huggingface

    Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023.https://huggingface. co/datasets/teknium/OpenHermes-2.5. Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, et al. Beyond reward hacking: Causal rewards for large language model alignment.arXi...

  24. [24]

    The perfect blend: Redefining rlhf with mixture of judges.arXiv preprint arXiv:2409.20370,

    Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shaoliang Nie, Chen Zhu, Hejia Zhang, Wenxuan Zhou, et al. The perfect blend: Redefining rlhf with mixture of judges.arXiv preprint arXiv:2409.20370,

  25. [25]

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. Openmoe: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739,

  26. [26]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  27. [27]

    Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities.arXiv preprint arXiv:2408.07666,

  28. [28]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471,

  29. [29]

    S’more: Structural mixture of residual experts for parameter-efficient llm fine-tuning

    Hanqing Zeng, Yinglong Xia, Zhuokai Zhao, Chuan Jiang, Qiang Zhang, Jiayi Liu, Qunshu Zhang, Lizhu Zhang, Xiangjun Fan, and Benyu Zhang. S’more: Structural mixture of residual experts for parameter-efficient llm fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, ...

  30. [30]

    Boosting llm reasoning via spontaneous self-correction.arXiv preprint arXiv:2506.06923,

    Xutong Zhao, Tengyu Xu, Xuewei Wang, Zhengxing Chen, Di Jin, Liang Tan, Zishun Yu, Zhuokai Zhao, Yun He, Sinong Wang, et al. Boosting llm reasoning via spontaneous self-correction.arXiv preprint arXiv:2506.06923,

  31. [31]

    Thought communication in multiagent collaboration.arXiv preprint arXiv:2510.20733,

    Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, and Kun Zhang. Thought communication in multiagent collaboration.arXiv preprint arXiv:2510.20733,

  32. [32]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

  33. [33]

    Mixture-of-minds: Multi-agent reinforcement learning for table understanding.arXiv preprint arXiv:2510.20176,

    Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwei Li, et al. Mixture-of-minds: Multi-agent reinforcement learning for table understanding.arXiv preprint arXiv:2510.20176,