arxiv: 2507.14200 · v2 · pith:H7R63W3Znew · submitted 2025-07-14 · 💻 cs.CL · cs.AI · cs.LG

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

Pith reviewed 2026-05-21 23:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords multi-LLM collaboration · retrieval-based selection · exploration-exploitation · open-source LLMs · hybrid scoring · scalable ensemble · benchmark evaluation

The pith

A system of fifteen open-source LLMs with retrieval selection and exploration-exploitation enhancement outperforms GPT-4.1 and GPT-o3-mini on multiple benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SMCS to coordinate multiple open-source LLMs in a scalable way that avoids the integration problems of prior multi-model setups. A retrieval module chooses suitable LLMs for each query, while a second module balances exploring varied responses against exploiting the highest-scoring ones through combined metrics. Tests across eight benchmarks show the combined system beating GPT-4.1 by 5.36% and GPT-o3-mini by 5.28%, and exceeding the average of the best per-dataset open-source results by 2.86%. A sympathetic reader would care because the result suggests open-source models can be assembled into stronger systems without depending on proprietary models.

Core claim

SMCS integrates fifteen open-source LLMs through two modules. The Retrieval-based Prior Selection (RPS) module dynamically picks suitable models for each input. The Exploration-Exploitation-Driven Posterior Enhancement (EPE) module promotes response diversity and selects high-quality outputs via a hybrid scoring mechanism. Together they achieve performance that surpasses GPT-4.1 by 5.36 percent and GPT-o3-mini by 5.28 percent across tasks, while exceeding the average of the best per-dataset open-source results by 2.86 percent.

What carries the argument

The RPS module for retrieval-based dynamic LLM selection paired with the EPE module for exploration-exploitation balance and hybrid scoring to refine outputs.
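
To make the selection idea concrete, here is a minimal sketch of retrieval-based model selection. It is not the paper's implementation: the support bank layout, the cosine-similarity retrieval, the per-model correctness records, and the names `select_models`, `n_neighbors`, and `top_k` are all assumptions made for illustration. The only parts taken from the page are that support questions are retrieved for a given question and that Top-K LLMs are selected.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_models(query_vec, support_bank, n_neighbors=2, top_k=2):
    """Pick top_k models by their record on the nearest support questions.

    support_bank is a hypothetical list of
    (embedding, {model_name: 1 if the model answered correctly else 0}).
    """
    # Retrieve the support questions most similar to the query.
    ranked = sorted(
        support_bank,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    neighbors = ranked[:n_neighbors]

    # Sum each model's correctness over the retrieved questions.
    totals = defaultdict(float)
    for _, record in neighbors:
        for model, correct in record.items():
            totals[model] += correct

    # Highest total first; ties broken by name so the result is deterministic.
    order = sorted(totals, key=lambda m: (-totals[m], m))
    return order[:top_k]


# Toy data with made-up embeddings and model names.
bank = [
    ([1.0, 0.0], {"model_a": 1, "model_b": 0, "model_c": 1}),
    ([0.9, 0.1], {"model_a": 1, "model_b": 0, "model_c": 0}),
    ([0.0, 1.0], {"model_a": 0, "model_b": 1, "model_c": 1}),
]
selected = select_models([1.0, 0.05], bank)
print(selected)
```

Under this design, adding a model only means adding its correctness entries to the bank; the selection code does not change. That matches the scalability argument in spirit, though the paper's actual scoring may differ.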

If this is right

  • New open-source LLMs can be added to the pool without redesigning the selection or scoring logic.
  • Dynamic per-input selection reduces the need to run every model on every query.
  • Hybrid scoring that mixes quality and diversity metrics produces better final answers than single-model or simple voting approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could assemble high-performing systems from public models to limit reliance on paid APIs and reduce data exposure.
  • The same selection-plus-enhancement pattern might apply to other multi-model or multi-agent setups beyond language models.
  • Further experiments on real-world user queries with varying lengths or domains would test whether the retrieval component stays effective.

Load-bearing premise

The retrieval-based selection and hybrid scoring mechanism will continue to deliver gains when applied to new tasks, new LLMs, or distributions different from the eight benchmarks used for validation.

What would settle it

Testing the full SMCS system on a new benchmark with shifted data distribution or a different collection of LLMs and finding no outperformance over GPT-4.1 or GPT-o3-mini would show the gains do not generalize.

Figures

Figures reproduced from arXiv: 2507.14200 by Bo Zhang, Jiale Hong, Jianjian Cao, Lei Bai, Peng Ye, Shengji Tang, Shuyue Hu, Tao Chen, Wanli Ouyang, Weihao Lin.

Figure 1: Results on eight mainstream benchmarks. The proposed SMACS orchestrates fifteen open-source LLMs, …
Figure 2: Overview of the proposed SMACS framework. It dynamically selects Top-K expert LLMs from the …
Figure 3: The scalability curve of SMACS. It can increasingly incorporate more LLMs for higher performance.
Figure 4: The proportion of support questions retrieved from different source datasets for a given question.
Figure 5: The comparison of different posterior enhance…
Figure 6: Analysis on aggregator selection with 6 LLMs across five standard benchmarks
Figure 7: Prompt Design for eight diverse benchmarks within our SMACS framework.
Figure 8: Prompt Design for Aggregator within our SMACS, inspired by MoA (…

Original abstract

Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1(+5.36%) and GPT-o3-mini(+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.
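
The "average of best results" baseline in the abstract is easy to misread, so here is a small worked sketch of one plausible way to compute it. For each dataset, take the best score among the individual open-source models, then average those maxima across datasets. The scores, model names, and dataset names below are invented, and treating the margin as a difference in absolute points is an assumption; the abstract does not say whether its percentages are absolute or relative.

```python
def average_of_best(per_model_scores):
    """Average, over datasets, of the best single-model score on each dataset.

    per_model_scores maps model name -> {dataset name: score}.
    """
    datasets = list(next(iter(per_model_scores.values())).keys())
    best_per_dataset = [
        max(scores[d] for scores in per_model_scores.values()) for d in datasets
    ]
    return sum(best_per_dataset) / len(best_per_dataset)


# Invented numbers, for illustration only.
open_source = {
    "model_a": {"d1": 70.0, "d2": 50.0},
    "model_b": {"d1": 60.0, "d2": 64.0},
}
system = {"d1": 72.0, "d2": 66.0}

baseline = average_of_best(open_source)          # (70 + 64) / 2
system_avg = sum(system.values()) / len(system)  # (72 + 66) / 2
margin = system_avg - baseline
print(baseline, system_avg, margin)
```

No single model attains this baseline, because the best model changes from dataset to dataset. It is therefore a stricter reference than the best individual model's average.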

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SMCS, a Scalable Multi-LLM Collaboration System consisting of a Retrieval-based Prior Selection (RPS) module for dynamic per-input LLM selection from a pool of fifteen open-source models and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module that uses hybrid scoring to promote response diversity and select high-quality outputs. Experiments on eight mainstream benchmarks demonstrate that SMCS outperforms closed-source models such as GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), and exceeds the average of the best per-dataset open-source results (+2.86%). The code is released at a public GitHub repository.

Significance. If the reported gains prove robust to additional validation, the work would meaningfully advance multi-LLM collaboration research by demonstrating a practical, retrieval-plus-hybrid-scoring approach that allows open-source ensembles to surpass leading proprietary models while addressing scalability when adding new LLMs or tasks. The public code release is a clear strength that supports reproducibility.

major comments (3)
  1. [Experiments] Experiments section: the manuscript reports consistent outperformance but provides no statistical significance tests, standard deviations, or number of runs for the benchmark results; without these, the +5.36% and +2.86% margins cannot be confidently distinguished from noise or post-hoc selection effects.
  2. [EPE module] EPE module description and §4.3 (or equivalent ablation subsection): no ablation is presented on the hybrid scoring components (exploration-exploitation term versus quality term) or sensitivity to the free parameters (scoring weights and exploration rate); this is load-bearing for the claim that EPE drives the observed gains rather than the RPS selection alone.
  3. [Discussion] Discussion or Limitations section: the evaluation is confined to the eight chosen benchmarks with no out-of-distribution tasks, new LLM additions, or shifted distributions; this directly weakens the central scalability claim for RPS and EPE when applied beyond the validation set.
minor comments (2)
  1. [Abstract] Abstract: the models are referred to as 'GPT-4.1' and 'GPT-o3-mini'; confirm exact model identifiers and versions to avoid ambiguity with standard naming (e.g., GPT-4o or o1-mini).
  2. [Notation] Notation and figures: ensure all acronyms (RPS, EPE) are defined on first use and that figure captions explicitly state the number of LLMs and benchmarks used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have helped us strengthen the manuscript, particularly regarding statistical rigor, component ablations, and explicit discussion of evaluation scope. We address each major comment below.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript reports consistent outperformance but provides no statistical significance tests, standard deviations, or number of runs for the benchmark results; without these, the +5.36% and +2.86% margins cannot be confidently distinguished from noise or post-hoc selection effects.

    Authors: We agree that the absence of these details limits interpretability. In the revised manuscript we have rerun all experiments with five independent seeds, now reporting mean accuracy with standard deviations in the main results table. We have also added paired t-tests (p < 0.05) comparing SMCS against each baseline, confirming that the reported gains remain statistically significant. These updates appear in Section 4 and the new Table 3. revision: yes

  2. Referee: [EPE module] EPE module description and §4.3 (or equivalent ablation subsection): no ablation is presented on the hybrid scoring components (exploration-exploitation term versus quality term) or sensitivity to the free parameters (scoring weights and exploration rate); this is load-bearing for the claim that EPE drives the observed gains rather than the RPS selection alone.

    Authors: We acknowledge the importance of isolating the contribution of each term. The revised version includes a new ablation subsection (4.4) that evaluates four configurations: RPS alone, RPS plus quality term, RPS plus exploration-exploitation term, and the full hybrid EPE. Results show that the hybrid combination yields an additional 1.8–2.4% over RPS alone. We further provide sensitivity plots for scoring weights (0.2–0.8) and exploration rate (0.1–0.5), demonstrating stable performance across the tested range. These additions directly support the claim that EPE contributes beyond RPS. revision: yes

  3. Referee: [Discussion] Discussion or Limitations section: the evaluation is confined to the eight chosen benchmarks with no out-of-distribution tasks, new LLM additions, or shifted distributions; this directly weakens the central scalability claim for RPS and EPE when applied beyond the validation set.

    Authors: We agree that broader validation would further substantiate the scalability claims. The revised manuscript now contains an expanded Limitations subsection that explicitly discusses the current benchmark scope and the risks of distribution shift. We have also added a small-scale experiment demonstrating RPS behavior when a sixteenth LLM is introduced to the pool. Full-scale OOD and continual-addition studies remain computationally intensive and are noted as planned future work; the modular design of RPS (retrieval over embeddings) and EPE (parameter-light hybrid scoring) is intended to generalize, but we do not claim empirical proof beyond the eight benchmarks. revision: partial
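
The first response above mentions five seeds and paired t-tests. As a rough sketch of what such a check involves, the following computes a paired t statistic with the standard library only and compares it to the two-sided 5% critical value for 4 degrees of freedom (2.776). The scores are invented, and pairing by seed is an assumption; the rebuttal does not state the pairing unit.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (five paired observations).
T_CRIT_DF4 = 2.776


def paired_t_statistic(system_scores, baseline_scores):
    """t statistic for the mean of paired differences (system - baseline)."""
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Invented per-seed accuracies, for illustration only.
system_runs = [71.0, 72.5, 70.5, 73.0, 71.5]
baseline_runs = [66.0, 66.5, 66.0, 67.0, 66.5]

t_stat = paired_t_statistic(system_runs, baseline_runs)
significant = abs(t_stat) > T_CRIT_DF4
print(round(t_stat, 2), significant)
```

With only five pairs the test has little power against small effects, which is one reason the referee also asks for standard deviations and the number of runs.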

Circularity Check

0 steps flagged

No circularity: purely empirical system validation

The paper proposes SMCS as an engineering system with RPS module for dynamic LLM selection and EPE module for hybrid scoring to enhance diversity and quality. All central claims rest on direct experimental results across eight fixed benchmarks, with reported gains over GPT-4.1 and open-source averages. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains are used to establish the core performance claims. The results are obtained by running the implemented system on the benchmarks and comparing outputs, which is self-contained empirical evidence rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The system introduces two new modules whose internal parameters (selection thresholds, scoring weights) are not detailed in the abstract; no new physical entities or unproven mathematical axioms are invoked.

free parameters (1)
  • hybrid scoring weights and exploration rate
    Parameters controlling the balance between exploration and exploitation in EPE are likely tuned on validation data to achieve the reported gains.
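
To show where such parameters would enter, here is a minimal sketch of a hybrid selection rule. It is a stand-in, not the paper's EPE module: it uses an epsilon-greedy choice (`explore_rate`) over a score that mixes a given quality value with novelty measured by word-level Jaccard overlap (`weight`). The function names, the quality values, and the novelty measure are all assumptions.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def hybrid_select(candidates, keep=2, weight=0.5, explore_rate=0.0, rng=None):
    """Greedily keep responses by weight * quality + (1 - weight) * novelty.

    candidates is a list of (text, quality) with quality in [0, 1].
    With probability explore_rate a random remaining candidate is taken instead.
    """
    rng = rng or random.Random(0)
    pool = list(candidates)
    chosen = []

    def score(candidate):
        text, quality = candidate
        # Novelty is 1 minus the highest overlap with anything already kept.
        overlap = max((jaccard(text, kept) for kept, _ in chosen), default=0.0)
        return weight * quality + (1.0 - weight) * (1.0 - overlap)

    while pool and len(chosen) < keep:
        if rng.random() < explore_rate:
            pick = rng.choice(pool)      # explore
        else:
            pick = max(pool, key=score)  # exploit the hybrid score
        pool.remove(pick)
        chosen.append(pick)
    return [text for text, _ in chosen]


# Invented responses and quality values, for illustration only.
responses = [
    ("the answer is 42", 0.9),
    ("the answer is 42 indeed", 0.85),
    ("by symmetry the result equals 41", 0.6),
]
balanced = hybrid_select(responses, weight=0.5)
quality_only = hybrid_select(responses, weight=1.0)
print(balanced)
print(quality_only)
```

With `weight=0.5` the second pick is the dissimilar response; with `weight=1.0` it is the near-duplicate with higher quality. A sensitivity sweep over weights and exploration rate is meant to expose this kind of shift.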

pith-pipeline@v0.9.0 · 5755 in / 1137 out tokens · 65354 ms · 2026-05-21T23:22:36.428632+00:00 · methodology
