pith. sign in

arxiv: 2606.22902 · v3 · pith:I53B5DSLnew · submitted 2026-06-22 · 💻 cs.AI

Agent-as-a-Router: Agentic Model Routing for Coding Tasks

Pith reviewed 2026-06-29 05:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic routingLLM routingmodel selectioncumulative regretcoding tasksinformation deficitCodeRouterBenchC-A-F loop
0
0 comments X

The pith

Framing model routing as an agentic Context-Action-Feedback loop that stores execution outcomes reduces cumulative regret by closing the information gap between available LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies information deficit as the core limit on routing each coding task to the best available LLM. Static classification routers cannot learn from how models actually perform on the exact task, even when given task-level performance statistics. The proposed framework turns routing into a repeated loop: the Orchestrator picks a model, the Verifier evaluates the output, and the Memory records the grounded result to refine the next Context. This matters because users routinely access several frontier models whose strengths differ by task, so better selection directly improves both quality and cost. On a new benchmark of roughly ten thousand tasks with verified scores from eight models, the resulting system records the lowest cumulative regret on in-distribution cases and continues to improve on out-of-distribution agentic programming tasks.

Core claim

The authors formalize routing as a C-A-F loop in which an Orchestrator selects an LLM, a Verifier checks the result, and a Memory module accumulates execution-grounded feedback to update the Context for the next task. This active accumulation closes the information gap that static routers face, as evidenced by the performance lift from even simple dimension-level statistics and the further gains from the full agentic system on both in-distribution and out-of-distribution coding tasks.

What carries the argument

The C-A-F loop (Context->Action->Feedback->Context) instantiated through an Orchestrator, Verifier, and Memory module that stores execution-grounded experience.

If this is right

  • ACRouter records the lowest cumulative regret on in-distribution tasks drawn from the CodeRouterBench distribution.
  • The same routing framework generalizes to out-of-distribution agentic-programming tasks without retraining.
  • Augmenting a vanilla LLM router with task-dimension performance statistics already yields a 15.3 percent relative gain over a heuristic router.
  • The CodeRouterBench environment of roughly ten thousand task instances with verified scores from eight frontier LLMs supports regret-based comparison on streaming tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback-accumulation pattern could be applied to non-coding domains where model strengths also vary by task type.
  • Over longer deployments the Memory module might allow the router to anticipate model suitability before any execution occurs.
  • Treating routing as an online learning process rather than a one-shot classification problem may reduce wasted API calls across any multi-model deployment.
  • The approach implies that routers can improve without access to model internals, relying only on observable task outcomes.

Load-bearing premise

Execution-grounded feedback stored in the Memory module can be used reliably by the Orchestrator and Verifier without introducing new errors or selection bias.

What would settle it

Running the full ACRouter system on a fresh stream of coding tasks and observing that its cumulative regret exceeds the regret of a static router supplied with the same task-level performance statistics would falsify the claim that the agentic loop closes the information gap.

read the original abstract

Real-world users typically have access to multiple Large Language Models (LLMs) from different providers, and these LLMs often excel at distinct domains, yet none dominate all. Consequently, routing each task to the most suitable model becomes critical for both performance and cost. Existing routers treat this as a static, one-off classification problem. However, we identify the performance bottleneck for these routers as information deficit: simply augmenting a vanilla LLM router with performance statistics at the task-dimension level yields a 15.3% relative gain, surpassing a heuristic router built on the same dimension-level priors. Motivated by this finding, we propose Agent-as-a-Router, a framework that formalizes routing as a C-A-F loop (Context->Action->Feedback->Context). It closes the information gap by accumulating execution-grounded experience during deployment. We instantiate this framework as ACRouter, composed of an Orchestrator, a Verifier, a Memory module, and introduce CodeRouterBench, an evaluation environment comprising ~10K task instances with verified scores from 8 frontier LLMs, enabling regret-based router comparison on streaming tasks. Experiments show that ACRouter achieves the lowest cumulative regret on in-distribution tasks and generalizes to out-of-distribution agentic-programming tasks, demonstrating that our routing framework actively closes the information gap. Codes and benchmarks are released at https://github.com/LanceZPF/agent-as-a-router.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that model routing for coding tasks is bottlenecked by information deficit, shows that augmenting a router with task-dimension performance statistics yields a 15.3% relative gain over a heuristic baseline, and introduces the Agent-as-a-Router (ACRouter) framework that formalizes routing as a Context-Action-Feedback loop. ACRouter comprises an Orchestrator, Verifier, and Memory module; it is evaluated on the new CodeRouterBench (~10K tasks across 8 LLMs) and reportedly achieves lowest cumulative regret on in-distribution tasks while generalizing to out-of-distribution agentic-programming tasks.

Significance. If the empirical claims hold after proper validation, the work would demonstrate that an agentic feedback loop can measurably close the information gap in LLM routing for coding, moving beyond static classification. The release of CodeRouterBench and code would also provide a reusable regret-based benchmark for streaming router evaluation.

major comments (3)
  1. [Abstract] Abstract: The reported 15.3% relative gain from dimension-level statistics is presented as the key motivation, yet the abstract supplies no implementation details on how the statistics were collected, how the heuristic router was constructed, the number of tasks or runs, or any statistical significance test; without these the gain cannot be assessed as load-bearing evidence for the information-deficit premise.
  2. [Framework description / Experiments] Framework and Experiments (implied sections): The central claim that the C-A-F loop reliably closes the information gap via execution-grounded feedback assumes the Verifier and Orchestrator ingest Memory without introducing new errors or selection bias, but the manuscript provides no ablation on the Verifier, no analysis of feedback-error propagation, and no comparison of regret with vs. without the Verifier component.
  3. [Experiments / CodeRouterBench] Evaluation: CodeRouterBench is introduced with verified scores, yet the text gives no external validation of the benchmark (e.g., correlation with human judgments or comparison to existing coding benchmarks) and reports lowest cumulative regret without specifying the number of streaming tasks, variance across runs, or regret curves for the baselines.
minor comments (2)
  1. [Abstract] The acronym C-A-F is introduced without an explicit expansion on first use in the abstract.
  2. [Abstract] The link to the GitHub repository is given but no commit hash or reproducibility instructions are supplied.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The reported 15.3% relative gain from dimension-level statistics is presented as the key motivation, yet the abstract supplies no implementation details on how the statistics were collected, how the heuristic router was constructed, the number of tasks or runs, or any statistical significance test; without these the gain cannot be assessed as load-bearing evidence for the information-deficit premise.

    Authors: We agree the abstract is too concise to include these details. The collection of task-dimension statistics, heuristic router construction, use of ~10K tasks across multiple runs, and statistical tests are described in the Experiments section. We will revise the abstract to add a brief clause referencing the experimental setup that supports the 15.3% gain. revision: yes

  2. Referee: [Framework description / Experiments] Framework and Experiments (implied sections): The central claim that the C-A-F loop reliably closes the information gap via execution-grounded feedback assumes the Verifier and Orchestrator ingest Memory without introducing new errors or selection bias, but the manuscript provides no ablation on the Verifier, no analysis of feedback-error propagation, and no comparison of regret with vs. without the Verifier component.

    Authors: This observation is correct; the manuscript lacks an ablation isolating the Verifier and any analysis of error propagation or selection bias. We will add an ablation study comparing cumulative regret with and without the Verifier, together with a short discussion of feedback-error propagation, in the revised manuscript. revision: yes

  3. Referee: [Experiments / CodeRouterBench] Evaluation: CodeRouterBench is introduced with verified scores, yet the text gives no external validation of the benchmark (e.g., correlation with human judgments or comparison to existing coding benchmarks) and reports lowest cumulative regret without specifying the number of streaming tasks, variance across runs, or regret curves for the baselines.

    Authors: The benchmark uses verified scores from 8 LLMs on ~10K tasks; external validation via human correlation or comparison to other coding benchmarks was not performed and will be noted as a limitation. The Experiments section already reports the number of streaming tasks, run-to-run variance, and regret curves for all methods; we will make these quantities and the corresponding figures more explicitly referenced in the revision. revision: partial

standing simulated objections not resolved
  • External validation of CodeRouterBench (human judgment correlation or comparison to existing coding benchmarks), as this requires new experiments outside the scope of the submitted work.

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

full rationale

The paper presents no equations, derivations, or self-citations that reduce its central claims to fitted quantities or prior inputs by construction. The 15.3% gain is reported as an independent observation from augmenting a baseline router with dimension-level statistics and is used only as motivation for the C-A-F framework; the headline experimental result (lowest cumulative regret on CodeRouterBench) is obtained from separate streaming-task evaluations of the full ACRouter system. No load-bearing premise collapses to a self-definition or renamed fit. The derivation chain is self-contained against the released benchmark and external model scores.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

Review performed on abstract only; full paper text unavailable so ledger is limited to elements explicitly named.

axioms (1)
  • domain assumption LLMs from different providers excel at distinct domains and none dominate all tasks
    Opening sentence of abstract used to motivate the routing problem.
invented entities (3)
  • C-A-F loop no independent evidence
    purpose: Formalizes routing as iterative Context-Action-Feedback-Context process
    Core framework introduced to close the information gap.
  • ACRouter no independent evidence
    purpose: Concrete system with Orchestrator, Verifier, and Memory modules
    Instantiation of the proposed framework.
  • CodeRouterBench no independent evidence
    purpose: Evaluation environment of ~10K tasks with verified scores from 8 LLMs
    New benchmark enabling regret-based comparison.

pith-pipeline@v0.9.1-grok · 5812 in / 1444 out tokens · 32383 ms · 2026-06-29T05:00:39.287323+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Automix: Automatically mixing language models.Advances in Neural Information Processing Systems, 37: 131000–131034, 2024

    Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. Automix: Automatically mixing language models.Advances in Neural Information Processing Systems, 37: 131000–131034, 2024

  2. [2]

    Thompson sampling for contextual bandits with linear payoffs

    Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InInternational Conference on Machine Learning, pages 127–135, 2013

  3. [3]

    Claude code: An agentic coding tool.https://docs.anthropic.com/en/docs/ agents-and-tools/claude-code/overview, 2025

    Anthropic. Claude code: An agentic coding tool.https://docs.anthropic.com/en/docs/ agents-and-tools/claude-code/overview, 2025

  4. [4]

    Introducing claude opus 4.6

    Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026

  5. [5]

    Introducing claude sonnet 4.6

    Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, 2026

  6. [6]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. 11 Agent-as-a-Router: Agentic Model Routing for Coding Tasks

  7. [7]

    Can it edit? evaluating the ability of large language models to follow code editing instructions

    Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, et al. Can it edit? evaluating the ability of large language models to follow code editing instructions. InConference on Language Modeling

  8. [8]

    Multipl-e: A scalable and polyglot approach to benchmarking neural code generation.IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023

    Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation.IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023

  9. [9]

    Swe-ci: Evaluating agent capa- bilities in maintaining codebases via continuous integration.arXiv preprint arXiv:2603.03823, 2026

    Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, and Bing Zhao. Swe-ci: Evaluating agent capa- bilities in maintaining codebases via continuous integration.arXiv preprint arXiv:2603.03823, 2026

  10. [10]

    Frugalgpt: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research

  11. [11]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  12. [12]

    Hybrid llm: Cost-efficient and quality-aware query routing.arXiv preprint arXiv:2404.14618, 2024

    DujianDing, AnkurMallick, ChiWang, RobertSim, SubhabrataMukherjee, VictorRuhle, LaksVS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing.arXiv preprint arXiv:2404.14618, 2024

  13. [13]

    Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces.arXiv preprint arXiv:2602.14337, 2026

    Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, et al. Longcli-bench: A preliminary benchmark and study for long-horizon agentic programming in command-line interfaces.arXiv preprint arXiv:2602.14337, 2026

  14. [14]

    Cruxeval: a benchmark for code reasoning, understanding and execution

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: a benchmark for code reasoning, understanding and execution. InProceedings of the 41st International Conference on Machine Learning, pages 16568–16621, 2024

  15. [15]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

  17. [17]

    LiveCodeBench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations

  18. [18]

    SWE-Bench: Can language models resolve real-world GitHub issues? InInterna- tional Conference on Learning Representations, 2023

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-Bench: Can language models resolve real-world GitHub issues? InInterna- tional Conference on Learning Representations, 2023

  19. [19]

    DS-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. InInternational Conference on Machine Learning, pages 18319–18345, 2023. 12 Agent-as-a-Router: Agentic Model Routing for Coding Tasks

  20. [20]

    Cambridge University Press, 2020

    Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

  21. [21]

    LLMRouterBench: A massive benchmark and unified framework for llm routing.arXiv preprint arXiv:2601.07206, 2026

    Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, et al. LLMRouterBench: A massive benchmark and unified framework for llm routing.arXiv preprint arXiv:2601.07206, 2026

  22. [22]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010

  23. [23]

    Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

  24. [24]

    CodeXGLUE: A machine learning bench- mark dataset for code understanding and generation

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. CodeXGLUE: A machine learning bench- mark dataset for code understanding and generation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  25. [25]

    Minimax m2.7.https://www.minimax.io/models/text/m27, 2026

    MiniMax. Minimax m2.7.https://www.minimax.io/models/text/m27, 2026

  26. [26]

    RouteLLM: Learning to route LLMs from preference data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations

  27. [27]

    Introducing swe-bench verified

    OpenAI. Introducing swe-bench verified. https://openai.com/index/ introducing-swe-bench-verified/, 2024

  28. [28]

    Codex cli: An agentic coding assistant

    OpenAI. Codex cli: An agentic coding assistant. https://github.com/openai/codex, 2025

  29. [29]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4, 2026

  30. [30]

    Chatdev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

  31. [31]

    Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id= qwen3.5, 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id= qwen3.5, 2026

  32. [32]

    Composer 2 technical report.arXiv preprint arXiv:2603.24477, 2026

    Cursor Research, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, et al. Composer 2 technical report.arXiv preprint arXiv:2603.24477, 2026

  33. [33]

    Fly-swat or cannon? cost-effective language model choice via meta-modeling

    Marija Šakota, Maxime Peyrard, and Robert West. Fly-swat or cannon? cost-effective language model choice via meta-modeling. InProceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 606–615, 2024

  34. [34]

    Large language model routing with benchmark datasets

    Tal Shnitzer, Anthony Ou, Mirian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. InConference on Language Modeling, 2024

  35. [35]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026. 13 Agent-as-a-Router: Agentic Model Routing for Coding Tasks

  36. [36]

    An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology, 28(4):1–29, 2019

    Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology, 28(4):1–29, 2019

  37. [37]

    Llm router: Rethinking routing with prefill activations.arXiv e-prints, pages arXiv–2603, 2026

    Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, and Davide Onofrio. Llm router: Rethinking routing with prefill activations.arXiv e-prints, pages arXiv–2603, 2026

  38. [38]

    Openhands: An open platform for ai soft- ware developers as generalist agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai soft- ware developers as generalist agents. InThe Thirteenth International Conference on Learning Representations, 2025

  39. [39]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

  40. [40]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, ChengenHuang, ChenxuLv, etal. Qwen3technicalreport.arXivpreprintarXiv:2505.09388, 2025

  41. [41]

    Swe-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  42. [42]

    Masrouter: Learning to route llms for multi-agent systems

    Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

  43. [43]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

  44. [44]

    Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

  45. [45]

    Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. InProceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pages 5673–5684, 2023

  46. [46]

    Featurebench: Benchmarking agentic coding for complex feature development.arXiv preprint arXiv:2602.10975, 2026

    Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, et al. Featurebench: Benchmarking agentic coding for complex feature development.arXiv preprint arXiv:2602.10975, 2026

  47. [47]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Bench- marking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877, 2024. 14 Agent-as-a-Router: Agentic Model Routing for Coding Tasks A. ACRout...

  48. [48]

    If fewer than 3 same-dimension examples are available, the remaining slots are filled from other dimensions

    Same-dimension priority: all 3 examples are sampled from tasks sharing the same dimension as the target task. If fewer than 3 same-dimension examples are available, the remaining slots are filled from other dimensions

  49. [49]

    Non-trivial examples only: tasks where all 8 models achieve identical scores are excluded, since they carry no routing signal

  50. [50]

    Oracle labels: each example shows the oracle-best model (the model with the highest score for that task, with ties broken by cost ascending then alphabetical) along with all 8 models’ scores, giving the LLM both the answer and the full score distribution

  51. [51]

    model":

    Prompt truncation: task prompts in examples are truncated to 300 characters to control input token cost; the target task’s prompt is included in full. 5.Fixed seed: examples are sampled with a fixed random seed (42) for reproducibility. 19 Agent-as-a-Router: Agentic Model Routing for Coding Tasks ## Examples ### Example 1 - Dimension: bug_fixing - Difficu...

  52. [52]

    Profileviaprobingset: UseCodeRouterBench(oryourowntasksviaC-A-F)tobuildadimension ×model 28 Agent-as-a-Router: Agentic Model Routing for Coding Tasks performance matrix

  53. [53]

    This is your baseline

    Start with DimensionBest: A static Memory + lookup achieves about 83% of oracle AvgPerf at near-zero overhead. This is your baseline

  54. [54]

    Trained classifiers achieve Perf/$ of 6.11–6.82 in Table 3

    Add a classifier: Swap the routing tool to a trained classifier (LogReg or RouteLLM) for cheaper deployment with comparable AvgPerf to DimensionBest. Trained classifiers achieve Perf/$ of 6.11–6.82 in Table 3

  55. [55]

    Initialize Memory with whatever priors you have; the C-A-F loop ensures convergence

    Complete the C-A-F loop (ACRouter): When deploying on new distributions, activate all three modules to close the feedback loop. Initialize Memory with whatever priors you have; the C-A-F loop ensures convergence

  56. [56]

    Customize tools: Swap evaluation tools in the Verifier (e.g., domain-specific tests), add custom routing tools

  57. [57]

    "" pairs_sum_to_zero takes a list of integers as an inp Qwen3.5 [score=1.00] ```python def pairs_sum_to_zero(l):

    Extend the benchmark: Add tasks via C-A-F. New models need responses + scoring; new dimensions need a task set + scoring function. E.3. Beyond Model Routing The C-A-F loop (observe context, act, receive feedback, update context) is not specific to model routing. The same paradigm applies to tool selection, API endpoint selection, prompt strategy selection...