pith. machine review for the scientific record.

arxiv: 2605.09121 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CL · cs.IT · math.IT

Recognition: no theorem link

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.IT · math.IT
keywords LLM reliability · stochastic channel model · communication theory · cost-quality tradeoff · adaptive routing · Pareto frontier · reliability operators · agent frameworks

The pith

Viewing LLM sampling as a stochastic channel unifies reliability techniques and enables a router that dominates fixed methods on the quality-cost frontier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agents built on large language models currently use reliability techniques such as retry, majority voting, and self-consistency without a shared analytical structure. The paper shows that sampling an LLM at temperature T produces outputs from a discrete stochastic channel in the Shannon sense. This view recasts each technique as one of six classical reliability operators from communication theory. The authors derive analytical results on averaging and refinement, then introduce a cost-aware router with a single tuning parameter that adapts per task. Experiments on hard splits of MMLU, GSM8K, and HumanEval demonstrate that this router traces the full Pareto frontier between quality and cost, outperforming any static combination.

Core claim

We observe that an LLM sampled at temperature T is a discrete stochastic channel p(y|x) in the sense of Shannon's coding theory, and use this identity as the entry point for a framework grounded in communication theory. Each reliability technique is a special case of one of six classical operators: diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing. The framework yields closed-form thresholds for averaging and a contractivity criterion for refinement. A cost-aware semantic-nearest-neighbor router with one Lagrangian knob traverses the quality-cost frontier without retraining.
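The one-knob routing rule described here admits a compact sketch. Everything below is illustrative: the technique names, the quality and cost estimates, and the idea that they come from a nearest-neighbor lookup over past tasks are assumptions, not the paper's implementation.

```python
# Minimal sketch of a cost-aware router with one Lagrangian knob (lam).
# The per-technique (quality, cost) estimates are hypothetical placeholders,
# e.g. what a semantic-nearest-neighbor lookup might return for a task.

def route(task_estimates, lam):
    """Pick the technique maximizing estimated quality minus lam * cost.

    lam = 0 chases quality alone; increasing lam trades quality for cost,
    sweeping out a quality-cost frontier without any retraining.
    """
    return max(task_estimates,
               key=lambda t: task_estimates[t][0] - lam * task_estimates[t][1])

estimates = {
    "single_call":      (0.60, 1.0),   # one generation
    "majority_vote_5":  (0.75, 5.0),   # five parallel generations
    "generator_critic": (0.72, 3.0),   # generate, critique, regenerate
}

print(route(estimates, lam=0.0))   # quality-first pick
print(route(estimates, lam=0.1))   # cost-sensitive pick
```

Sweeping `lam` from 0 upward moves the selected technique from the most expensive high-quality option toward the cheapest one, which is the sense in which a single knob "traverses" the frontier.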

What carries the argument

The mapping of LLM temperature sampling to a discrete stochastic channel p(y|x), which allows classical reliability operators to be applied directly to unify and optimize agent reliability techniques.

If this is right

  • No fixed model-technique-budget choice dominates across the six channel configurations and 69 tasks.
  • The router can achieve any point on the empirical quality-cost frontier by adjusting its single Lagrangian parameter.
  • A noise-variance threshold determines when uniform averaging outperforms quality-weighted averaging.
  • Generator-critic refinement is contractive only for models above a certain size, explaining observed transitions between 3B and 14B models.
  • Per-task adaptive allocation is required to reach optimal reliability performance.
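The noise-variance-threshold prediction above can be illustrated with a toy Monte Carlo. The Gaussian noise model and all numbers here are assumptions for illustration, not the paper's derivation: weighting samples by noisy quality estimates helps while the estimates are accurate, and hurts once their noise is large enough.

```python
import random

# Toy experiment: combine n noisy observations of a true value (1.0), either
# uniformly or weighted by a *noisy* estimate of each sample's quality.
# Illustrative model only: higher quality q means lower observation noise.

def avg_mse(weight_noise_sd, n=5, trials=20000, seed=0):
    rng = random.Random(seed)
    se_w = se_u = 0.0
    for _ in range(trials):
        q = [rng.uniform(0.5, 2.0) for _ in range(n)]           # true per-sample quality
        obs = [1.0 + rng.gauss(0.0, 1.0 / qi) for qi in q]      # noisier when quality is low
        est = [max(qi + rng.gauss(0.0, weight_noise_sd), 1e-6) for qi in q]
        w = [e / sum(est) for e in est]                         # quality-weighted combining
        weighted = sum(wi * oi for wi, oi in zip(w, obs))
        uniform = sum(obs) / n
        se_w += (weighted - 1.0) ** 2
        se_u += (uniform - 1.0) ** 2
    return se_w / trials, se_u / trials

clean_w, clean_u = avg_mse(0.0)   # accurate weights: weighted averaging wins
noisy_w, noisy_u = avg_mse(3.0)   # very noisy weights: uniform averaging wins
print(clean_w < clean_u, noisy_w > noisy_u)
```

Somewhere between the two noise settings the ordering flips, which is the qualitative shape of a noise-variance threshold.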

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The framework could extend to dynamic estimation of channel parameters during agent operation for real-time adaptation.
  • Similar channel models might apply to other LLM behaviors such as chain-of-thought reasoning steps.
  • Integrating this router into multi-turn agent conversations could treat each turn as a cascaded channel.

Load-bearing premise

LLM sampling at temperature T can be treated as a discrete stochastic channel in the Shannon sense so that classical reliability operators apply directly.
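This premise is checkable in miniature: repeated sampling at a fixed temperature defines an empirical conditional distribution p(y|x) whose entropy grows with temperature. The `sample_llm` stub and its toy logits below stand in for a real model call and are purely illustrative.

```python
import math
import random
from collections import Counter

def sample_llm(prompt, temperature, rng):
    """Stub for an LLM call: softmax over toy logits at the given temperature."""
    answers = ["42", "41", "43"]
    weights = [math.exp(s / temperature) for s in (2.0, 1.0, 0.5)]
    return rng.choices(answers, weights=weights)[0]

def empirical_channel(prompt, temperature, n=5000, seed=0):
    """Estimate p(y|x) for one prompt by repeated sampling."""
    rng = random.Random(seed)
    counts = Counter(sample_llm(prompt, temperature, rng) for _ in range(n))
    return {y: c / n for y, c in counts.items()}

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p.values())

p_cold = empirical_channel("x", temperature=0.2)   # near-deterministic channel
p_hot = empirical_channel("x", temperature=2.0)    # noisy channel
print(entropy(p_cold), entropy(p_hot))             # entropy grows with temperature
```

Once p(y|x) is on the table as an object, diversity combining, retransmission, and the other operators act on it exactly as they would on any discrete channel.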

What would settle it

Observing that the proposed router fails to trace the full Pareto frontier or achieve the stated cost and quality improvements when evaluated on a fresh set of tasks or model configurations outside the six tested.

Figures

Figures reproduced from arXiv: 2605.09121 by Hamed Omidvar, Vahideh Akhlaghi.

Figure 1. Quality–cost Pareto frontier on the Ollama-cloud trio, 300-task (…)
Figure 2. Turbo-decoder quality versus refinement iteration on the 14B local configuration (DeepSeek …)
Figure 3. Iterative-decoding threshold across model scales. Same turbo configuration (…)
Figure 4. Forward error correction (FEC) on the Ollama-cloud configuration.
Figure 5. Per-category quality–cost Pareto scatter on the 14B local configuration (…)
Figure 6. Matched-budget evaluation on the 14B local configuration (cf. Snell et al., 2025): mean (…)
Figure 7. Per-technique quality distributions on the 14B local configuration (violin plots with …)
Figure 8. Per-dataset quality versus normalized cost overhead (…)
Figure 9. Per-policy technique pick frequency on the (…)
Figure 10. Technique × task-category heatmap on the 14B local configuration (DeepSeek-R1 14B + Phi-3 14B; judge Gemma-3 12B; n=69 curated tasks). Each cell reports the per-(technique, category) mean quality with paired-Wilcoxon-vs.-baseline significance stars. Different techniques dominate different task types: hybrid retransmission with incremental redundancy on question answering and reasoning, maximal-ratio combi…
Figure 11. Technique × task-category heatmap on the Ollama-cloud trio. Each cell reports the per-(technique, category) mean quality with paired-Wilcoxon-vs.-baseline significance stars. Different techniques dominate different task types: hybrid retransmission with incremental redundancy on question answering and reasoning, maximal-ratio combining on code.
Figure 12. Mean quality by technique on the three hard-benchmark splits (Ollama-cloud trio; one …)
Figure 13. 14B local: quality versus cost. Left: raw cost (US dollars) per task; center: quality versus normalized cost overhead ρ; right: quality gain over the single-call baseline versus ρ. Large markers are per-technique means with 95% bootstrap confidence intervals; small markers are individual task runs. Techniques in the upper-left of the right panel give the best quality return per cost-dollar invested.
Figure 14. 14B local: iterative-refinement diagnostics. (a) Mean quality versus retransmission round (…)
Figure 15. 14B local: ACM-router oracle-gap decomposition. Left: per-policy mean quality with 95% bootstrap confidence intervals (cross-validated rows are out-of-fold). Right: additive decomposition of the realized-router-to-oracle gap into a feature-set information limit, a finite-sample generalization gap, a policy gap (router class), and a realization gap (configuration drift).
Figure 16. 14B local: per-category mean quality on the 69 curated tasks; one panel per task category.
Figure 17. 8B local: quality versus cost (panels as in Fig. 13). The frontier shifts down in absolute (…)
Figure 18. 8B local: iterative-refinement diagnostics. The turbo survivor cohort is noisy and barely (…)
Figure 19. 8B local: ACM-router oracle-gap decomposition (axes as in Fig. 15).
Figure 20. 8B local: per-category mean quality (panels as in Fig. 16).
Figure 21. 3B local: quality versus cost (panels as in Fig. 13).
Figure 22. 3B local: iterative-refinement diagnostics. The turbo survivor cohort is net-descending (…)
Figure 23. 3B local: ACM-router oracle-gap decomposition (axes as in Fig. 15).
Figure 24. 3B local: per-category mean quality (panels as in Fig. 16).
Figure 25. 3B local with cloud judge: quality versus cost (panels as in Fig. 13). Replacing the 12B (…)
Figure 26. 3B local with cloud judge: iterative-refinement diagnostics.
Figure 27. 3B local with cloud judge: ACM-router oracle-gap decomposition (axes as in Fig. 15).
Figure 28. Anthropic + OpenAI cloud: quality versus cost (panels as in Fig. 13). Absolute cost is one (…)
Figure 29. Anthropic + OpenAI cloud: iterative-refinement diagnostics. Both refinement decoders sit (…)
Figure 30. Anthropic + OpenAI cloud: per-category mean quality (panels as in Fig. 16).
Figure 31. Ollama-cloud trio on the 69 curated tasks: quality versus cost (panels as in Fig. 13).
Figure 32. Ollama-cloud trio on the curated split: iterative-refinement diagnostics.
Figure 33. Ollama-cloud trio on the 69 curated tasks: (…)
Figure 34. Ollama-cloud trio on the curated split: per-category mean quality (panels as in Fig. 16).
Figure 35. Ollama-cloud trio on the 300-task hard-benchmark split: quality versus cost (panels as in …)
Figure 36. Per-dataset mean quality on the 300-task hard-benchmark split, one panel per dataset.
Figure 37. Per-dataset quality versus normalized cost overhead (…)
Figure 38. Ollama-cloud trio on the 300-task hard-benchmark split: (…)
Figure 39. Ollama-cloud trio on the 300-task hard-benchmark split: iterative-refinement diagnostics.
Original abstract

Agents built on large language models (LLMs) rely on a range of reliability techniques, including retry, majority voting, and self-consistency, that have been developed in parallel rather than within a common analytical framework. We observe that an LLM sampled at temperature $T$ is a discrete stochastic channel $p(y \mid x)$ in the sense of Shannon's coding theory, and use this identity as the entry point for such a framework grounded in communication theory. Each of these techniques is a special case of one of six classical reliability operators: diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing. Within the framework we give two closed-form results: a noise-variance threshold above which uniform averaging beats quality-weighted averaging, and a contractivity criterion for generator-critic refinement, consistent with a contractive-to-divergent transition we observe between 3B- and 14B-parameter models. We further introduce a cost-aware semantic-nearest-neighbor router whose single Lagrangian knob traverses the quality-cost frontier without retraining. Across six channel configurations spanning local and cloud models on 69 hard tasks, no fixed model-technique-budget choice dominates, motivating per-task allocation. On a 300-item hard split of MMLU, GSM8K, and HumanEval, our router occupies the full empirical Pareto frontier: at matched quality, its normalized cost is ${\approx}56$\% lower than the strongest fixed technique; at matched normalized cost, it improves quality by ${\approx}7$\% ($26$\% over single-shot decoding). These results argue for consolidating these reliability techniques into a single tunable layer informed by channel coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a communication-theoretic framework for LLM agents by treating LLM sampling at temperature T as a discrete stochastic channel p(y|x) in the Shannon sense. It unifies techniques such as retry, majority voting, and self-consistency as special cases of six classical reliability operators (diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing). The framework yields two closed-form results: a noise-variance threshold above which uniform averaging outperforms quality-weighted averaging, and a contractivity criterion for generator-critic refinement. It introduces a cost-aware semantic-nearest-neighbor router controlled by a single Lagrangian multiplier that traverses the quality-cost frontier. Empirical results on 69 hard tasks and a 300-item hard split of MMLU, GSM8K, and HumanEval claim that the router occupies the full Pareto frontier, achieving ~56% lower normalized cost at matched quality and ~7% quality improvement (~26% over single-shot) at matched cost.

Significance. If the derivations and experiments are substantiated, the work offers a unifying analytical foundation that consolidates disparate LLM reliability methods under communication theory, potentially enabling more principled and tunable designs for agents. The closed-form results provide concrete, testable predictions (including scaling behavior between model sizes) and the single-knob router demonstrates a practical mechanism for cost-aware adaptation without retraining. The reported Pareto dominance on standard benchmarks, if reproducible, would have direct implications for efficient deployment of LLM systems where quality-cost tradeoffs are critical.

major comments (2)
  1. [Abstract] The two closed-form results (the noise-variance threshold for averaging methods and the contractivity criterion for generator-critic decoding) are asserted without any mathematical expressions, derivation steps, or explicit conditions; these results are load-bearing for the central claim that the framework is grounded in communication theory and yields analytical insights.
  2. [Abstract] The primary empirical claim that the router occupies the full Pareto frontier on the 300-item hard split of MMLU/GSM8K/HumanEval (with ~56% normalized cost reduction at matched quality and ~7% quality gain at matched cost) is stated without any tables, figures, a definition of normalized cost, construction details for the hard split, the identity of the strongest fixed baselines, or the aggregation method across tasks, rendering the result unassessable from the manuscript.
minor comments (2)
  1. [Abstract] The precise definition of 'normalized cost' (e.g., tokens, latency, or API units) is not provided, which affects interpretation of the reported cost-quality tradeoffs.
  2. [Abstract] While the six reliability operators are named, the explicit mapping from common LLM techniques (such as self-consistency or majority voting) to specific operators is not detailed, reducing the clarity of the unification claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and have revised the abstract to improve clarity and self-containment while preserving its conciseness. The full derivations and experimental details remain in the main text and appendix.

Point-by-point responses
  1. Referee: [Abstract] The two closed-form results (noise-variance threshold for averaging methods and contractivity criterion for generator-critic decoding) are asserted without any mathematical expressions, derivation steps, or explicit conditions, which are load-bearing for the central claim that the framework is grounded in communication theory and yields analytical insights.

    Authors: We agree that the abstract would be strengthened by including the explicit expressions. The noise-variance threshold (above which uniform averaging outperforms quality-weighted averaging) and the contractivity criterion for generator-critic refinement are derived in Sections 3.2 and 4.1, respectively, with the observed contractive-to-divergent transition between 3B- and 14B-parameter models. To address the concern, we have revised the abstract to incorporate the key formulas and conditions in compact form. revision: yes
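The contractivity criterion discussed here has a textbook illustration: model one generator-critic round as a map q_{t+1} = f(q_t) on a quality score; the loop converges only when f is a contraction (|f'| < 1) near its fixed point, and walks away from it otherwise. The linear map and the gains below are hypothetical, not the paper's model.

```python
# Toy refinement iteration: f(q) = q + gain * (target - q), so f'(q) = 1 - gain.
# |1 - gain| < 1 gives a contraction toward `target`; |1 - gain| > 1 diverges.

def refine(q0, gain, target, steps=10):
    """Iterate the refinement map and return the quality trajectory."""
    q, traj = q0, [q0]
    for _ in range(steps):
        q = q + gain * (target - q)
        traj.append(q)
    return traj

good = refine(0.3, gain=0.5, target=0.9)   # |f'| = 0.5: contracts to 0.9
bad = refine(0.3, gain=2.5, target=0.9)    # |f'| = 1.5: oscillates away

print(abs(good[-1] - 0.9) < 1e-2)          # converged
print(abs(bad[-1] - 0.9) > abs(bad[0] - 0.9))  # diverged
```

Under this reading, the observed 3B-to-14B transition corresponds to the effective gain of the critic loop crossing the contraction boundary.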

  2. Referee: [Abstract] The primary empirical claim that the router occupies the full Pareto frontier on the 300-item hard split of MMLU/GSM8K/HumanEval (with ~56% normalized cost reduction at matched quality and ~7% quality gain at matched cost) is stated without any tables, figures, definitions of normalized cost, construction details for the hard split, identity of the strongest fixed baselines, or aggregation method across tasks, rendering the result unassessable and unverifiable from the manuscript.

    Authors: We acknowledge that the abstract summarizes the results without supporting details. The full experimental evidence—including tables and figures for the Pareto frontier, the definition of normalized cost (total token usage relative to single-shot baseline), the hard-split construction (items failed by base models under single-shot decoding), the strongest fixed baselines (majority voting and self-consistency across budgets), and macro-average aggregation—is provided in Section 5 and the appendix. We have revised the abstract to include a brief definition of normalized cost and the hard-split selection criterion for improved verifiability. revision: yes
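The normalized-cost definition given here (total token usage relative to the single-call baseline) is easy to make concrete; the token counts below are invented for illustration.

```python
# Normalized cost overhead rho = technique's total tokens / single-call tokens,
# per the rebuttal's definition. Token counts are made-up example numbers.

def normalized_cost_overhead(technique_tokens, baseline_tokens):
    return technique_tokens / baseline_tokens

runs = {
    "single_call": 800,        # baseline: one generation
    "majority_vote_5": 4000,   # five parallel generations
    "generator_critic": 2600,  # generate + critique + regenerate
}
rho = {name: normalized_cost_overhead(t, runs["single_call"])
       for name, t in runs.items()}
print(rho)  # {'single_call': 1.0, 'majority_vote_5': 5.0, 'generator_critic': 3.25}
```

On this scale rho = 1.0 is the single-shot baseline by construction, so "matched normalized cost" means comparing techniques at equal rho.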

Circularity Check

0 steps flagged

No circularity detected in claimed derivation chain

full rationale

The abstract frames LLM sampling as a Shannon channel and maps existing techniques to classical reliability operators, then states two closed-form results and a Lagrangian-parameter router. No equations, self-citations, or derivation steps are supplied in the available text, so none can be shown to reduce to their inputs by construction. The Pareto-frontier claim is presented as an empirical outcome on a 300-item split rather than a mathematical identity or fitted prediction renamed as a result. The framework therefore remains self-contained against external communication-theory benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on treating LLM sampling as a Shannon channel and introducing a tunable router; the limited information available from the abstract prevents a full enumeration of parameters and entities.

free parameters (1)
  • Lagrangian multiplier
    Single adjustable knob in the semantic-nearest-neighbor router that traverses the quality-cost frontier.
axioms (1)
  • domain assumption An LLM sampled at temperature T is a discrete stochastic channel p(y | x)
    This identity is the explicit entry point for mapping reliability techniques to classical communication operators.
invented entities (1)
  • cost-aware semantic-nearest-neighbor router (no independent evidence)
    purpose: Dynamically selects and routes among reliability techniques using one Lagrangian parameter
    New component introduced to achieve the reported Pareto improvements without retraining.

pith-pipeline@v0.9.0 · 5593 in / 1379 out tokens · 85627 ms · 2026-05-12T02:44:40.004469+00:00 · methodology


Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 12 internal anchors

  1. [1] Delta-Sigma Data Converters: Theory, Design, and Simulation. 1996.
  2. [2] Shannon, Claude E. Bell System Technical Journal.
  3. [3] Proakis, John G. and Salehi, Masoud.
  4. [4] Tse, David and Viswanath, Pramod.
  5. [5] Berrou, Claude and Glavieux, Alain and Thitimajshima, Punya. Proceedings of ICC '93, IEEE International Conference on Communications, 1993.
  6. [6] Gallager, Robert G. IRE Transactions on Information Theory.
  7. [7] The Effect upon Channel Capacity in Wireless Communications of Perfect and Imperfect Knowledge of the Channel.
  8. [8] Yoo, Taesang and Goldsmith, Andrea. IEEE Transactions on Information Theory.
  9. [9] Gao, Feifei and Nallanathan, Arumugam and Wang, Jianguo. IEEE Transactions on Wireless Communications.
  10. [10] Simon, Marvin K. and Alouini, Mohamed-Slim. Wiley-Interscience.
  11. [11] Taricco, Giorgio and Biglieri, Ezio. IEEE International Symposium on Information Theory (ISIT).
  12. [12] ten Brink, Stephan. IEEE Transactions on Communications.
  13. [13] Convergence Behavior of Iteratively Decoded Parallel Concatenated Codes. IEEE Transactions on Communications, 2001.
  14. [14] Richardson, Thomas J. and Urbanke, R. The Capacity of Low-Density Parity-Check Codes Under Message-Passing Decoding.
  15. [15] Hirsch, Morris W. and Smale, Stephen and Devaney, Robert L.
  16. [16] Tomiuk, Bruno R. and Beaulieu, Norman C. and Abu-Dayya, Adnan A. IEEE Transactions on Communications.
  17. [17] Hagenauer, Joachim and Offer, Elke and Papke, Lutz. IEEE Transactions on Information Theory.
  18. [18] Luby, Michael. Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (FOCS).
  19. [19] Shokrollahi, Amin. IEEE Transactions on Information Theory.
  20. [20] Vogt, Joerg and Finger, Adolf. Electronics Letters.
  21. [21] Caire, Giuseppe and Tuninetti, Daniela. IEEE Transactions on Information Theory.
  22. [22] Goldsmith, Andrea J. and Chua, Soon-Ghee. IEEE Transactions on Communications.
  23. [23] Cover, Thomas M. and Thomas, Joy A.
  24. [24] Alamouti, Siavash M. IEEE Journal on Selected Areas in Communications.
  25. [25] Xu, Deng-Ao and others. arXiv preprint arXiv:2406.19664.
  26. [26] arXiv preprint arXiv:2303.08774.
  27. [27] Anthropic Technical Report.
  28. [28] Touvron, Hugo and Martin, Louis and Stone, Kevin and others. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  29. [29] Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny. Advances in Neural Information Processing Systems (NeurIPS).
  30. [30] Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik. Advances in Neural Information Processing Systems (NeurIPS).
  31. [31] Dhuliawala, Shehzaad and Komeili, Mojtaba and Xu, Jing and Raileanu, Roberta and Li, Xian and Celikyilmaz, Asli and Weston, Jason. Findings of the Association for Computational Linguistics: ACL 2024.
  32. [32] Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc and Chi, Ed and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny. Proceedings of the 11th International Conference on Learning Representations (ICLR).
  33. [33] Taubenfeld, Amir and Sheffer, Tom and Ofek, Eran and Feder, Amir and Goldstein, Ariel and Gekhman, Zorik and Yona, Gal. Confidence Improves Self-Consistency in …
  34. [34] Feng, Austin and Alonso, Marius and Odonnat, Ambroise. NeurIPS 2025 Workshop on Efficient Reasoning.
  35. [35] Cordero-Encinar, Paula and Duncan, Andrew B. arXiv preprint arXiv:2510.17472.
  36. [36] Chen, Yuheng and others. Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning. arXiv preprint arXiv:2603.08999.
  37. [37] Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  38. [38] Wang, Junlin and Wang, Jue and Athiwaratkun, Ben and Zhang, Ce and Zou, James. Proceedings of the 13th International Conference on Learning Representations (ICLR).
  39. [39] Li, Wenzhe and Lin, Yong and Xia, Mengzhou and Jin, Chi. arXiv preprint arXiv:2502.00674.
  40. [40] Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  41. [41] Chen, Zhijun and Lu, Xiaodong and Li, Jingzheng and Chen, Pengpeng and Li, Zhuoran and Sun, Kai and Luo, Yuankai and Mao, Qianren and Li, Ming and Xiao, Likang and Yang, Dingqi and Huang, Xiao and Ban, Yikun and Sun, Hailong and Yu, Philip S. Harnessing Multiple Large Language Models: A Survey on LLM Ensemble. arXiv preprint arXiv:2502.18036.
  42. [42] Wen, Jianyu and Wei, Yang and Yu, Xiongxi and Xiao, Changxuan and Zeng, Ke. arXiv preprint arXiv:2601.16596.
  43. [43] Chen, Lingjiao and Zaharia, Matei and Zou, James. Transactions on Machine Learning Research.
  44. [44] Ong, Isaac and Almahairi, Amjad and Wu, Vincent and Chiang, Wei-Lin and Wu, Tianhao and Gonzalez, Joseph E. and Kadous, M. Waleed and Stoica, Ion. Proceedings of the 13th International Conference on Learning Representations (ICLR).
  45. [45] Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling.
  46. [46] Lu, Keming and Yuan, Hongyi and Lin, Runji and Lin, Junyang and Yuan, Zheng and Zhou, Chang and Zhou, Jingren. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  47. [47] Gupta, Shrey and others. A Unified Approach to Routing and Cascading for LLMs. arXiv preprint arXiv:2410.10347, 2024.
  48. [48] Yang, Hongbin and others. arXiv preprint arXiv:2602.09902.
  49. [49] Martinez, Alejandro and others. arXiv preprint arXiv:2602.21231.
  50. [50] Xu, Haibo and others. TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks. arXiv preprint arXiv:2601.10245.
  51. [51] Madaan, Aman and Tandon, Niket and Gupta, Prakhar and Hallinan, Skyler and Gao, Luyu and Wiegreffe, Sarah and Alon, Uri and Dziri, Nouha and Prabhumoye, Shrimai and Yang, Yiming and Gupta, Shashank and Majumder, Bodhisattwa Prasad and Hermann, Katherine and Welleck, Sean and Yazdanbakhsh, Amir and Clark, Peter. Advances in Neural Information Processing Systems (NeurIPS).
  52. [52] Shinn, Noah and Cassano, Federico and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu. Advances in Neural Information Processing Systems (NeurIPS).
  53. [53] Kumar, Abhishek and others. arXiv preprint arXiv:2511.10621.
  54. [54] Chen, Ruoyu and others. arXiv preprint arXiv:2502.05605.
  55. [55] Sun, Zhixiang and others. Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs. arXiv preprint arXiv:2509.00084.
  56. [56] Kumar, Aviral and others. arXiv preprint arXiv:2409.12917.
  57. [57] Wang, Lingxiao and others. arXiv preprint arXiv:2512.20845.
  58. [58] Du, Yilun and Li, Shuang and Torralba, Antonio and Tenenbaum, Joshua B. and Mordatch, Igor. Proceedings of the 41st International Conference on Machine Learning (ICML).
  59. [59] Chen, Xiaojun and others. arXiv preprint arXiv:2601.04742.
  60. [60] Ning, Yucheng and Lin, Xixun and Fang, Fang and Cao, Yanan. Frontiers of Computer Science.
  61. [61] Sun, Chunhui and others. arXiv preprint arXiv:2511.07784.
  62. [62] Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral. Proceedings of the 13th International Conference on Learning Representations (ICLR).
  63. [63] Wu, Zhenyu and others. arXiv preprint arXiv:2502.18080.
  64. [64]
  65. [65] Roberts, Nicholas and Cho, Sungjun and Gao, Zhiqi and Huang, Tzu-Heng and Wu, Albert and Orlanski, Gabriel and Trost, Avi and Buchanan, Kelly and Albarghouthi, Aws and Sala, Frederic. arXiv preprint arXiv:2604.01411.

  66. [66]

    arXiv preprint arXiv:2506.12928 , year =

    Zhu, King and Li, Hanhao and Wu, Siwei and Xing, Tianshun and Ma, Dehua and Tang, Xiangru and Liu, Minghao and Yang, Jian and Liu, Jiaheng and Jiang, Yuchen Eleanor and Zhang, Changwang and Lin, Chenghua and Wang, Jun and Zhang, Ge and Zhou, Wangchunshu , title =. arXiv preprint arXiv:2506.12928 , year =

  67. [67]

    Proceedings of the 12th International Conference on Learning Representations (ICLR) , year =

    Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl , title =. Proceedings of the 12th International Conference on Learning Representations (ICLR) , year =

  68. [68]

    Training Verifiers to Solve Math Word Problems

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , title =. arXiv preprint arXiv:2110.14168 , year =

  69. [69]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , year =

    Weng, Yixuan and Zhu, Minjun and Xia, Fei and Li, Bin and He, Shizhu and Liu, Kang and Zhao, Jun , title =. Findings of the Association for Computational Linguistics: EMNLP 2023 , year =

  70. [70]

    Proceedings of the 13th International Conference on Learning Representations (ICLR) , year =

    Chow, Yinlam and Tennenholtz, Guy and Gur, Izzeddin and Zhuang, Vincent and Dai, Bo and Thiagarajan, Sridhar and Boutilier, Craig and Agarwal, Rishabh and Kumar, Aviral and Faust, Aleksandra , title =. Proceedings of the 13th International Conference on Learning Representations (ICLR) , year =

  71. [71]

    Technical report: Enhancing llm reasoning with reward-guided tree search

    Guan, Yiyang and others , title =. arXiv preprint arXiv:2411.11694 , year =

  72. [72]

    and Zhang, Hao and Gonzalez, Joseph E

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =

  73. [73]

    A Survey on LLM-as-a-Judge

    Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and Wang, Saizhuo and Zhang, Kun and Wang, Yuanzhuo and Gao, Wen and Ni, Lionel and Guo, Jian , title =. arXiv preprint arXiv:2411.15594 , year =

  74. [74]

    The Innovation , pages=

    A survey on LLM-as-a-Judge , author=. The Innovation , pages=. 2026 , publisher=

  75. [75]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Li, Sipeng and others , title =. arXiv preprint arXiv:2412.05579 , year =

  76. [76]

    arXiv preprint arXiv:2412.12509 , year =

    Schroeder, Kayla and Wood-Doughty, Zach , title =. arXiv preprint arXiv:2412.12509 , year =

  77. [77]

    arXiv preprint arXiv:2506.13639 , year =

    Khattar, Dhruv and others , title =. arXiv preprint arXiv:2506.13639 , year =

  78. [78]

    Large language models hallucination: A comprehensive survey.arXiv preprint arXiv:2510.06265, 2025

    Rawte, Vipula and others , title =. arXiv preprint arXiv:2510.06265 , year =

  79. [79]

    arXiv preprint arXiv:2510.24476 , year =

    Ji, Ziwei and others , title =. arXiv preprint arXiv:2510.24476 , year =

  80. [80]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi , title =. arXiv preprint arXiv:2308.08155 , year =
