
arxiv: 2512.23213 · v3 · submitted 2025-12-29 · 💻 cs.CL · cs.AI

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Pith reviewed 2026-05-16 19:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM ensemble · peer review · LLM-as-a-Judge · unsupervised selection · response aggregation · multi-model evaluation · truth inference

The pith

An unsupervised peer-review process lets multiple LLMs score and select the best response from candidates for any query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM-PeerReview, an ensemble method that has several LLMs evaluate each other's candidate answers for a given query. It applies LLM-as-a-Judge scoring, then aggregates the scores either by averaging or through a graphical truth-inference model, and outputs the single highest-scoring response. The approach needs no labeled data or supervision and stays interpretable at every step. Tests across four datasets show the two variants beat the prior Smoothie-Global ensemble by 6.9 and 7.3 percentage points on factual QA, math reasoning, and instruction-following tasks. The framework treats the collection of available models as a self-contained review panel that surfaces the strongest output.

Core claim

LLM-PeerReview runs in three stages: scoring, where multiple LLMs judge each candidate response; reasoning, where scores are combined by simple averaging or a principled graphical model; and selection, where the response with the highest final score is chosen as the ensemble output. This unsupervised pipeline produces stronger results than previous methods on factual recall, mathematical reasoning, and instruction-following benchmarks.
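
The averaging variant of the three stages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_by_average`, the 1-5 scale, and the score values are all hypothetical, and the judge prompts that would produce the scores are omitted.

```python
from statistics import mean

def select_by_average(scores):
    """scores[judge][candidate] -> (index of best candidate, per-candidate mean scores)."""
    n_candidates = len(scores[0])
    # Reasoning stage (averaging variant): mean score each candidate received.
    final = [mean(row[c] for row in scores) for c in range(n_candidates)]
    # Selection stage: the highest final score wins (first index on ties).
    best = max(range(n_candidates), key=final.__getitem__)
    return best, final

# Scoring stage output (made-up numbers): 3 judges (rows) rate 3 candidates (columns).
scores = [
    [4.0, 2.0, 5.0],
    [3.0, 2.0, 4.0],
    [5.0, 1.0, 4.0],
]

best, final = select_by_average(scores)
print(best, final)
```

Here the third candidate is selected because its mean score is the highest, even though one judge preferred the first candidate.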

What carries the argument

A three-stage peer-review pipeline that reuses the LLMs that generated the candidate responses as judges of every candidate, then aggregates their scores to pick the single best output.

If this is right

  • The two variants (averaging and graphical-model) improve over Smoothie-Global by 6.9 and 7.3 percentage points across the four datasets.
  • Gains appear consistently across factual QA, math reasoning, and instruction-following tasks.
  • No training data or external supervision is required, so the method can be applied immediately to any new set of models and queries.
  • The explicit scoring stage produces traceable reasons for choosing one response over the others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scoring-and-selection loop could be repeated on refined drafts to create an iterative self-improvement cycle without external reward models.
  • When judge models are drawn from different families, the aggregation step may capture complementary strengths that single-model prompting misses.
  • The approach could serve as a lightweight post-generation filter in production systems where only one final answer is returned to the user.
  • If the graphical model variant proves more robust to noisy judges, it could guide future work on modeling judge reliability explicitly.

Load-bearing premise

That LLM-as-a-Judge scores are reliable and unbiased enough for the aggregation step to correctly identify the best response without any supervision or ground truth.

What would settle it

Measure whether the response chosen by the peer-review scores actually matches the ground-truth answer more often than the single best model or a random candidate on a dataset with verifiable correct answers.
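
One way to run this check is sketched below, assuming a binary correctness table is available for a dataset with verifiable answers. The function name `selection_report`, the table layout, and the example data are hypothetical; the oracle row is an added upper bound, not something the paper is known to report.

```python
def selection_report(correct, chosen):
    """correct[q][m] is 1 if model m answers query q correctly; chosen[q] is the selected model."""
    n_queries = len(correct)
    n_models = len(correct[0])
    # Accuracy of the response actually picked by the peer-review scores.
    ensemble = sum(correct[q][chosen[q]] for q in range(n_queries)) / n_queries
    # Accuracy of the best single model when used for every query.
    best_single = max(sum(row[m] for row in correct) / n_queries for m in range(n_models))
    # Expected accuracy of picking a candidate uniformly at random.
    random_pick = sum(sum(row) for row in correct) / (n_queries * n_models)
    # Upper bound: a selector that always finds a correct candidate when one exists.
    oracle = sum(max(row) for row in correct) / n_queries
    return {"ensemble": ensemble, "best_single": best_single,
            "random": random_pick, "oracle": oracle}

# Made-up data: 4 queries, 3 models.
correct = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
chosen = [0, 1, 1, 0]

report = selection_report(correct, chosen)
print(report)
```

The premise would be supported if `ensemble` sits clearly above both `best_single` and `random` on real data, and the gap to `oracle` shows how much headroom the selector leaves.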

Figures

Figures reproduced from arXiv: 2512.23213 by Bangjie Qin, Hailong Sun, Hao Wu, Jingzheng Li, Jinhuan Song, Junhang Cheng, Kai Sun, Qianren Mao, Xiangyang Ji, Yikun Ban, Zeyu Ji, Zhijun Chen, Zhuoran Li, Zhu Sun, Zizhe Wang.

Figure 1
Figure 1: The proposed LLM-PeerReview contains three steps: (1) Scoring: For a given query, after each LLM independently generates a response (analogous to a submitted academic paper), LLM-PeerReview applies the LLM-as-a-Judge technique (and the proposed flipped-triple scoring trick), treating each model as a reviewer to assign scores to all candidate responses; (2) Reasoning: LLM-PeerReview then uses a truth infere…
Figure 2
Figure 2: Probabilistic graphical representation. … annotator-specific transition matrix $\Pi^{(j')}$ to model the probability that an LLM confuses one score category for another, capturing its scoring tendencies and potential biases: $p\big(y^{(i,j;j')} = n \mid t^{(i,j)} = m;\, \Pi^{(j')}\big) = \pi^{(j')}_{mn}$ (3), where $m, n \in \{1, \dots, K\}$ and $K$ denotes the number of categories (i.e., the number of score levels). Objective and optimizatio…
Figure 3
Figure 3: LLM performances (bottom: AlpacaEval). 3. Assessment: Experiment Setup. We provide the implementation of our LLM-PeerReview and all baselines for the used datasets: ‡ Code. Datasets and evaluation. We evaluate four widely-used datasets, grouped into three categories. (1) Factual Recall: TriviaQA (Joshi et al., 2017; Guha et al., 2024) evaluates the accuracy of model responses to factual questions across var…
Figure 4
Figure 4: Left: The transition matrix of each LLM estimated by LLM-PeerReview-W. Right: Correlation between matrix diagonal information of each LLM and its performance as a single judge (corresponding to “our variants” in …
Figure 5
Figure 5: Performance of various variants across different scoring levels.
Original abstract

We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; for reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results across four datasets show that the two variants of the proposed approach outperform the advanced model Smoothie-Global by 6.9 and 7.3 percentage points across diverse task types, including factual recall QA, math reasoning, and instruction following.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes LLM-PeerReview, an unsupervised three-stage ensemble method for selecting the best response among multiple LLM-generated candidates: scoring via LLM-as-a-Judge, aggregation through either simple averaging or a graphical-model truth-inference procedure, and final selection of the highest-scoring output. The authors claim that both variants outperform the Smoothie-Global baseline by 6.9 and 7.3 percentage points on four datasets spanning factual QA, math reasoning, and instruction following.

Significance. If the reported gains are robust, the method would supply a transparent, training-free mechanism for harnessing complementary LLM strengths without supervision, offering a practical alternative to existing ensemble techniques in NLP.

major comments (3)
  1. [Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.
  2. [Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.
  3. [Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.
minor comments (2)
  1. [Method] The description of the aggregation procedures would be clearer if accompanied by explicit equations or pseudocode for both the averaging and graphical-model methods.
  2. [Related Work] Consider expanding the related-work discussion to include recent LLM-as-a-Judge benchmarks and other unsupervised ensembling baselines for better positioning of the contribution.
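
For minor comment 1, the aggregation and selection steps might be written as follows. This is a sketch under stated assumptions rather than the paper's own formulation: it reads $y^{(i,j;j')}$ from the Figure 2 excerpt as the score judge $j'$ gives candidate $j$ for query $i$, and it introduces $J$ (number of judges), $\alpha_m$ (class prior), and $s^{(i,j)}$ (final score) as new symbols. The posterior assumes a Dawid-Skene-style model in which judges score independently given the latent score.

```latex
% Averaging variant: mean judge score per candidate.
\bar{s}^{(i,j)} = \frac{1}{J} \sum_{j'=1}^{J} y^{(i,j;j')}

% Truth-inference variant (assumed Dawid--Skene form): posterior over the latent score,
% using the transition-matrix entries \pi^{(j')}_{mn} from Eq. (3).
p\big(t^{(i,j)} = m \mid y^{(i,j;1)}, \dots, y^{(i,j;J)}\big)
  \;\propto\; \alpha_m \prod_{j'=1}^{J} \pi^{(j')}_{m\, y^{(i,j;j')}}

% Selection: return the candidate with the highest final score.
\hat{\jmath}^{(i)} = \arg\max_{j} \; s^{(i,j)}
```

Here $s^{(i,j)}$ is $\bar{s}^{(i,j)}$ for the averaging variant. For the truth-inference variant it would be some summary of the posterior, such as its expectation; the excerpt does not say which summary the paper uses.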

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating the revisions we will make to improve the clarity and completeness of the manuscript.

point-by-point responses
  1. Referee: [Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.

    Authors: We agree that these experimental details are essential for assessing the robustness of the reported gains. In the revised manuscript we will add statistical significance tests (paired t-tests across queries), report standard deviations over multiple random seeds, explicitly state the sizes of all four datasets, the number of candidate responses per query, and provide the exact prompting templates and decoding parameters (temperature, top-p, etc.) used for both generation and judging. revision: yes

  2. Referee: [Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.

    Authors: We acknowledge that direct validation of the LLM judges is valuable. Because the method is designed to remain fully unsupervised, we cannot replace the judges with humans at scale; however, we will add (i) a discussion of known LLM-as-Judge biases (length, position, style) with citations to recent studies, (ii) Pearson/Spearman correlations between LLM scores and human ratings on a randomly sampled subset of 200 examples from one dataset, and (iii) an explicit limitations paragraph noting the absence of a full human ablation. These additions will be placed in a new subsection under Scoring Stage. revision: partial

  3. Referee: [Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.

    Authors: We agree that the graphical-model component requires a more precise description. In the revision we will expand the Reasoning Stage section to include: the exact generative model (plate notation and factor graph), the EM-based parameter estimation procedure with initialization details, the closed-form update equations, and a sensitivity analysis varying the number of iterations and prior strength. We will also report the performance delta between the averaging and truth-inference variants on each dataset to quantify the contribution of this stage. revision: yes
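
The EM procedure promised in response 3 could look roughly like the sketch below. It assumes the classic Dawid-Skene model (reference [14]) with one confusion matrix per judge, which matches the transition-matrix description in the Figure 2 excerpt but may differ from the authors' exact model. The function name `dawid_skene`, the soft-vote initialization, the `smoothing` constant, and the example labels are all illustrative choices.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """labels[item, judge] in {0..K-1} -> (posterior[item, K], confusion[judge, K, K])."""
    # One-hot encode: onehot[n, j, k] = 1 if judge j gave item n the score level k.
    onehot = np.eye(n_classes)[labels]
    # Initialise posteriors with per-item vote fractions (a soft majority vote).
    post = onehot.mean(axis=1)
    for _ in range(n_iter):
        # M-step: class prior and per-judge confusion matrices conf[j, true, observed].
        prior = post.mean(axis=0)
        conf = np.einsum("nm,njk->jmk", post, onehot) + smoothing
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over each item's latent true score, computed in log space.
        log_post = np.log(prior + 1e-12) + np.einsum("njk,jmk->nm", onehot, np.log(conf))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf

# Made-up example: 4 candidate responses scored by 3 judges on 5 levels (0-based).
labels = np.array([
    [4, 4, 3],
    [1, 2, 1],
    [3, 3, 3],
    [0, 1, 0],
])
post, conf = dawid_skene(labels, n_classes=5)
# One possible final score: the posterior-expected level on a 1-5 scale (an assumption).
expected = post @ np.arange(1, 6)
best = int(np.argmax(expected))
print(best, expected.round(2))
```

A sensitivity analysis of the kind the referee requests would vary `n_iter` and `smoothing` and check whether the selected response changes.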

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents LLM-PeerReview as a three-stage empirical procedure: LLM-as-a-Judge scoring of candidate responses, aggregation via averaging or graphical-model truth inference, and selection of the highest-scoring output. Performance is reported via direct empirical comparison to the external baseline Smoothie-Global on four datasets, with no equations, fitted parameters, or self-citations that reduce the central claim to a redefinition or renaming of its own inputs. The method is self-contained as an unsupervised ensemble technique whose validity rests on external benchmark results rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven reliability of LLM-as-Judge scoring and the effectiveness of the chosen aggregation methods; no free parameters or invented entities are mentioned, but the domain assumption that peer-review style scoring works without supervision is load-bearing.

axioms (1)
  • domain assumption LLM-as-a-Judge produces reliable quality scores for responses
    The scoring stage depends entirely on this assumption holding for the subsequent selection to be meaningful.

pith-pipeline@v0.9.0 · 5540 in / 1211 out tokens · 26466 ms · 2026-05-16T19:54:15.258144+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Policy Improvement Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  2. [2]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv:2303.08774, 2023

  3. [3]

    Bishop, C. M. and Nasrabadi, N. M. Pattern recognition and machine learning, volume 4. Springer, 2006

  4. [4]

    Adaptation with self-evaluation to improve selective prediction in LLMs

    Chen, J., Yoon, J., Ebrahimi, S., Arik, S. O., Pfister, T., and Jha, S. Adaptation with self-evaluation to improve selective prediction in llms. arXiv preprint arXiv:2310.11689, 2023a

  5. [5]

    An automatic and cost-efficient peer-review framework for language generation evaluation

    Chen, J., Su, W., Chu, Z., Li, H., Ai, Q., Liu, Y., Zhang, M., and Ma, S. An automatic and cost-efficient peer-review framework for language generation evaluation. arXiv preprint arXiv:2410.12265, 2024

  6. [6]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a

  7. [7]

    Adversarial learning from crowds

    Chen, P., Sun, H., Yang, Y., and Chen, Z. Adversarial learning from crowds. In AAAI, 2022

  8. [8]

    Structured probabilistic end-to-end learning from crowds

    Chen, Z., Wang, H., Sun, H., Chen, P., Han, T., Liu, X., and Yang, J. Structured probabilistic end-to-end learning from crowds. In IJCAI, 2021b

  9. [9]

    Neural-hidden-crf: A robust weakly-supervised sequence labeler

    Chen, Z., Sun, H., Zhang, W., Xu, C., Mao, Q., and Chen, P. Neural-hidden-crf: A robust weakly-supervised sequence labeler. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 274–285, 2023b

  10. [10]

    Chen, Z., Li, J., Chen, P., Li, Z., Sun, K., Luo, Y., Mao, Q., Yang, D., Sun, H., and Yu, P. S. Harnessing multiple large language models: A survey on llm ensemble. arXiv preprint arXiv:2502.18036, 2025

  11. [11]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023

  12. [12]

    Pre: A peer review based large language model evaluator

    Chu, Z., Ai, Q., Tu, Y., Li, H., and Liu, Y. Pre: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641, 2024

  13. [13]

    Scaling instruction-finetuned language models

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  14. [14]

    Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979

  15. [15]

    Aligning model evaluations with human preferences: Mitigating token count bias in language model assessments

    Daynauth, R. and Mars, J. Aligning model evaluations with human preferences: Mitigating token count bias in language model assessments. arXiv preprint arXiv:2407.12847, 2024

  16. [16]

    Maximum likelihood from incomplete data via the em algorithm

    Dempster, A. Maximum likelihood from incomplete data via the em algorithm. Journal of Royal Statistical Society, 39:1–22, 1977

  17. [17]

    A survey on ensemble learning

    Dong, X., Yu, Z., Cao, W., Shi, Y., and Ma, Q. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020

  18. [18]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P. S., and Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36:30039–30069, 2023

  19. [19]

    An introduction to latent variable models

    Everett, B. An introduction to latent variable models. Springer Science & Business Media, 2013

  20. [20]

    GPTScore: Evaluate as You Desire

    Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023

  21. [21]

    Deep learning, 2016

    Goodfellow, I. Deep learning, 2016

  22. [22]

    A Survey on LLM-as-a-Judge

    Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

  23. [23]

    Smoothie: Label free language model routing

    Guha, N., Chen, M., Chow, T., Khare, I., and Re, C. Smoothie: Label free language model routing. Advances in Neural Information Processing Systems, 37:127645–127672, 2024

  24. [24]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  25. [25]

    Language model preference evaluation with multiple weak evaluators

    Hu, Z., Zhang, J., Xiong, Z., Ratner, A., Xiong, H., and Krishna, R. Language model preference evaluation with multiple weak evaluators. arXiv preprint arXiv:2410.12869, 2024

  26. [26]

    Ensemble learning for heterogeneous large language models with deep parallel collaboration

    Huang, Y., Feng, X., Li, B., Xiang, Y., Wang, H., Liu, T., and Qin, B. Ensemble learning for heterogeneous large language models with deep parallel collaboration. In NeurIPS, 2024

  27. [27]

    Jiang, D., Ren, X., and Lin, B. Y. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023

  28. [28]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  29. [29]

    Prometheus 2: An open source language model specialized in evaluating other language models, 2024

    Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024

  30. [30]

    Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task

    Kotonya, N., Krishnasamy, S., Tetreault, J., and Jaimes, A. Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task. arXiv preprint arXiv:2311.00686, 2023

  31. [31]

    From generation to judgment: Opportunities and challenges of llm-as-a-judge

    Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. arXiv preprint arXiv:2411.16594, 2024a

  32. [32]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024b

  33. [33]

    More agents is all you need

    Li, J., Zhang, Q., Yu, Y., Fu, Q., and Ye, D. More agents is all you need. arXiv preprint arXiv:2402.05120, 2024c

  34. [34]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a

  35. [35]

    Cool-fusion: Fuse large language models without training

    Liu, C., Quan, X., Pan, Y., Lin, L., Wu, W., and Chen, X. Cool-fusion: Fuse large language models without training. arXiv preprint arXiv:2407.19807, 2024b

  36. [36]

    Blending is all you need: Cheaper, better alternative to trillion-parameters llm

    Lu, X., Liu, Z., Liusie, A., Raina, V., Mudupalli, V., Zhang, Y., and Beauchamp, W. Blending is all you need: Cheaper, better alternative to trillion-parameters llm. arXiv preprint arXiv:2401.02994, 2024

  37. [37]

    Urg: A unified ranking and generation method for ensembling language models

    Lv, B., Tang, C., Zhang, Y., Liu, X., Luo, P., and Yu, Y. Urg: A unified ranking and generation method for ensembling language models. In Findings of the ACL, 2024

  38. [38]

    RouteLLM: Learning to Route LLMs with Preference Data

    Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024

  39. [39]

    A multi-llm debiasing framework

    Owens, D. M., Rossi, R. A., Kim, S., Yu, T., Dernoncourt, F., Chen, X., Zhang, R., Gu, J., Deilamsalehy, H., and Lipka, N. A multi-llm debiasing framework. arXiv preprint arXiv:2409.13884, 2024

  40. [40]

    Ensembling large language models with process reward-guided tree search for better complex reasoning

    Park, S., Liu, X., Gong, Y., and Choi, E. Ensembling large language models with process reward-guided tree search for better complex reasoning. arXiv preprint arXiv:2412.15797, 2024

  41. [41]

    Large language model routing with benchmark datasets

    Shnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y., Solomon, J., Thompson, N., and Yurochkin, M. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023

  42. [42]

    Getting more out of mixture of language model reasoning experts

    Si, C., Shi, W., Zhao, C., Zettlemoyer, L., and Boyd-Graber, J. Getting more out of mixture of language model reasoning experts. In Findings of EMNLP, 2023

  43. [43]

    Harnessing the power of multiple minds: Lessons learned from llm routing

    Srivatsa, K., Maurya, K. K., and Kochmar, E. Harnessing the power of multiple minds: Lessons learned from llm routing. arXiv preprint arXiv:2405.00467, 2024

  44. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  45. [45]

    Llm-topla: Efficient llm ensemble by maximising diversity

    Tekin, S., Ilhan, F., Huang, T., Hu, S., and Liu, L. Llm-topla: Efficient llm ensemble by maximising diversity. In Findings of EMNLP, 2024

  46. [46]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  47. [47]

    Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    Verga, P., Hofstatter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  48. [48]

    Koala: An index for quantifying overlaps with pre-training corpora

    Vu, T.-T., He, X., Haffari, G., and Shareghi, E. Koala: An index for quantifying overlaps with pre-training corpora. arXiv preprint arXiv:2303.14770, 2023

  49. [49]

    Large Language Models are not Fair Evaluators

    Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., and Sui, Z. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

  50. [50]

    Improving llm-as-a-judge inference with the judgment distribution

    Wang, V., Zhang, M. J., and Choi, E. Improving llm-as-a-judge inference with the judgment distribution. arXiv preprint arXiv:2503.03064, 2025

  51. [51]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  52. [52]

    Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge

    Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv preprint arXiv:2407.19594, 2024

  53. [53]

    Bridging the gap between different vocabularies for llm ensemble

    Xu, Y., Lu, J., and Zhang, J. Bridging the gap between different vocabularies for llm ensemble. In NAACL, pp. 7133–7145, 2024

  54. [54]

    Hit the sweet spot! span-level ensemble for large language models

    Xu, Y., Chen, J., Wu, J., and Zhang, J. Hit the sweet spot! span-level ensemble for large language models. In COLING, pp. 8314–8325, 2025

  55. [55]

    Breaking the ceiling of the llm community by treating token generation as a classification for ensembling

    Yu, Y.-C., Kuo, C.-C., Ye, Z., Chang, Y.-C., and Li, Y.-S. Breaking the ceiling of the llm community by treating token generation as a classification for ensembling. arXiv preprint arXiv:2406.12585, 2024

  56. [56]

    Wrench: A comprehensive benchmark for weak supervision

    Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., and Ratner, A. Wrench: A comprehensive benchmark for weak supervision. arXiv preprint arXiv:2109.11377, 2021

  57. [57]

    Wider and Deeper LLM Networks are Fairer LLM Evaluators

    Zhang, X., Yu, B., Yu, H., Lv, Y., Liu, T., Huang, F., Xu, H., and Li, Y. Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862, 2023

  58. [58]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  59. [59]

    Truth inference in crowdsourcing: Is the problem solved?

    Zheng, Y., Li, G., Li, Y., Shan, C., and Cheng, R. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541–552, 2017

  60. [60]

    Lima: Less is more for alignment

    Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023