Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen; Zeyu Ji; Qianren Mao; Hao Wu; Jinhuan Song; Junhang Cheng; Bangjie Qin; Zhuoran Li; Jingzheng Li; Kai Sun; Zizhe Wang; Yikun Ban; Zhu Sun; Xiangyang Ji; Hailong Sun

arxiv: 2512.23213 · v3 · submitted 2025-12-29 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 19:54 UTC · model grok-4.3

keywords: LLM ensemble · peer review · LLM-as-a-Judge · unsupervised selection · response aggregation · multi-model evaluation · truth inference

The pith

An unsupervised peer-review process lets multiple LLMs score and select the best response from candidates for any query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM-PeerReview, an ensemble method that has several LLMs evaluate each other's candidate answers for a given query. It applies LLM-as-a-Judge scoring, then aggregates the scores either by averaging or through a graphical truth-inference model, and outputs the single highest-scoring response. The approach needs no labeled data or supervision and stays interpretable at every step. Tests across four datasets show the two variants beat the prior Smoothie-Global ensemble by 6.9 and 7.3 percentage points on factual QA, math reasoning, and instruction-following tasks. The framework treats the collection of available models as a self-contained review panel that surfaces the strongest output.

Core claim

LLM-PeerReview runs in three stages: scoring, where multiple LLMs judge each candidate response; reasoning, where scores are combined by simple averaging or a principled graphical model; and selection, where the response with the highest final score is chosen as the ensemble output. This unsupervised pipeline produces stronger results than previous methods on factual recall, mathematical reasoning, and instruction-following benchmarks.
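
The averaging variant of this pipeline can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `select_by_average`, the 1-5 score scale, and the example scores are all assumptions made for this sketch.

```python
from statistics import mean

def select_by_average(scores):
    """Pick the candidate with the highest mean judge score.

    scores[j][i] is the score judge j gave to candidate i (hypothetical layout).
    Returns (index of the selected candidate, list of per-candidate means).
    Ties go to the lowest candidate index.
    """
    n_candidates = len(scores[0])
    avg = [mean(judge[i] for judge in scores) for i in range(n_candidates)]
    best = max(range(n_candidates), key=lambda i: avg[i])
    return best, avg

# Hypothetical scores: 3 judges (rows) x 3 candidates (columns), 1-5 scale.
scores = [
    [4, 2, 5],
    [3, 2, 4],
    [4, 1, 4],
]
best, avg = select_by_average(scores)
print(best, avg)
```

Here every judge's score counts equally; the graphical-model variant differs precisely in weighting judges by their estimated reliability.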

What carries the argument

The three-stage peer-review pipeline, which reuses the same set of LLMs first as generators of candidate responses and then as judges of one another's responses, with the judges' scores aggregated to pick the single best output.

If this is right

The averaging and graphical-model variants each improve over Smoothie-Global by roughly seven points on four datasets.
Gains appear consistently across factual QA, math reasoning, and instruction-following tasks.
No training data or external supervision is required, so the method can be applied immediately to any new set of models and queries.
The explicit scoring stage produces traceable reasons for choosing one response over the others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scoring-and-selection loop could be repeated on refined drafts to create an iterative self-improvement cycle without external reward models.
When judge models are drawn from different families, the aggregation step may capture complementary strengths that single-model prompting misses.
The approach could serve as a lightweight post-generation filter in production systems where only one final answer is returned to the user.
If the graphical model variant proves more robust to noisy judges, it could guide future work on modeling judge reliability explicitly.

Load-bearing premise

That LLM-as-a-Judge scores are reliable and unbiased enough for the aggregation step to correctly identify the best response without any supervision or ground truth.
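
One way to probe this premise is to compare judge scores with human ratings on a small sample. The sketch below computes a Spearman rank correlation in pure Python; the helper names and the example scores are hypothetical, and the paper itself does not report such a check.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for five responses.
judge_scores = [5, 3, 4, 1, 2]
human_scores = [4, 3, 5, 1, 2]
rho = spearman(judge_scores, human_scores)
print(round(rho, 3))
```

A high correlation on such a sample would support the premise for that judge and task; it would not rule out systematic biases such as length or position preference.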

What would settle it

Measure whether the response chosen by the peer-review scores actually matches the ground-truth answer more often than the single best model or a random candidate on a dataset with verifiable correct answers.
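
The comparison described above could be computed as follows. This is a hedged sketch: `selection_report`, the correctness table, and the selected indices are invented for illustration, and the random baseline is taken as the expected accuracy of a uniformly random pick.

```python
def selection_report(correct, selected):
    """Compare ensemble selection with the best single model and a random pick.

    correct[q][i] is True if candidate i (produced by model i) is correct for query q.
    selected[q] is the candidate index the ensemble chose for query q.
    """
    n_queries = len(correct)
    n_models = len(correct[0])
    ensemble_acc = sum(correct[q][selected[q]] for q in range(n_queries)) / n_queries
    per_model_acc = [sum(row[i] for row in correct) / n_queries for i in range(n_models)]
    # Expected accuracy when one candidate is drawn uniformly at random per query.
    random_acc = sum(sum(row) / n_models for row in correct) / n_queries
    return {
        "ensemble": ensemble_acc,
        "best_single": max(per_model_acc),
        "random": random_acc,
    }

# Hypothetical data: 4 queries, 3 models.
correct = [
    [True, False, False],
    [False, True, False],
    [True, True, False],
    [False, False, True],
]
selected = [0, 1, 0, 1]
report = selection_report(correct, selected)
print(report)
```

The premise is supported only if the ensemble figure exceeds both baselines by more than run-to-run noise.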

Figures

Figures reproduced from arXiv: 2512.23213 by Bangjie Qin, Hailong Sun, Hao Wu, Jingzheng Li, Jinhuan Song, Junhang Cheng, Kai Sun, Qianren Mao, Xiangyang Ji, Yikun Ban, Zeyu Ji, Zhijun Chen, Zhuoran Li, Zhu Sun, Zizhe Wang.

**Figure 1.** The proposed LLM-PeerReview contains three steps: (1) Scoring: For a given query, after each LLM independently generates a response (analogous to a submitted academic paper), LLM-PeerReview applies the LLM-as-a-Judge technique (and the proposed flipped-triple scoring trick), treating each model as a reviewer to assign scores to all candidate responses; (2) Reasoning: LLM-PeerReview then uses a truth infere…

**Figure 2.** Probabilistic graphical representation. annotator-specific transition matrix $\Pi^{(j')}$ to model the probability that an LLM confuses one score category for another, capturing its scoring tendencies and potential biases: $p(y^{(i,j;j')} = n \mid t^{(i,j)} = m; \Pi^{(j')}) = \pi^{(j')}_{mn}$ (3), where $m, n \in \{1, \dots, K\}$ and $K$ denotes the number of categories (i.e., the number of score levels). Objective and optimizatio…

**Figure 3.** LLM performances (bottom: AlpacaEval). 3. Assessment: Experiment Setup. We provide the implementation of our LLM-PeerReview and all baselines for the used datasets: Code. Datasets and evaluation. We evaluate four widely-used datasets, grouped into three categories. (1) Factual Recall: TriviaQA (Joshi et al., 2017; Guha et al., 2024) evaluates the accuracy of model responses to factual questions across var…

**Figure 4.** Left: The transition matrix of each LLM estimated by LLM-PeerReview-W. Right: Correlation between matrix diagonal information of each LLM and its performance as a single judge (corresponding to “our variants” in…

**Figure 5.** Performance of various variants across different scoring levels.

read the original abstract

We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results across four datasets show that the two variants of the proposed approach outperform the advanced model Smoothie-Global by 6.9 and 7.3 percentage points, across diverse task types including factual recall QA, math reasoning, and instruction following.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The three-stage LLM-PeerReview ensemble is simple and unsupervised, but the claimed gains rest on unverified judge reliability.

The paper's main contribution is a straightforward three-stage process: multiple LLMs score candidate responses using the LLM-as-a-Judge approach, scores are aggregated either by simple averaging or by a graphical truth-inference model, and the highest-scoring response is selected. This is fully unsupervised and reuses models already in hand, which keeps it practical for downstream use without extra training or labels. The specific combination of LLM judges with truth inference for selection is the clearest new element, extending crowdsourcing-style aggregation to this setting in a clean way.

The reported 6.9–7.3 point gains over Smoothie-Global across four datasets that cover factual QA, math reasoning, and instruction following are the kind of result that could interest people building reliable LLM pipelines. The method stays conceptually light and interpretable, which is a plus for adoption.

The central soft spot is the lack of direct evidence that the judge scores actually track true response quality. Without reported correlations to human judgments, ablations on bias sources like length or style, or checks for systematic errors across model origins, the aggregation step could simply amplify whatever the judges already prefer. The abstract gives no variance numbers, significance tests, or dataset-size details, so the empirical claim is hard to assess from the summary alone.

This work is aimed at researchers working on LLM ensembles and evaluation who need lightweight, no-training options. A reader focused on practical improvements would find the framework worth examining, even if they later modify the scoring stage. I would send it to peer review; the idea is coherent enough and the direction is worth referee scrutiny, though the judge-reliability question will need concrete evidence in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes LLM-PeerReview, an unsupervised three-stage ensemble method for selecting the best response among multiple LLM-generated candidates: scoring via LLM-as-a-Judge, aggregation through either simple averaging or a graphical-model truth-inference procedure, and final selection of the highest-scoring output. The authors claim that both variants outperform the Smoothie-Global baseline by 6.9 and 7.3 percentage points on four datasets spanning factual QA, math reasoning, and instruction following.

Significance. If the reported gains are robust, the method would supply a transparent, training-free mechanism for harnessing complementary LLM strengths without supervision, offering a practical alternative to existing ensemble techniques in NLP.

major comments (3)

[Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.
[Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.
[Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.
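
For readers unfamiliar with the kind of truth-inference model at issue in the third comment, the following is a generic Dawid-Skene EM sketch in pure Python. It is an illustration under stated assumptions, not the paper's exact model: the function name, the vote-share initialisation, the additive `smoothing` constant, the 0-based score categories, and the toy labels are all choices made here.

```python
def dawid_skene(labels, n_classes, n_iters=50, smoothing=0.01):
    """Generic Dawid-Skene EM (a sketch, not the paper's exact procedure).

    labels[i][j] is the score category (0..n_classes-1) judge j gave response i.
    Returns (post, conf): post[i][m] is the posterior that response i truly
    belongs to category m; conf[j][m][n] estimates P(judge j reports n | truth m).
    """
    n_items, n_judges = len(labels), len(labels[0])
    # Initialise posteriors with each category's share of the judges' votes.
    post = [[sum(1 for lab in row if lab == k) / n_judges for k in range(n_classes)]
            for row in labels]
    conf = []
    for _ in range(n_iters):
        # M-step: re-estimate the class prior and each judge's confusion matrix.
        prior = [sum(p[k] for p in post) / n_items for k in range(n_classes)]
        conf = []
        for j in range(n_judges):
            matrix = []
            for m in range(n_classes):
                mass = sum(p[m] for p in post)
                matrix.append([
                    (sum(p[m] for p, lab in zip(post, labels) if lab[j] == n) + smoothing)
                    / (mass + n_classes * smoothing)
                    for n in range(n_classes)
                ])
            conf.append(matrix)
        # E-step: update the posterior over each response's true category.
        new_post = []
        for lab in labels:
            weights = []
            for m in range(n_classes):
                w = prior[m]
                for j in range(n_judges):
                    w *= conf[j][m][lab[j]]
                weights.append(w)
            total = sum(weights)
            new_post.append([w / total for w in weights])
        post = new_post
    return post, conf

# Toy data: 4 responses, 3 judges, 2 score categories; judge 2 is noisy.
labels = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
post, conf = dawid_skene(labels, n_classes=2)
# Expected score per response, usable as the final score in the selection stage.
expected = [sum(m * p for m, p in enumerate(row)) for row in post]
print([round(e, 3) for e in expected])
```

On this toy input the estimated confusion matrix for the third judge drifts toward uniform rows, so its votes carry little weight, which is the behaviour that distinguishes truth inference from plain averaging.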

minor comments (2)

[Method] The description of the aggregation procedures would be clearer if accompanied by explicit equations or pseudocode for both the averaging and graphical-model methods.
[Related Work] Consider expanding the related-work discussion to include recent LLM-as-a-Judge benchmarks and other unsupervised ensembling baselines for better positioning of the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating the revisions we will make to improve the clarity and completeness of the manuscript.

Referee: [Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.

Authors: We agree that these experimental details are essential for assessing the robustness of the reported gains. In the revised manuscript we will add statistical significance tests (paired t-tests across queries), report standard deviations over multiple random seeds, explicitly state the sizes of all four datasets, the number of candidate responses per query, and provide the exact prompting templates and decoding parameters (temperature, top-p, etc.) used for both generation and judging. revision: yes
Referee: [Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.

Authors: We acknowledge that direct validation of the LLM judges is valuable. Because the method is designed to remain fully unsupervised, we cannot replace the judges with humans at scale; however, we will add (i) a discussion of known LLM-as-a-Judge biases (length, position, style) with citations to recent studies, (ii) Pearson/Spearman correlations between LLM scores and human ratings on a randomly sampled subset of 200 examples from one dataset, and (iii) an explicit limitations paragraph noting the absence of a full human ablation. These additions will be placed in a new subsection under Scoring Stage. revision: partial
Referee: [Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.

Authors: We agree that the graphical-model component requires a more precise description. In the revision we will expand the Reasoning Stage section to include: the exact generative model (plate notation and factor graph), the EM-based parameter estimation procedure with initialization details, the closed-form update equations, and a sensitivity analysis varying the number of iterations and prior strength. We will also report the performance delta between the averaging and truth-inference variants on each dataset to quantify the contribution of this stage. revision: yes
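
The paired t-test across queries mentioned in the first response can be sketched as below. Only the test statistic and degrees of freedom are computed, since the Python standard library has no t-distribution CDF; a p-value could be obtained with `scipy.stats.ttest_rel` instead. The per-query scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t statistic for per-query scores of two methods on the same queries.

    Returns (t, degrees_of_freedom). Needs at least two queries, and the
    per-query differences must not all be identical (zero variance).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical per-query correctness (1 = correct) for two methods.
ensemble = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 0, 0, 1, 0]
t_stat, dof = paired_t_statistic(ensemble, baseline)
print(round(t_stat, 3), dof)
```

Pairing by query matters because both methods see the same inputs; an unpaired test would discard that structure and typically lose power.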

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper presents LLM-PeerReview as a three-stage empirical procedure: LLM-as-a-Judge scoring of candidate responses, aggregation via averaging or graphical-model truth inference, and selection of the highest-scoring output. Performance is reported via direct empirical comparison to the external baseline Smoothie-Global on four datasets, with no equations, fitted parameters, or self-citations that reduce the central claim to a redefinition or renaming of its own inputs. The method is self-contained as an unsupervised ensemble technique whose validity rests on external benchmark results rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven reliability of LLM-as-Judge scoring and the effectiveness of the chosen aggregation methods; no free parameters or invented entities are mentioned, but the domain assumption that peer-review style scoring works without supervision is load-bearing.

axioms (1)

domain assumption: LLM-as-a-Judge produces reliable quality scores for responses.
The scoring stage depends entirely on this assumption holding for the subsequent selection to be meaningful.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
LLM-PeerReview operates in three stages: scoring via LLM-as-a-Judge, reasoning via averaging or graphical-model truth inference (Dawid-Skene), and selection of highest-scoring response.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Results show 6.9–7.3 pp gains over Smoothie-Global on TriviaQA/GSM8k/MATH/AlpacaEval with no reference to cost convexity or golden-ratio fixed points.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy Improvement Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[2] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv:2303.08774, 2023.
[3] Bishop, C. M. and Nasrabadi, N. M. Pattern recognition and machine learning, volume 4. Springer, 2006.
[4] Chen, J., Yoon, J., Ebrahimi, S., Arik, S. O., Pfister, T., and Jha, S. Adaptation with self-evaluation to improve selective prediction in LLMs. arXiv preprint arXiv:2310.11689, 2023a.
[5] Chen, J., Su, W., Chu, Z., Li, H., Ai, Q., Liu, Y., Zhang, M., and Ma, S. An automatic and cost-efficient peer-review framework for language generation evaluation. arXiv preprint arXiv:2410.12265, 2024.
[6] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a.
[7] Chen, P., Sun, H., Yang, Y., and Chen, Z. Adversarial learning from crowds. In AAAI, 2022.
[8] Chen, Z., Wang, H., Sun, H., Chen, P., Han, T., Liu, X., and Yang, J. Structured probabilistic end-to-end learning from crowds. In IJCAI, 2021b.
[9] Chen, Z., Sun, H., Zhang, W., Xu, C., Mao, Q., and Chen, P. Neural-Hidden-CRF: A robust weakly-supervised sequence labeler. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 274–285, 2023b.
[10] Chen, Z., Li, J., Chen, P., Li, Z., Sun, K., Luo, Y., Mao, Q., Yang, D., Sun, H., and Yu, P. S. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036, 2025.
[11] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
[12] Chu, Z., Ai, Q., Tu, Y., Li, H., and Liu, Y. PRE: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641, 2024.
[13] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[14] Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.
[15] Daynauth, R. and Mars, J. Aligning model evaluations with human preferences: Mitigating token count bias in language model assessments. arXiv preprint arXiv:2407.12847, 2024.
[16] Dempster, A. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statistical Society, 39:1–22, 1977.
[17] Dong, X., Yu, Z., Cao, W., Shi, Y., and Ma, Q. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.
[18] Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P. S., and Hashimoto, T. B. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36:30039–30069, 2023.
[19] Everitt, B. An introduction to latent variable models. Springer Science & Business Media, 2013.
[20] Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
[21] Goodfellow, I. Deep learning, 2016.
[22] Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594, 2024.
[23] Guha, N., Chen, M., Chow, T., Khare, I., and Re, C. Smoothie: Label free language model routing. Advances in Neural Information Processing Systems, 37:127645–127672, 2024.
[24] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[25] Hu, Z., Zhang, J., Xiong, Z., Ratner, A., Xiong, H., and Krishna, R. Language model preference evaluation with multiple weak evaluators. arXiv preprint arXiv:2410.12869, 2024.
[26] Huang, Y., Feng, X., Li, B., Xiang, Y., Wang, H., Liu, T., and Qin, B. Ensemble learning for heterogeneous large language models with deep parallel collaboration. In NeurIPS, 2024.
[27] Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
[28] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
[29] Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024.
[30] Kotonya, N., Krishnasamy, S., Tetreault, J., and Jaimes, A. Little giants: Exploring the potential of small LLMs as evaluation metrics in summarization in the Eval4NLP 2023 shared task. arXiv preprint arXiv:2311.00686, 2023.
[31] Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al. From generation to judgment: Opportunities and challenges of LLM-as-a-Judge. arXiv preprint arXiv:2411.16594, 2024a.
[32] Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. LLMs-as-Judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024b.
[33] Li, J., Zhang, Q., Yu, Y., Fu, Q., and Ye, D. More agents is all you need. arXiv preprint arXiv:2402.05120, 2024c.
[34] Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.
[35] Liu, C., Quan, X., Pan, Y., Lin, L., Wu, W., and Chen, X. Cool-fusion: Fuse large language models without training. arXiv preprint arXiv:2407.19807, 2024b.
[36] Lu, X., Liu, Z., Liusie, A., Raina, V., Mudupalli, V., Zhang, Y., and Beauchamp, W. Blending is all you need: Cheaper, better alternative to trillion-parameters LLM. arXiv preprint arXiv:2401.02994, 2024.
[37] Lv, B., Tang, C., Zhang, Y., Liu, X., Luo, P., and Yu, Y. URG: A unified ranking and generation method for ensembling language models. In Findings of the ACL, 2024.
[38] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665, 2024.
[39] Owens, D. M., Rossi, R. A., Kim, S., Yu, T., Dernoncourt, F., Chen, X., Zhang, R., Gu, J., Deilamsalehy, H., and Lipka, N. A multi-LLM debiasing framework. arXiv preprint arXiv:2409.13884, 2024.
[40] Park, S., Liu, X., Gong, Y., and Choi, E. Ensembling large language models with process reward-guided tree search for better complex reasoning. arXiv preprint arXiv:2412.15797, 2024.
[41] Shnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y., Solomon, J., Thompson, N., and Yurochkin, M. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
[42]

Getting more out of mixture of language model reasoning experts

Si, C., Shi, W., Zhao, C., Zettlemoyer, L., and Boyd-Graber, J. Getting more out of mixture of language model reasoning experts. In Findings of EMNLP, 2023

work page 2023
[43]

Harnessing the power of multiple minds: Lessons learned from llm routing.arXiv preprint arXiv:2405.00467,

Srivatsa, K., Maurya, K. K., and Kochmar, E. Harnessing the power of multiple minds: Lessons learned from llm routing. arXiv preprint arXiv:2405.00467, 2024

work page arXiv 2024
[44]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Llm-topla: Efficient llm ensemble by maximising diversity

Tekin, S., Ilhan, F., Huang, T., Hu, S., and Liu, L. Llm-topla: Efficient llm ensemble by maximising diversity. In Findings of EMNLP, 2024

work page 2024
[46]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Verga, P., Hofstatter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Koala: An index for quantifying overlaps with pre-training corpora

Vu, T.-T., He, X., Haffari, G., and Shareghi, E. Koala: An index for quantifying overlaps with pre-training corpora. arXiv preprint arXiv:2303.14770, 2023

work page arXiv 2023
[49] Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., and Sui, Z. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.
[50] Wang, V., Zhang, M. J., and Choi, E. Improving LLM-as-a-judge inference with the judgment distribution. arXiv preprint arXiv:2503.03064, 2025.
[51] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[52] Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., and Sukhbaatar, S. Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge. arXiv preprint arXiv:2407.19594, 2024.
[53] Xu, Y., Lu, J., and Zhang, J. Bridging the gap between different vocabularies for LLM ensemble. In NAACL, pp. 7133–7145, 2024.
[54] Xu, Y., Chen, J., Wu, J., and Zhang, J. Hit the sweet spot! Span-level ensemble for large language models. In COLING, pp. 8314–8325, 2025.
[55] Yu, Y.-C., Kuo, C.-C., Ye, Z., Chang, Y.-C., and Li, Y.-S. Breaking the ceiling of the LLM community by treating token generation as a classification for ensembling. arXiv preprint arXiv:2406.12585, 2024.
[56] Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., and Ratner, A. WRENCH: A comprehensive benchmark for weak supervision. arXiv preprint arXiv:2109.11377, 2021.
[57] Zhang, X., Yu, B., Yu, H., Lv, Y., Liu, T., Huang, F., Xu, H., and Li, Y. Wider and deeper LLM networks are fairer LLM evaluators. arXiv preprint arXiv:2308.01862, 2023.
[58] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
[59] Zheng, Y., Li, G., Li, Y., Shan, C., and Cheng, R. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541–552, 2017.
[60] Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023.

[2] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[3] Bishop, C. M. and Nasrabadi, N. M. Pattern recognition and machine learning, volume 4. Springer, 2006.

[4] Chen, J., Yoon, J., Ebrahimi, S., Arik, S. O., Pfister, T., and Jha, S. Adaptation with self-evaluation to improve selective prediction in LLMs. arXiv preprint arXiv:2310.11689, 2023a.

[5] Chen, J., Su, W., Chu, Z., Li, H., Ai, Q., Liu, Y., Zhang, M., and Ma, S. An automatic and cost-efficient peer-review framework for language generation evaluation. arXiv preprint arXiv:2410.12265, 2024.

[6] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a.

[7] Chen, P., Sun, H., Yang, Y., and Chen, Z. Adversarial learning from crowds. In AAAI, 2022.

[8] Chen, Z., Wang, H., Sun, H., Chen, P., Han, T., Liu, X., and Yang, J. Structured probabilistic end-to-end learning from crowds. In IJCAI, 2021b.

[9] Chen, Z., Sun, H., Zhang, W., Xu, C., Mao, Q., and Chen, P. Neural-Hidden-CRF: A robust weakly-supervised sequence labeler. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 274–285, 2023b.

[10] Chen, Z., Li, J., Chen, P., Li, Z., Sun, K., Luo, Y., Mao, Q., Yang, D., Sun, H., and Yu, P. S. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036, 2025.

[11] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

[12] Chu, Z., Ai, Q., Tu, Y., Li, H., and Liu, Y. PRE: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641, 2024.

[13] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.

[14] Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.

[15] Daynauth, R. and Mars, J. Aligning model evaluations with human preferences: Mitigating token count bias in language model assessments. arXiv preprint arXiv:2407.12847, 2024.

[16] Dempster, A. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–22, 1977.

[17] Dong, X., Yu, Z., Cao, W., Shi, Y., and Ma, Q. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.

[18] Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P. S., and Hashimoto, T. B. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36:30039–30069, 2023.

[19] Everitt, B. An introduction to latent variable models. Springer Science & Business Media, 2013.

[20] Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.

[21] Goodfellow, I. Deep learning, 2016.

[22] Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.

[23] Guha, N., Chen, M., Chow, T., Khare, I., and Re, C. Smoothie: Label free language model routing. Advances in Neural Information Processing Systems, 37:127645–127672, 2024.

[24] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

[25] Hu, Z., Zhang, J., Xiong, Z., Ratner, A., Xiong, H., and Krishna, R. Language model preference evaluation with multiple weak evaluators. arXiv preprint arXiv:2410.12869, 2024.

[26] Huang, Y., Feng, X., Li, B., Xiang, Y., Wang, H., Liu, T., and Qin, B. Ensemble learning for heterogeneous large language models with deep parallel collaboration. In NeurIPS, 2024.

[27] Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.

[28] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

[29] Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535, 2024.

[30] Kotonya, N., Krishnasamy, S., Tetreault, J., and Jaimes, A. Little giants: Exploring the potential of small LLMs as evaluation metrics in summarization in the Eval4NLP 2023 shared task. arXiv preprint arXiv:2311.00686, 2023.

[31] Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. arXiv preprint arXiv:2411.16594, 2024a.

[32] Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024b.

[33] Li, J., Zhang, Q., Yu, Y., Fu, Q., and Ye, D. More agents is all you need. arXiv preprint arXiv:2402.05120, 2024c.

[34] Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

[35] Liu, C., Quan, X., Pan, Y., Lin, L., Wu, W., and Chen, X. Cool-Fusion: Fuse large language models without training. arXiv preprint arXiv:2407.19807, 2024b.

[36] Lu, X., Liu, Z., Liusie, A., Raina, V., Mudupalli, V., Zhang, Y., and Beauchamp, W. Blending is all you need: Cheaper, better alternative to trillion-parameters LLM. arXiv preprint arXiv:2401.02994, 2024.

[37] Lv, B., Tang, C., Zhang, Y., Liu, X., Luo, P., and Yu, Y. URG: A unified ranking and generation method for ensembling language models. In Findings of the ACL, 2024.

[38] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665, 2024.

[39] Owens, D. M., Rossi, R. A., Kim, S., Yu, T., Dernoncourt, F., Chen, X., Zhang, R., Gu, J., Deilamsalehy, H., and Lipka, N. A multi-LLM debiasing framework. arXiv preprint arXiv:2409.13884, 2024.

[40] Park, S., Liu, X., Gong, Y., and Choi, E. Ensembling large language models with process reward-guided tree search for better complex reasoning. arXiv preprint arXiv:2412.15797, 2024.

[41] Shnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y., Solomon, J., Thompson, N., and Yurochkin, M. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.

[42] Si, C., Shi, W., Zhao, C., Zettlemoyer, L., and Boyd-Graber, J. Getting more out of mixture of language model reasoning experts. In Findings of EMNLP, 2023.
