Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
Pith reviewed 2026-05-16 19:54 UTC · model grok-4.3
The pith
An unsupervised peer-review process lets multiple LLMs score and select the best response from candidates for any query.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-PeerReview runs in three stages: scoring, where multiple LLMs judge each candidate response; reasoning, where scores are combined by simple averaging or a principled graphical model; and selection, where the response with the highest final score is chosen as the ensemble output. This unsupervised pipeline produces stronger results than previous methods on factual recall, mathematical reasoning, and instruction-following benchmarks.
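The three stages can be sketched in a few lines. This is a minimal illustration of the averaging variant only, not the authors' implementation: the `Judge` interface, the function names, and the toy judges are all hypothetical, and a real judge would call an LLM with a scoring prompt.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical interface: a judge maps (query, response) to a numeric score.
Judge = Callable[[str, str], float]

def score_candidates(query: str, candidates: Sequence[str],
                     judges: Sequence[Judge]) -> List[List[float]]:
    """Stage 1 (scoring): scores[i][j] is judge j's score for candidate i."""
    return [[judge(query, c) for judge in judges] for c in candidates]

def aggregate_by_mean(scores: Sequence[Sequence[float]]) -> List[float]:
    """Stage 2 (reasoning, averaging variant): one final score per candidate."""
    return [mean(row) for row in scores]

def select_best(candidates: Sequence[str], final_scores: Sequence[float]) -> str:
    """Stage 3 (selection): highest final score wins; ties go to the earliest."""
    best = max(range(len(candidates)), key=lambda i: final_scores[i])
    return candidates[best]

def peer_review(query: str, candidates: Sequence[str],
                judges: Sequence[Judge]) -> str:
    scores = score_candidates(query, candidates, judges)
    return select_best(candidates, aggregate_by_mean(scores))

# Toy stand-ins for LLM judges (assumptions, for illustration only).
def judge_prefers_short(query: str, response: str) -> float:
    return 5.0 if len(response) < 20 else 2.0

def judge_checks_keyword(query: str, response: str) -> float:
    return 5.0 if "Paris" in response else 1.0

def judge_neutral(query: str, response: str) -> float:
    return 3.0

candidates = [
    "Paris",
    "It is probably Lyon, I think.",
    "The capital is Paris, of course.",
]
judges = [judge_prefers_short, judge_checks_keyword, judge_neutral]
chosen = peer_review("What is the capital of France?", candidates, judges)
```

Keeping the score matrix as an explicit intermediate value is what makes the selection traceable: the per-judge scores behind any choice can be logged and inspected.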
What carries the argument
The three-stage peer-review pipeline, which reuses the LLMs at hand as judges of every candidate response and then aggregates their scores (by averaging or by truth inference) to pick the single best output.
If this is right
- The two variants (averaging and graphical-model) improve over Smoothie-Global by 6.9 and 7.3 percentage points across four datasets.
- Gains appear consistently across factual QA, math reasoning, and instruction-following tasks.
- No training data or external supervision is required, so the method can be applied immediately to any new set of models and queries.
- The explicit scoring stage produces traceable reasons for choosing one response over the others.
Where Pith is reading between the lines
- The same scoring-and-selection loop could be repeated on refined drafts to create an iterative self-improvement cycle without external reward models.
- When judge models are drawn from different families, the aggregation step may capture complementary strengths that single-model prompting misses.
- The approach could serve as a lightweight post-generation filter in production systems where only one final answer is returned to the user.
- If the graphical model variant proves more robust to noisy judges, it could guide future work on modeling judge reliability explicitly.
Load-bearing premise
That LLM-as-a-Judge scores are reliable and unbiased enough for the aggregation step to correctly identify the best response without any supervision or ground truth.
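One way to probe this premise is to check how well judge scores track an independent signal, such as human ratings on a small sample, or an unwanted one, such as response length. The sketch below computes a Spearman rank correlation with the standard library only; the function names are illustrative and the paper's page gives no data to run it on.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(judge_scores, human_ratings)` would estimate agreement with people, while `spearman(judge_scores, response_lengths)` gives a crude length-bias check; a strong positive value in the second case would suggest the judge rewards verbosity.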
What would settle it
Measure whether the response chosen by the peer-review scores actually matches the ground-truth answer more often than the single best model or a random candidate on a dataset with verifiable correct answers.
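The comparison described above might look like the following sketch. It assumes a hypothetical boolean table `correct[q][m]` (whether model m's candidate matches the ground truth on query q) and a list `chosen[q]` holding the index the peer-review scores selected; neither structure comes from the paper.

```python
from typing import Dict, Sequence

def accuracy(flags: Sequence[bool]) -> float:
    return sum(flags) / len(flags)

def compare_selectors(correct: Sequence[Sequence[bool]],
                      chosen: Sequence[int]) -> Dict[str, float]:
    """Accuracy of the ensemble choice against three reference points."""
    n_models = len(correct[0])
    per_model = [accuracy([row[m] for row in correct]) for m in range(n_models)]
    return {
        # Accuracy of the candidate the peer-review scores picked.
        "ensemble": accuracy([row[c] for row, c in zip(correct, chosen)]),
        # Accuracy of the single strongest model used on every query.
        "best_single": max(per_model),
        # Expected accuracy of picking a candidate uniformly at random.
        "random": sum(sum(row) / n_models for row in correct) / len(correct),
        # Upper bound: some candidate is correct on the query.
        "oracle": accuracy([any(row) for row in correct]),
    }

# Toy data: 4 queries, 3 models.
correct = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
    [True, True, False],
]
chosen = [0, 1, 2, 0]
report = compare_selectors(correct, chosen)
```

The oracle row is worth reporting alongside the others because it bounds what any selection-only ensemble can achieve on the same candidate pool.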
Original abstract
We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; for reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results across four datasets show that the two variants of the proposed approach outperform the advanced model Smoothie-Global by 6.9 and 7.3 percentage points, across diverse task types including factual recall QA, math reasoning, and instruction following.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LLM-PeerReview, an unsupervised three-stage ensemble method for selecting the best response among multiple LLM-generated candidates: scoring via LLM-as-a-Judge, aggregation through either simple averaging or a graphical-model truth-inference procedure, and final selection of the highest-scoring output. The authors claim that both variants outperform the Smoothie-Global baseline by 6.9 and 7.3 percentage points on four datasets spanning factual QA, math reasoning, and instruction following.
Significance. If the reported gains are robust, the method would supply a transparent, training-free mechanism for harnessing complementary LLM strengths without supervision, offering a practical alternative to existing ensemble techniques in NLP.
major comments (3)
- [Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.
- [Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.
- [Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.
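The significance check requested in the first major comment could be run on per-query correctness for the two systems. The sketch below computes the paired t statistic and, because the Python standard library has no t-distribution CDF, swaps in a sign-flip permutation test for the p-value. That substitution, the function names, and the toy data are assumptions for illustration, not anything the manuscript reports.

```python
import random
from statistics import mean, stdev
from typing import Sequence

def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the per-query differences a[q] - b[q]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)

def sign_flip_p_value(a: Sequence[float], b: Sequence[float],
                      n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value: how often random sign flips of the paired
    differences give a mean at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Toy per-query correctness (1 = correct) for two systems on 8 queries.
ensemble = [1, 1, 1, 0, 1, 1, 0, 1]
baseline = [0, 1, 0, 0, 1, 0, 0, 1]
t_stat = paired_t_statistic(ensemble, baseline)
p_value = sign_flip_p_value(ensemble, baseline)
```

Pairing by query matters here: both systems see the same inputs, so a test on the per-query differences is more sensitive than comparing two independent accuracy figures.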
minor comments (2)
- [Method] The description of the aggregation procedures would be clearer if accompanied by explicit equations or pseudocode for both the averaging and graphical-model methods.
- [Related Work] Consider expanding the related-work discussion to include recent LLM-as-a-Judge benchmarks and other unsupervised ensembling baselines for better positioning of the contribution.
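As an indication of what the equations requested in the first minor comment might look like, the block below writes out the averaging rule and a Dawid-Skene-style truth-inference model. The notation is assumed rather than taken from the manuscript: s_{ij} is the discrete score in {1, ..., K} that judge j gives candidate i, z_i is a latent quality level, and using the posterior expectation as the final score is one plausible choice among several.

```latex
% Averaging variant: final score is the mean over J judges.
\hat{s}_i = \frac{1}{J} \sum_{j=1}^{J} s_{ij}

% Dawid-Skene-style generative model (assumed form):
% latent level z_i, one confusion matrix \theta^{(j)} per judge.
z_i \sim \mathrm{Categorical}(\boldsymbol{\pi}), \qquad
s_{ij} \mid z_i = k \sim \mathrm{Categorical}\big(\boldsymbol{\theta}^{(j)}_{k}\big)

% E-step: posterior over the latent level of candidate i.
\gamma_{ik} =
  \frac{\pi_k \prod_{j=1}^{J} \theta^{(j)}_{k,\, s_{ij}}}
       {\sum_{k'=1}^{K} \pi_{k'} \prod_{j=1}^{J} \theta^{(j)}_{k',\, s_{ij}}}

% M-step: closed-form updates over all N scored candidates.
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad
\theta^{(j)}_{k,l} =
  \frac{\sum_{i=1}^{N} \gamma_{ik}\, \mathbb{1}[s_{ij} = l]}
       {\sum_{i=1}^{N} \gamma_{ik}}

% Final score and selection among the candidates for one query.
\hat{s}_i = \sum_{k=1}^{K} k\, \gamma_{ik}, \qquad
i^\star = \arg\max_{i} \hat{s}_i
```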
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating the revisions we will make to improve the clarity and completeness of the manuscript.
-
Referee: [Abstract and Experimental Results] Abstract and results section: the central claim of 6.9 and 7.3 pp gains over Smoothie-Global is presented without any mention of statistical significance tests, standard deviations across runs, dataset sizes, number of candidates per query, or exact prompting and decoding controls, leaving the empirical support for the performance advantage incomplete.
Authors: We agree that these experimental details are essential for assessing the robustness of the reported gains. In the revised manuscript we will add statistical significance tests (paired t-tests across queries), report standard deviations over multiple random seeds, explicitly state the sizes of all four datasets and the number of candidate responses per query, and provide the exact prompting templates and decoding parameters (temperature, top-p, etc.) used for both generation and judging.
Revision: yes
-
Referee: [Scoring Stage] Scoring stage: the approach depends on LLM-as-a-Judge producing reliable rankings in the absence of ground truth, yet no correlation analysis with human judgments, bias diagnostics (e.g., length or style preferences), or ablation replacing LLM judges with humans is supplied; this assumption is load-bearing for the subsequent aggregation and selection steps.
Authors: We acknowledge that direct validation of the LLM judges is valuable. Because the method is designed to remain fully unsupervised, we cannot replace the judges with humans at scale; however, we will add (i) a discussion of known LLM-as-a-Judge biases (length, position, style) with citations to recent studies, (ii) Pearson/Spearman correlations between LLM scores and human ratings on a randomly sampled subset of 200 examples from one dataset, and (iii) an explicit limitations paragraph noting the absence of a full human ablation. These additions will be placed in a new subsection under Scoring Stage.
Revision: partial
-
Referee: [Reasoning Stage] Reasoning stage: while the graphical-model truth-inference variant is offered as a principled alternative to averaging, the manuscript does not detail the precise generative model, parameter estimation procedure, or any sensitivity analysis, making it impossible to determine how much this component contributes to the reported improvements.
Authors: We agree that the graphical-model component requires a more precise description. In the revision we will expand the Reasoning Stage section to include: the exact generative model (plate notation and factor graph), the EM-based parameter estimation procedure with initialization details, the closed-form update equations, and a sensitivity analysis varying the number of iterations and prior strength. We will also report the performance delta between the averaging and truth-inference variants on each dataset to quantify the contribution of this stage.
Revision: yes
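For orientation, an EM procedure of the kind the third response promises could resemble the sketch below. It is a textbook Dawid-Skene loop over discrete judge scores, written under assumptions the page does not confirm: scores are integers in `range(n_classes)`, initialization uses per-item vote fractions, and `smoothing` stands in for the prior strength. It should not be read as the paper's exact model.

```python
from math import prod
from typing import List, Sequence

def dawid_skene(labels: Sequence[Sequence[int]], n_classes: int,
                n_iter: int = 50, smoothing: float = 0.01) -> List[List[float]]:
    """labels[i][j] is the score judge j gave response i.
    Returns post[i][k], the posterior that response i has latent level k."""
    n_items, n_judges = len(labels), len(labels[0])
    # Initialise posteriors with vote fractions (a soft majority vote).
    post = [[sum(1 for l in row if l == k) / n_judges for k in range(n_classes)]
            for row in labels]
    for _ in range(n_iter):
        # M-step: class prior plus one smoothed confusion matrix per judge.
        prior = [sum(p[k] for p in post) / n_items for k in range(n_classes)]
        confusion = []
        for j in range(n_judges):
            counts = [[smoothing] * n_classes for _ in range(n_classes)]
            for i in range(n_items):
                for k in range(n_classes):
                    counts[k][labels[i][j]] += post[i][k]
            confusion.append([[v / sum(r) for v in r] for r in counts])
        # E-step: recompute each response's posterior over latent levels.
        new_post = []
        for i in range(n_items):
            weights = [prior[k] * prod(confusion[j][k][labels[i][j]]
                                       for j in range(n_judges))
                       for k in range(n_classes)]
            total = sum(weights)
            new_post.append([w / total for w in weights])
        post = new_post
    return post

def expected_score(posterior_row: Sequence[float]) -> float:
    """Collapse a posterior into one final score (an assumed choice)."""
    return sum(k * p for k, p in enumerate(posterior_row))

# Toy data: 4 responses scored by 3 judges on a 0-2 scale.
labels = [[2, 2, 1], [0, 0, 1], [2, 2, 2], [0, 1, 0]]
posteriors = dawid_skene(labels, n_classes=3)
final_scores = [expected_score(row) for row in posteriors]
```

Unlike plain averaging, the per-judge confusion matrices let the procedure discount a judge whose scores disagree systematically with the inferred consensus, which is the robustness property a sensitivity analysis would need to examine.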
Circularity Check
No significant circularity detected in derivation chain
The paper presents LLM-PeerReview as a three-stage empirical procedure: LLM-as-a-Judge scoring of candidate responses, aggregation via averaging or graphical-model truth inference, and selection of the highest-scoring output. Performance is reported via direct empirical comparison to the external baseline Smoothie-Global on four datasets, with no equations, fitted parameters, or self-citations that reduce the central claim to a redefinition or renaming of its own inputs. The method is self-contained as an unsupervised ensemble technique whose validity rests on external benchmark results rather than internal construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-a-Judge produces reliable quality scores for responses
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
LLM-PeerReview operates in three stages: scoring via LLM-as-a-Judge, reasoning via averaging or graphical-model truth inference (Dawid-Skene), and selection of highest-scoring response.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Results show 6.9–7.3 pp gains over Smoothie-Global on TriviaQA/GSM8k/MATH/AlpacaEval with no reference to cost convexity or golden-ratio fixed points.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Policy Improvement Reinforcement Learning
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.