League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Baosheng Wang; Enze Wang; Kai Chen; Qianhong Guo; Shuoyoucheng Ma; Tian Xia; Wei Xie; Xiaobing Sun; Xiaofang Cai; Xiaofeng Wang

arxiv: 2507.22359 · v4 · submitted 2025-07-30 · 💻 cs.AI · cs.CL

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Qianhong Guo , Wei Xie , Xiaofang Cai , Enze Wang , Shuoyoucheng Ma , Xiaobing Sun , Tian Xia , Kai Chen

show 2 more authors

Xiaofeng Wang Baosheng Wang

This is my paper

Pith reviewed 2026-05-19 03:13 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords LLM evaluationbenchmark-freemutual evaluationmodel rankingself-governed leaguecapability distinctionevaluation criteria

0 comments

The pith

Large language models can evaluate each other in repeated rounds to produce stable capability rankings without fixed benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes League of LLMs, a paradigm that places multiple LLMs into a self-governed league where they perform multi-round mutual evaluations of one another. This setup incorporates four integrated criteria—dynamic, transparent, objective, and professional—to address data contamination, opacity, and subjective bias in conventional testing. Experiments involving eight mainstream models on mathematics and programming tasks show that the resulting rankings distinguish model strengths while achieving 70.7 percent top-k consistency across internal checks. The approach also surfaces observations such as memorization-driven responses in some models and statistically higher scores within the same model family.

Core claim

League of LLMs organizes LLMs into a self-governed league for multi-round mutual evaluation, integrating four core criteria to mitigate limitations of existing paradigms, and experiments on eight models demonstrate it distinguishes LLM capabilities while maintaining high internal ranking stability of 70.7 percent top-k consistency, along with empirical findings on memorization behaviors and in-family score differences.

What carries the argument

The self-governed league structure that enables LLMs to conduct repeated, multi-model mutual evaluations under the combined dynamic, transparent, objective, and professional criteria.

Load-bearing premise

Large language models can evaluate one another in an objective and professional manner without their own biases or contamination affecting the outcomes.

What would settle it

Repeated league runs on the same models producing top-k consistency well below 70 percent, or league rankings diverging sharply from rankings obtained on large sets of uncontaminated benchmark questions.

Figures

Figures reproduced from arXiv: 2507.22359 by Baosheng Wang, Enze Wang, Kai Chen, Qianhong Guo, Shuoyoucheng Ma, Tian Xia, Wei Xie, Xiaobing Sun, Xiaofang Cai, Xiaofeng Wang.

**Figure 2.** Figure 2: Evaluation pipeline overview. reducing subjective bias and the possibility of manipulation, to accurately reflect the objective performance of the LLMs. 4. Professional: The questions and answers employed in evaluations should be at or near the average level of human experts in relevant domains, enabling in-depth evaluation of the LLM’s professional capabilities in vertical domains. None of the existing ev… view at source ↗

**Figure 3.** Figure 3: Prompts design using role-play and zero-shot for [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: Prompts design using role-play and few-shot for [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Dual-axis comparison of LLMs on programming [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (Top-$k$ consistency $= 70.7\%$). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance, ``memorization-based answering'' behaviors are observed in some models, and higher in-family scores are found in the OpenAI model family ($\Delta = 9$, $p < 0.05$). Finally, we make our framework and code publicly available as a valuable complement to the current LLM evaluation ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The mutual evaluation league mainly surfaces family biases in LLM judgments rather than delivering externally validated capability rankings.

read the letter

The main thing to know is that this paper sets up LLMs to judge each other in a league format to sidestep benchmarks, but the results lean more toward models favoring their own family than toward clean, objective distinctions in ability. The new element is the self-governed league that runs multi-round mutual evaluations under four criteria labeled dynamic, transparent, objective, and professional. They applied it to eight mainstream LLMs on math and programming tasks and reported 70.7 percent top-k ranking consistency plus observations such as memorization-based answering in some models and a nine-point in-family score boost for OpenAI models at p less than 0.05. The paper does well by releasing the framework and code publicly, which gives others a concrete starting point, and by using the mutual setup to flag empirical patterns that fixed benchmarks often miss. The soft spots sit in the validation. Ranking stability is computed from the same mutual scores, so it stays internal and does not confirm that the order tracks real capability gaps. The family bias finding itself undercuts the objectivity claim, since it suggests shared training data or stylistic preferences may drive the scores. No external anchor such as human ratings or correlation with established benchmarks is described, which leaves the central claim plausible but unanchored. Prompt details and explicit controls for self-bias are also thin in the reported summary. This work targets researchers who want alternatives to contaminated benchmarks. Readers focused on evaluation methods could extract value from the league idea and the open code even if the current evidence is preliminary. It has enough structure to deserve a serious referee who can require external checks and tighter method descriptions. I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper proposes League of LLMs (LOL), a benchmark-free paradigm in which multiple LLMs are organized into a self-governed league that performs multi-round mutual evaluations. The approach integrates four core criteria (dynamic, transparent, objective, and professional) to address limitations of existing evaluation methods such as data contamination and opacity. Experiments involving eight mainstream LLMs on mathematics and programming tasks report that LOL distinguishes model capabilities while achieving 70.7% top-k ranking consistency; additional observations include memorization-based answering patterns and statistically significant in-family score inflation for OpenAI models (Δ = 9, p < 0.05). The framework and code are released publicly.

Significance. If the mutual-evaluation rankings can be shown to track genuine capability differences rather than model similarity or self-reinforcing bias, LOL would constitute a useful complement to benchmark-based evaluation by mitigating contamination risks. The public code release supports reproducibility and further experimentation. At present, however, the absence of external anchors limits the strength of the significance claim.

major comments (2)

[Abstract] Abstract: the central claim that LOL 'can effectively distinguish LLM capabilities' rests on internal metrics (Top-k consistency = 70.7% and family score differences) computed from the same mutual evaluations; no correlation with human judgments, standard benchmarks, or other external validators is reported, leaving open the possibility that observed distinctions primarily reflect model similarity or shared training artifacts rather than objective capability.
[Abstract] Abstract / experimental description: the four core criteria are asserted to produce objective and professional evaluations, yet the abstract provides no information on prompt templates, scoring rubrics, aggregation rules, or explicit controls for self-bias and contamination; without these details the objectivity claim cannot be assessed and the reported stability metric remains difficult to interpret.

minor comments (1)

The abstract would be clearer if it briefly stated the number of evaluation rounds, the exact tasks used, and the total number of pairwise comparisons performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our benchmark-free evaluation paradigm. We respond to each major comment below and note planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that LOL 'can effectively distinguish LLM capabilities' rests on internal metrics (Top-k consistency = 70.7% and family score differences) computed from the same mutual evaluations; no correlation with human judgments, standard benchmarks, or other external validators is reported, leaving open the possibility that observed distinctions primarily reflect model similarity or shared training artifacts rather than objective capability.

Authors: We agree that external validators such as human judgments or standard benchmark correlations would strengthen claims of distinguishing genuine capabilities rather than similarity artifacts. The current work emphasizes a benchmark-free approach precisely to sidestep contamination and opacity issues in existing evaluations; the reported 70.7% top-k consistency across independent rounds and the statistically significant in-family inflation (Δ=9, p<0.05) provide internal evidence of stable distinctions and bias detection. We will revise the manuscript to explicitly discuss this limitation in a dedicated paragraph and outline future directions for external anchoring, such as targeted human comparisons on a subset of tasks. revision: partial
Referee: [Abstract] Abstract / experimental description: the four core criteria are asserted to produce objective and professional evaluations, yet the abstract provides no information on prompt templates, scoring rubrics, aggregation rules, or explicit controls for self-bias and contamination; without these details the objectivity claim cannot be assessed and the reported stability metric remains difficult to interpret.

Authors: The abstract is length-constrained, while the full manuscript details the prompt templates, scoring rubrics, multi-round aggregation rules, and controls for self-bias (including randomized evaluator pairing and anonymization) in the methods section. We will revise the abstract to include a concise reference to these mechanisms and the explicit controls for self-bias and contamination, enabling readers to better assess the objectivity claim and interpret the stability metric from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in LOL mutual evaluation derivation

full rationale

The paper proposes a benchmark-free mutual evaluation framework (LOL) integrating four criteria and reports empirical results from experiments on eight LLMs, including internal Top-k consistency of 70.7% and in-family score differences. These metrics are computed directly from the generated evaluation data as descriptive properties of the process rather than predictions or first-principles derivations that reduce to fitted inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way that collapses the central claims. The approach is self-contained by design as a complement to benchmarks, with observations like memorization behaviors standing as independent empirical findings from the mutual evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that LLMs can generate sufficiently objective judgments of other LLMs; no explicit free parameters or new physical entities are introduced in the abstract.

axioms (1)

domain assumption Multi-round mutual evaluations by LLMs can satisfy dynamic, transparent, objective, and professional criteria simultaneously.
Invoked to justify the league structure as a solution to contamination and subjectivity.

invented entities (1)

League of LLMs (LOL) no independent evidence
purpose: Self-governed multi-round mutual evaluation framework
New organizational structure for LLM assessment introduced in the paper.

pith-pipeline@v0.9.0 · 5752 in / 1152 out tokens · 32204 ms · 2026-05-19T03:13:31.909145+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Top-k consistency = 70.7%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeGenTWeb: A First Look at LLM-dominant Websites
cs.NI 2026-04 unverdicted novelty 5.0

DeGenTWeb shows LLM-dominant websites are common and increasing in Common Crawl and Bing search results, but accurate detection is getting harder with newer models.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ahn, J.; Verma, R.; Lou, R.; Liu, D.; Zhang, R.; and Yin, W. 2024. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157

work page arXiv 2024
[3]

Llemma: An Open Language Model For Mathematics

Azerbayev, Z.; Schoelkopf, H.; Paster, K.; Santos, M. D.; McAleer, S.; Jiang, A. Q.; Deng, J.; Biderman, S.; and Welleck, S. 2023. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631

work page internal anchor Pith review arXiv 2023
[4]

Blunch, N. J. 1984. Position bias in multiple-choice questions. Journal of Marketing Research, 21(2): 216--220

work page 1984
[5]

H.; Li, J

Boyko, J.; Cohen, J.; Fox, N.; Veiga, M. H.; Li, J. I.; Liu, J.; Modenesi, B.; Rauch, A. H.; Reid, K. N.; Tribedi, S.; et al. 2023. An interdisciplinary outlook on large language models for scientific research. arXiv preprint arXiv:2311.04929

work page arXiv 2023
[6]

L.; Bucknall, B.; Haupt, A.; Wei, K.; Scheurer, J.; Hobbhahn, M.; et al

Casper, S.; Ezell, C.; Siegmann, C.; Kolt, N.; Curtis, T. L.; Bucknall, B.; Haupt, A.; Wei, K.; Scheurer, J.; Hobbhahn, M.; et al. 2024. Black-box access is insufficient for rigorous ai audits. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254--2272

work page 2024
[7]

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. 2024. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3): 1--45

work page 2024
[8]

Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

Chen, G. H.; Chen, S.; Liu, Z.; Jiang, F.; and Wang, B. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669

work page arXiv 2024
[9]

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. D. O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Cheng, Y.; Chang, Y.; and Wu, Y. 2025. A survey on data contamination for large language models. arXiv preprint arXiv:2502.14425

work page arXiv 2025
[11]

Chern, E.; Zou, H.; Li, X.; Hu, J.; Feng, K.; Li, J.; and Liu, P. 2023. Generative AI for Math: Abel. https://github.com/GAIR-NLP/abel

work page 2023
[12]

N.; Li, T.; Li, D.; Zhu, B.; Zhang, H.; Jordan, M.; Gonzalez, J

Chiang, W.-L.; Zheng, L.; Sheng, Y.; Angelopoulos, A. N.; Li, T.; Li, D.; Zhu, B.; Zhang, H.; Jordan, M.; Gonzalez, J. E.; et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning

work page 2024
[13]

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

C.; and Tianle Li, A

Connor Chen, W.-L. C.; and Tianle Li, A. N. A. 2025. Introducing Sentiment Control: Disentagling Sentiment and Substance

work page 2025
[15]

Deng, C.; Zhao, Y.; Tang, X.; Gerstein, M.; and Cohan, A. 2023. Investigating data contamination in modern benchmarks for large language models. arXiv preprint arXiv:2311.09783

work page arXiv 2023
[16]

T.; and Yadav, V

Etzine, B.; Hashemi, M.; Madhusudhan, N.; Davasam, S.; Sharma, R.; Madhusudhan, S. T.; and Yadav, V. 2025. Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance. arXiv preprint arXiv:2503.05551

work page arXiv 2025
[17]

Ge, Y.; Hua, W.; Mei, K.; Tan, J.; Xu, S.; Li, Z.; Zhang, Y.; et al. 2023. Openagi: When llm meets domain experts. Advances in Neural Information Processing Systems, 36: 5539--5568

work page 2023
[18]

Gudibande, A.; Wallace, E.; Snell, C.; Geng, X.; Liu, H.; Abbeel, P.; Levine, S.; and Song, D. 2023. The False Promise of Imitating Proprietary LLMs. CoRR

work page 2023
[19]

u lsmann, J.; and Kaspar, K. 2014. The interplay between usability and aesthetics: More evidence for the “what is usable is beautiful

Hamborg, K.-C.; H \"u lsmann, J.; and Kaspar, K. 2014. The interplay between usability and aesthetics: More evidence for the “what is usable is beautiful” notion. Advances in Human-Computer Interaction, 2014(1): 946239

work page 2014
[20]

Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2020
[21]

Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Huang, Y.; Lin, Z.; Liu, X.; Gong, Y.; Lu, S.; Lei, F.; Liang, Y.; Shen, Y.; Lin, C.; Duan, N.; et al. 2023. Competition-level problems are effective llm evaluators. arXiv preprint arXiv:2312.02143

work page arXiv 2023
[23]

Jain, N.; Han, K.; Gu, A.; Li, W.-D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; and Stoica, I. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

u chemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; G \

Kasneci, E.; Se ler, K.; K \"u chemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; G \"u nnemann, S.; H \"u llermeier, E.; et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences, 103: 102274

work page 2023
[25]

Kiela, D.; Bartolo, M.; Nie, Y.; Kaushik, D.; Geiger, A.; Wu, Z.; Vidgen, B.; Prasad, G.; Singh, A.; Ringshia, P.; et al. 2021. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337

work page arXiv 2021
[26]

Li, Q.; Cui, L.; Zhao, X.; Kong, L.; and Bi, W. 2024. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. arXiv preprint arXiv:2402.19255

work page arXiv 2024
[27]

Li, T.; Angelopoulos, A.; and Chiang, W.-L. 2024. Does style matter? disentangling style and substance in chatbot arena. LMSYS Blog

work page 2024
[28]

D.; Re, C.; Acosta-Navas, D.; Hudson, D

Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; Newman, B.; Yuan, B.; Yan, B.; Zhang, C.; Cosgrove, C.; Manning, C. D.; Re, C.; Acosta-Navas, D.; Hudson, D. A.; Zelikman, E.; Durmus, E.; Ladhak, F.; Rong, F.; Ren, H.; Yao, H.; WANG, J.; Santhanam, K.; Orr, L.; Zheng, L.; Yuksekgonul, M....

work page 2023
[29]

Lu, P.; Qiu, L.; Yu, W.; Welleck, S.; and Chang, K.-W. 2023. A Survey of Deep Learning for Mathematical Reasoning. In ACL (1)

work page 2023
[30]

Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; and Zhang, D. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Milli \`e re, R.; and Buckner, C. 2024. A philosophical introduction to language models-part ii: The way forward. arXiv preprint arXiv:2405.03207

work page arXiv 2024
[32]

Mirzadeh, I.; Alizadeh, K.; Shahrokhi, H.; Tuzel, O.; Bengio, S.; and Farajtabar, M. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Nejjar, M.; Zacharias, L.; Stiehle, F.; and Weber, I. 2025. LLMs for science:: Usage for code generation and data analysis. Journal of Software: Evolution and Process, 37(1): e2723

work page 2025
[34]

Raghubir, P.; and Valenzuela, A. 2006. Center-of-inattention: Position biases in decision-making. Organizational Behavior and Human Decision Processes, 99(1): 66--80

work page 2006
[35]

L.; Stickland, A

Rein, D.; Hou, B. L.; Stickland, A. C.; Petty, J.; Pang, R. Y.; Dirani, J.; Michael, J.; and Bowman, S. R. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling

work page 2024
[36]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Sainz, O.; Campos, J. A.; Garc \' a-Ferrero, I.; Etxaniz, J.; de Lacalle, O. L.; and Agirre, E. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018

work page arXiv 2023
[37]

A.; Abid, A.; Fisch, A.; Brown, A

Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on machine learning research

work page 2023
[38]

Gemini: A Family of Highly Capable Multimodal Models

Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; Millican, K.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi \`e re, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

N.; Roth, S

Tuch, A. N.; Roth, S. P.; Hornb k, K.; Opwis, K.; and Bargas-Avila, J. A. 2012. Is beautiful really usable? Toward understanding the relation between usability, aesthetics, and affect in HCI. Computers in human behavior, 28(5): 1596--1607

work page 2012
[41]

N.; Kaiser, .; and Polosukhin, I

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, .; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30

work page 2017
[42]

Vendrow, J.; Vendrow, E.; Beery, S.; and Madry, A. 2025. Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461

work page arXiv 2025
[43]

Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

work page 2019
[44]

Xie, W.; Ma, S.; Wang, Z.; Wang, E.; Chen, K.; Sun, X.; and Wang, B. 2024. Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology. arXiv preprint arXiv:2410.14979

work page arXiv 2024
[45]

Xu, C.; Guan, S.; Greene, D.; Kechadi, M.; et al. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244

work page internal anchor Pith review arXiv 2024
[46]

F.; Alon, U.; Neubig, G.; and Hellendoorn, V

Xu, F. F.; Alon, U.; Neubig, G.; and Hellendoorn, V. J. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN international symposium on machine programming, 1--10

work page 2022
[47]

Guanhua Zhang and Moritz Hardt

Yang, S.; Chiang, W.-L.; Zheng, L.; Gonzalez, J. E.; and Stoica, I. 2023. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850

work page arXiv 2023
[48]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Yu, L.; Jiang, W.; Shi, H.; Yu, J.; Liu, Z.; Zhang, Y.; Kwok, J. T.; Li, Z.; Weller, A.; and Liu, W. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830

work page internal anchor Pith review Pith/arXiv arXiv 2019
[50]

Zhong, W.; Cui, R.; Guo, Y.; Liang, Y.; Lu, S.; Wang, Y.; Saied, A.; Chen, W.; and Duan, N. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[52]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Ahn, J.; Verma, R.; Lou, R.; Liu, D.; Zhang, R.; and Yin, W. 2024. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157

work page arXiv 2024

[3] [3]

Llemma: An Open Language Model For Mathematics

Azerbayev, Z.; Schoelkopf, H.; Paster, K.; Santos, M. D.; McAleer, S.; Jiang, A. Q.; Deng, J.; Biderman, S.; and Welleck, S. 2023. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631

work page internal anchor Pith review arXiv 2023

[4] [4]

Blunch, N. J. 1984. Position bias in multiple-choice questions. Journal of Marketing Research, 21(2): 216--220

work page 1984

[5] [5]

H.; Li, J

Boyko, J.; Cohen, J.; Fox, N.; Veiga, M. H.; Li, J. I.; Liu, J.; Modenesi, B.; Rauch, A. H.; Reid, K. N.; Tribedi, S.; et al. 2023. An interdisciplinary outlook on large language models for scientific research. arXiv preprint arXiv:2311.04929

work page arXiv 2023

[6] [6]

L.; Bucknall, B.; Haupt, A.; Wei, K.; Scheurer, J.; Hobbhahn, M.; et al

Casper, S.; Ezell, C.; Siegmann, C.; Kolt, N.; Curtis, T. L.; Bucknall, B.; Haupt, A.; Wei, K.; Scheurer, J.; Hobbhahn, M.; et al. 2024. Black-box access is insufficient for rigorous ai audits. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254--2272

work page 2024

[7] [7]

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. 2024. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology, 15(3): 1--45

work page 2024

[8] [8]

Humans or llms as the judge? a study on judgement biases.arXiv preprint arXiv:2402.10669, 2024

Chen, G. H.; Chen, S.; Liu, Z.; Jiang, F.; and Wang, B. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669

work page arXiv 2024

[9] [9]

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. D. O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Cheng, Y.; Chang, Y.; and Wu, Y. 2025. A survey on data contamination for large language models. arXiv preprint arXiv:2502.14425

work page arXiv 2025

[11] [11]

Chern, E.; Zou, H.; Li, X.; Hu, J.; Feng, K.; Li, J.; and Liu, P. 2023. Generative AI for Math: Abel. https://github.com/GAIR-NLP/abel

work page 2023

[12] [12]

N.; Li, T.; Li, D.; Zhu, B.; Zhang, H.; Jordan, M.; Gonzalez, J

Chiang, W.-L.; Zheng, L.; Sheng, Y.; Angelopoulos, A. N.; Li, T.; Li, D.; Zhu, B.; Zhang, H.; Jordan, M.; Gonzalez, J. E.; et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning

work page 2024

[13] [13]

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

C.; and Tianle Li, A

Connor Chen, W.-L. C.; and Tianle Li, A. N. A. 2025. Introducing Sentiment Control: Disentagling Sentiment and Substance

work page 2025

[15] [15]

Deng, C.; Zhao, Y.; Tang, X.; Gerstein, M.; and Cohan, A. 2023. Investigating data contamination in modern benchmarks for large language models. arXiv preprint arXiv:2311.09783

work page arXiv 2023

[16] [16]

T.; and Yadav, V

Etzine, B.; Hashemi, M.; Madhusudhan, N.; Davasam, S.; Sharma, R.; Madhusudhan, S. T.; and Yadav, V. 2025. Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance. arXiv preprint arXiv:2503.05551

work page arXiv 2025

[17] [17]

Ge, Y.; Hua, W.; Mei, K.; Tan, J.; Xu, S.; Li, Z.; Zhang, Y.; et al. 2023. Openagi: When llm meets domain experts. Advances in Neural Information Processing Systems, 36: 5539--5568

work page 2023

[18] [18]

Gudibande, A.; Wallace, E.; Snell, C.; Geng, X.; Liu, H.; Abbeel, P.; Levine, S.; and Song, D. 2023. The False Promise of Imitating Proprietary LLMs. CoRR

work page 2023

[19] [19]

u lsmann, J.; and Kaspar, K. 2014. The interplay between usability and aesthetics: More evidence for the “what is usable is beautiful

Hamborg, K.-C.; H \"u lsmann, J.; and Kaspar, K. 2014. The interplay between usability and aesthetics: More evidence for the “what is usable is beautiful” notion. Advances in Human-Computer Interaction, 2014(1): 946239

work page 2014

[20] [20]

Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2020

[21] [21]

Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[22] [22]

Huang, Y.; Lin, Z.; Liu, X.; Gong, Y.; Lu, S.; Lei, F.; Liang, Y.; Shen, Y.; Lin, C.; Duan, N.; et al. 2023. Competition-level problems are effective llm evaluators. arXiv preprint arXiv:2312.02143

work page arXiv 2023

[23] [23]

Jain, N.; Han, K.; Gu, A.; Li, W.-D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; and Stoica, I. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

u chemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; G \

Kasneci, E.; Se ler, K.; K \"u chemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; G \"u nnemann, S.; H \"u llermeier, E.; et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences, 103: 102274

work page 2023

[25] [25]

Kiela, D.; Bartolo, M.; Nie, Y.; Kaushik, D.; Geiger, A.; Wu, Z.; Vidgen, B.; Prasad, G.; Singh, A.; Ringshia, P.; et al. 2021. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337

work page arXiv 2021

[26] [26]

Li, Q.; Cui, L.; Zhao, X.; Kong, L.; and Bi, W. 2024. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. arXiv preprint arXiv:2402.19255

work page arXiv 2024

[27] [27]

Li, T.; Angelopoulos, A.; and Chiang, W.-L. 2024. Does style matter? disentangling style and substance in chatbot arena. LMSYS Blog

work page 2024

[28] [28]

D.; Re, C.; Acosta-Navas, D.; Hudson, D

Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; Newman, B.; Yuan, B.; Yan, B.; Zhang, C.; Cosgrove, C.; Manning, C. D.; Re, C.; Acosta-Navas, D.; Hudson, D. A.; Zelikman, E.; Durmus, E.; Ladhak, F.; Rong, F.; Ren, H.; Yao, H.; WANG, J.; Santhanam, K.; Orr, L.; Zheng, L.; Yuksekgonul, M....

work page 2023

[29] [29]

Lu, P.; Qiu, L.; Yu, W.; Welleck, S.; and Chang, K.-W. 2023. A Survey of Deep Learning for Mathematical Reasoning. In ACL (1)

work page 2023

[30] [30]

Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; and Zhang, D. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Milli \`e re, R.; and Buckner, C. 2024. A philosophical introduction to language models-part ii: The way forward. arXiv preprint arXiv:2405.03207

work page arXiv 2024

[32] [32]

Mirzadeh, I.; Alizadeh, K.; Shahrokhi, H.; Tuzel, O.; Bengio, S.; and Farajtabar, M. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Nejjar, M.; Zacharias, L.; Stiehle, F.; and Weber, I. 2025. LLMs for science:: Usage for code generation and data analysis. Journal of Software: Evolution and Process, 37(1): e2723

work page 2025

[34] [34]

Raghubir, P.; and Valenzuela, A. 2006. Center-of-inattention: Position biases in decision-making. Organizational Behavior and Human Decision Processes, 99(1): 66--80

work page 2006

[35] [35]

L.; Stickland, A

Rein, D.; Hou, B. L.; Stickland, A. C.; Petty, J.; Pang, R. Y.; Dirani, J.; Michael, J.; and Bowman, S. R. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling

work page 2024

[36] [36]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Sainz, O.; Campos, J. A.; Garc \' a-Ferrero, I.; Etxaniz, J.; de Lacalle, O. L.; and Agirre, E. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018

work page arXiv 2023

[37] [37]

A.; Abid, A.; Fisch, A.; Brown, A

Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on machine learning research

work page 2023

[38] [38]

Gemini: A Family of Highly Capable Multimodal Models

Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; Millican, K.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi \`e re, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

N.; Roth, S

Tuch, A. N.; Roth, S. P.; Hornb k, K.; Opwis, K.; and Bargas-Avila, J. A. 2012. Is beautiful really usable? Toward understanding the relation between usability, aesthetics, and affect in HCI. Computers in human behavior, 28(5): 1596--1607

work page 2012

[41] [41]

N.; Kaiser, .; and Polosukhin, I

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, .; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30

work page 2017

[42] [42]

Vendrow, J.; Vendrow, E.; Beery, S.; and Madry, A. 2025. Do large language model benchmarks test reliability? arXiv preprint arXiv:2502.03461

work page arXiv 2025

[43] [43]

Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

work page 2019

[44] [44]

Xie, W.; Ma, S.; Wang, Z.; Wang, E.; Chen, K.; Sun, X.; and Wang, B. 2024. Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology. arXiv preprint arXiv:2410.14979

work page arXiv 2024

[45] [45]

Xu, C.; Guan, S.; Greene, D.; Kechadi, M.; et al. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244

work page internal anchor Pith review arXiv 2024

[46] [46]

F.; Alon, U.; Neubig, G.; and Hellendoorn, V

Xu, F. F.; Alon, U.; Neubig, G.; and Hellendoorn, V. J. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN international symposium on machine programming, 1--10

work page 2022

[47] [47]

Guanhua Zhang and Moritz Hardt

Yang, S.; Chiang, W.-L.; Zheng, L.; Gonzalez, J. E.; and Stoica, I. 2023. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850

work page arXiv 2023

[48] [48]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Yu, L.; Jiang, W.; Shi, H.; Yu, J.; Liu, Z.; Zhang, Y.; Kwok, J. T.; Li, Z.; Weller, A.; and Liu, W. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830

work page internal anchor Pith review Pith/arXiv arXiv 2019

[50] [50]

Zhong, W.; Cui, R.; Guo, Y.; Liang, Y.; Lu, S.; Wang, Y.; Saied, A.; Chen, W.; and Duan, N. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[52] [52]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page