Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Pith reviewed 2026-05-15 23:27 UTC · model grok-4.3
The pith
A panel of smaller diverse LLMs judges model outputs better than one large model while costing far less.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Panel of LLM evaluators (PoLL) composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because its members are drawn from disjoint model families, and does so while being over seven times less expensive.
What carries the argument
The PoLL: an aggregated scoring system that combines judgments from multiple smaller LLMs, drawn from disjoint model families, into a single final quality verdict.
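To make that machinery concrete, below is a minimal sketch of how a panel's scores might be pooled, assuming each judge returns a numeric score; the judge names, the `score_with_judge` helper, and the choice between mean pooling and majority vote are illustrative assumptions, not the paper's exact protocol.

```python
from statistics import mean, mode
from typing import Callable, Sequence

# Hypothetical panel of smaller judges drawn from disjoint model families.
PANEL = ("small-judge-family-a", "small-judge-family-b", "small-judge-family-c")

def poll_score(
    question: str,
    candidate_answer: str,
    reference_answer: str,
    score_with_judge: Callable[[str, str, str, str], float],
    judges: Sequence[str] = PANEL,
    pooling: str = "mean",
) -> float:
    """Aggregate per-judge scores into one panel verdict.

    `score_with_judge` is an assumed helper that prompts a single judge model
    and parses its numeric score (e.g., 0/1 correctness or a 1-5 rating).
    """
    scores = [
        score_with_judge(judge, question, candidate_answer, reference_answer)
        for judge in judges
    ]
    if pooling == "mean":
        return mean(scores)  # average pooling for rating-style judgments
    return mode(scores)      # majority vote for discrete labels
```

The aggregation step itself is deliberately trivial; on the paper's account, the claimed gains come from who sits on the panel, not from how the votes are combined.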
If this is right
- Evaluation budgets can expand to cover more test cases or larger model pools because of the cost reduction.
- Model rankings become less skewed by any one family's preferences or limitations.
- The same panel approach scales to new judge settings without retraining or fine-tuning a giant model.
- Diversity across model families becomes a design lever for evaluation rather than raw size alone.
Where Pith is reading between the lines
- Similar diversity-based panels might improve reliability in other LLM tasks such as reasoning chains or safety checks.
- Researchers could test whether adding or swapping specific small models in the panel yields further gains on particular domains.
- The finding invites experiments that directly compare PoLL outputs to human jury panels on the same generations.
Load-bearing premise
The collective judgments of smaller models from disjoint families can capture nuanced quality signals at least as well as a single frontier model without systematic blind spots on the evaluated tasks.
What would settle it
A new dataset where human experts agree more closely with scores from the single large model than with the aggregated PoLL scores would falsify the performance claim.
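That test reduces to straightforward agreement arithmetic. A toy sketch, with Pearson correlation standing in for whatever agreement statistic (Cohen's kappa, Kendall's tau) such a study would actually report, and with made-up scores rather than data from the paper:

```python
from statistics import correlation  # Pearson r; requires Python 3.10+

def agreement_gap(human, single_judge, panel):
    """Positive means the panel tracks human experts more closely than the
    single large judge; consistently negative values on a new dataset would
    cut against the performance claim."""
    return correlation(human, panel) - correlation(human, single_judge)

# Toy, fabricated per-item correctness scores (illustration only).
humans      = [1, 0, 1, 1, 0, 1]
large_judge = [1, 1, 1, 0, 0, 1]
poll_scores = [1, 0, 1, 1, 0, 0]
print(agreement_gap(humans, large_judge, poll_scores))
```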
read the original abstract
As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes replacing single large LLM judges (e.g., GPT-4) with a Panel of LLMs (PoLL) composed of multiple smaller models drawn from disjoint families. Across six datasets and three evaluation settings, it reports that PoLL yields higher agreement with reference scores, exhibits lower intra-model bias, and reduces cost by more than 7×.
Significance. If the empirical gains hold after controlling for ensemble size, the work would supply a practical, lower-cost alternative for LLM-as-judge pipelines while highlighting the value of model-family diversity. The direct comparison against reference scores and the explicit cost analysis are strengths that support reproducibility.
major comments (2)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that bias reduction stems specifically from 'composition of disjoint model families' is not isolated from the effect of ensemble size. No ablation is reported that holds the number of judges fixed while removing family diversity (e.g., five copies or slight variants of a single small model versus the heterogeneous panel). Without this control, the observed reduction could be explained by simple majority/averaging over more independent samples.
- [§5.2] §5.2 (Bias Analysis): the intra-model bias metric is defined only for the single-judge baseline; the corresponding metric for PoLL is not shown to be computed under identical prompt and scoring conditions, leaving open the possibility that prompt variation across the panel contributes to the reported difference.
minor comments (2)
- [Table 2] Table 2: the cost comparison column should explicitly state the token-price assumptions and the number of API calls per evaluation so that the 'over seven times less expensive' figure can be reproduced (a sketch of the required arithmetic follows this list).
- [§3.1] §3.1 (Panel Composition): the criteria used to select the specific smaller models and the exact prompt templates supplied to each family member are described at a high level; adding the full prompts and selection rationale would improve reproducibility.
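On the cost point, the requested reproduction is simple arithmetic once per-token prices and call counts are stated. A minimal sketch, in which every price and token count is a placeholder assumption rather than a figure taken from the paper:

```python
# Placeholder prices in USD per 1K tokens (input, output); not the paper's numbers.
PRICE_PER_1K = {
    "large-judge":   (0.03,    0.06),
    "small-judge-a": (0.0005,  0.0015),
    "small-judge-b": (0.00025, 0.00125),
    "small-judge-c": (0.0005,  0.0015),
}

def eval_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Cost of one judging call under the assumed price table."""
    p_in, p_out = PRICE_PER_1K[model]
    return in_tokens / 1000 * p_in + out_tokens / 1000 * p_out

# Assume one judging call per judge, with a fixed prompt/response length.
single = eval_cost("large-judge", in_tokens=800, out_tokens=50)
panel = sum(eval_cost(m, in_tokens=800, out_tokens=50)
            for m in ("small-judge-a", "small-judge-b", "small-judge-c"))
print(f"single: ${single:.5f}  panel: ${panel:.5f}  ratio: {single / panel:.1f}x")
```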
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the contributions of our work. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims.
read point-by-point responses
Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that bias reduction stems specifically from 'composition of disjoint model families' is not isolated from the effect of ensemble size. No ablation is reported that holds the number of judges fixed while removing family diversity (e.g., five copies or slight variants of a single small model versus the heterogeneous panel). Without this control, the observed reduction could be explained by simple majority/averaging over more independent samples.
Authors: We agree that the current experiments do not fully isolate family diversity from ensemble size, which is a valid concern for substantiating the source of bias reduction. In the revised manuscript we will add a controlled ablation that fixes panel size (e.g., five judges) and directly compares a homogeneous panel (multiple copies or minor variants drawn from a single model family) against the heterogeneous PoLL. This will allow readers to assess whether the observed gains require cross-family diversity or can be achieved by simple ensembling. We have initiated these runs and will report the full results. revision: yes
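A minimal sketch of the promised control, holding panel size fixed while toggling family diversity; the judge identifiers and the `score_with_judge` helper are hypothetical, and a real ablation would also hold prompts, temperatures, and rubrics fixed across both arms:

```python
from statistics import mean

PANEL_SIZE = 5

# Hypothetical judges: the heterogeneous arm draws one small model from each of
# five families, the homogeneous arm reuses a single family five times.
HETEROGENEOUS = ["family-a-small", "family-b-small", "family-c-small",
                 "family-d-small", "family-e-small"]
HOMOGENEOUS = ["family-a-small"] * PANEL_SIZE

def panel_verdict(judges, item, score_with_judge):
    """Mean of per-judge scores for a single evaluation item."""
    return mean(score_with_judge(judge, item) for judge in judges)

def run_ablation(items, score_with_judge):
    """Score every item with both equal-sized panels, so any difference in
    downstream bias or agreement metrics is attributable to family diversity
    rather than to the number of judges."""
    heterogeneous = [panel_verdict(HETEROGENEOUS, it, score_with_judge) for it in items]
    homogeneous = [panel_verdict(HOMOGENEOUS, it, score_with_judge) for it in items]
    return heterogeneous, homogeneous
```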
Referee: [§5.2] §5.2 (Bias Analysis): the intra-model bias metric is defined only for the single-judge baseline; the corresponding metric for PoLL is not shown to be computed under identical prompt and scoring conditions, leaving open the possibility that prompt variation across the panel contributes to the reported difference.
Authors: The intra-model bias values for PoLL were obtained by applying the identical prompt templates, scoring rubrics, and temperature settings to each constituent model individually, then averaging the per-model bias scores. To remove any ambiguity we will expand §5.2 with an explicit statement of these identical conditions, include the per-model bias numbers for the panel members, and clarify the aggregation procedure. This ensures the comparison remains under matched evaluation protocols. revision: yes
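As a sketch of the aggregation this response describes, each panel member's bias is computed under the same prompts and rubric as the single-judge baseline and the per-member values are then averaged; the bias definition below (a judge's score premium for generations from its own model family) is one plausible reading of intra-model bias, not necessarily the paper's exact metric.

```python
from statistics import mean

def judge_bias(judge_family, mean_score_by_generator_family):
    """Score premium a judge gives to generations from its own family,
    relative to the average it gives every other family."""
    own = mean_score_by_generator_family[judge_family]
    others = [score for family, score in mean_score_by_generator_family.items()
              if family != judge_family]
    return own - mean(others)

def panel_bias(per_judge_scores):
    """Average per-member bias for the panel; `per_judge_scores` maps each
    judge's family to its mean scores per generator family, all computed under
    identical prompt, rubric, and temperature settings."""
    return mean(judge_bias(family, scores) for family, scores in per_judge_scores.items())
```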
Circularity Check
No circularity; purely empirical evaluation
full rationale
The paper presents direct empirical comparisons of PoLL outputs versus single large judges on six datasets across three settings, measuring agreement with reference scores, bias metrics, and cost. The reported chain contains no derivation, no first-principles prediction, no fitted parameter renamed as a result, and no load-bearing self-citation. All claims rest on measured performance differences rather than on quantities that reduce to the inputs by construction, so the analysis is self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- panel composition and size
axioms (1)
- domain assumption: LLM outputs can be reliably scored for quality by other LLMs
Forward citations
Cited by 20 Pith papers
- BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence
  BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63....
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
- Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval
  Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
- TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
  TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
- Do AI Coding Agents Log Like Humans? An Empirical Study
  AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
  Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
  Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
- Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
  A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
- Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
  VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
- Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
  LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
- FUSE: Ensembling Verifiers with Zero Labeled Data
  FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and...
- CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
  CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
- Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
  Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
- A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
  The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
- Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
  Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
- Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
  Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.
- On Cost-Effective LLM-as-a-Judge Improvement Techniques
  Ensemble scoring plus task-specific criteria injection raises LLM judge accuracy to 85.8 percent on RewardBench 2, a 13.5-point gain over baseline, with small models gaining the most.
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
  A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
- Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
  Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
discussion (0)