Prometheus 2: An open source language model specialized in evaluating other language models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck · 2024 · DOI 10.18653/v1/2024.emnlp-main.248

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 1

citation-polarity summary

use dataset: 1

representative citing papers

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

cs.CL · 2026-04-27 · unverdicted · novelty 8.0

BenchGuard is the first LLM-based automated auditing framework for execution-based agent benchmarks, identifying 12 confirmed issues in ScienceAgentBench and matching 83.3% of expert findings on BIXBench at low cost.

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Unlocking Prompt Infilling Capability for Diffusion Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

citing papers explorer

Showing 7 of 7 citing papers.

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks · cs.CL · 2026-04-27 · unverdicted · none · ref 4
Instance-Optimal Estimation with Multiple LLM Judges on a Budget · cs.LG · 2026-05-22 · unverdicted · none · ref 7
Evaluating Non-English Developer Support in Machine Learning for Software Engineering · cs.SE · 2026-05-07 · unverdicted · none · ref 133
Psychological Steering of Large Language Models · cs.CL · 2026-04-15 · unverdicted · none · ref 33
Unlocking Prompt Infilling Capability for Diffusion Language Models · cs.CL · 2026-04-04 · unverdicted · none · ref 11
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild · cs.SE · 2026-01-25 · conditional · none · ref 28
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach · cs.LG · 2025-02-07 · unverdicted · none · ref 83
