BenchGuard is the first LLM-based automated auditing framework for execution-based agent benchmarks, identifying 12 confirmed issues in ScienceAgentBench and matching 83.3% of expert findings on BIXBench at low cost.
Prometheus 2: An open source language model specialized in evaluating other language models
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
citing papers explorer
-
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
BenchGuard is the first LLM-based automated auditing framework for execution-based agent benchmarks, identifying 12 confirmed issues in ScienceAgentBench and matching 83.3% of expert findings on BIXBench at low cost.
-
Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.