Prometheus: Inducing fine-grained evaluation capability in language models

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models · 2023 · arXiv 2310.08491

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Before and After Temperature: A Distributional View of Creative LLM Generation

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

A per-token feature derived from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank with a Spearman ρ of 0.918 against LLM judges and 0.870 against humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

The FrontierFinance benchmark shows that human financial experts outperform state-of-the-art LLMs on realistic long-horizon tasks, achieving higher scores and producing more client-ready outputs.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation all take the same value on non-degenerate data, while Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.
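
The coincidence described in this summary can be checked numerically. The following is a minimal sketch using only the Python standard library; the label vectors `human` and `judge` are made-up data, and every function name is a hypothetical helper written for illustration, not code from the cited paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's r computed from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def midranks(v):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to midranks."""
    return pearson(midranks(x), midranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b with the standard tie correction."""
    conc = disc = tie_x_only = tie_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            tie_x_only += 1
        elif dy == 0:
            tie_y_only += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    untied_x = conc + disc + tie_y_only  # pairs that differ on x
    untied_y = conc + disc + tie_x_only  # pairs that differ on y
    return (conc - disc) / sqrt(untied_x * untied_y)


def confusion(ref, pred):
    """Return (tp, fp, fn, tn) for 0/1 labels, with `ref` as the reference."""
    tp = sum(r == 1 and p == 1 for r, p in zip(ref, pred))
    fp = sum(r == 0 and p == 1 for r, p in zip(ref, pred))
    fn = sum(r == 1 and p == 0 for r, p in zip(ref, pred))
    tn = sum(r == 0 and p == 0 for r, p in zip(ref, pred))
    return tp, fp, fn, tn


def phi_mcc(ref, pred):
    """Phi coefficient, identical to the Matthews correlation for 2x2 tables."""
    tp, fp, fn, tn = confusion(ref, pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def cohen_kappa(ref, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = confusion(ref, pred)
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


# Made-up labels: the judge marks 5 of 10 items positive, the human 4 of 10.
human = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

for name, fn_ in [("pearson", pearson), ("spearman", spearman),
                  ("kendall_tau_b", kendall_tau_b), ("phi_mcc", phi_mcc),
                  ("cohen_kappa", cohen_kappa)]:
    print(f"{name:14s} {fn_(human, judge):.4f}")
```

On this toy data the four correlation-style statistics agree (about 0.408), while Cohen's κ comes out at 0.400 because the judge and the human assign the positive label at different rates.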

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.

PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

cs.LG · 2025-10-28 · unverdicted · novelty 6.0

PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.

CAMI: Cost-Aware Agent-Guided Multi-Indexing for Semantic Retrieval

cs.IR · 2026-06-14 · unverdicted · novelty 5.0

CAMI frames multi-index construction for semantic retrieval as a budgeted multi-objective portfolio problem and uses agent-guided search plus confidence-aware pruning to find high-recall configurations with reduced evaluation cost.

Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots

cs.AI · 2025-05-25 · unverdicted · novelty 4.0

Multimodal LLMs in robots develop self-identification and predictive awareness through sensorimotor loops, with structural equation modeling linking sensory integration to dimensions of the minimal self.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

cs.CL · 2025-01-03 · unverdicted · novelty 2.0

A literature survey of Small Language Models (1-8B parameters) that can perform comparably to or better than larger models, covering general-purpose and task-specific approaches as well as techniques for creating them.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Self-Rewarding Language Models · cs.CL · 2024-01-18 · conditional · none · ref 100
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why · cs.CL · 2026-05-25 · conditional · none · ref 21
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation all take the same value on non-degenerate data, while Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.
