A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.
Prometheus: Inducing fine-grained evaluation capability in language models
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
CAMI frames multi-index construction for semantic retrieval as a budgeted multi-objective portfolio problem and uses agent-guided search plus confidence-aware pruning to find high-recall configurations with reduced evaluation cost.
Multimodal LLMs in robots develop self-identification and predictive awareness through sensorimotor loops, with structural equation modeling linking sensory integration to dimensions of the minimal self.
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.
citing papers explorer
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.