Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Prover-verifier games improve legibility of llm outputs
9 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Self-trained verification trains verifiers to imitate informed versions of themselves using reference solutions, improving test-time V-R loops and training-time self-improvement with reported gains of 2x on hard math and 14x on scientific reasoning.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
CCO aggregates scoring functions into a calibrated penalty using conformal decision theory to enforce target violation rates for AI oversight on benchmarks like modified SWE-bench and MACHIAVELLI.
citing papers explorer
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.