TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
hub
Prover-verifier games improve legibility of llm outputs
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Pseudo-Formalization decomposes proofs into self-contained natural language modules for independent LLM-based Block Verification, outperforming LLM-as-judge baselines on olympiad and research math benchmarks while releasing ArxivMathGradingBench.
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
DAC decomposes agentic search into cooperative searcher and generator agents with cross-agent signals (abstention reward and hard-positive augmentation), achieving strong QA benchmark performance via LoRA on a shared backbone.
Weak models used as critics supplying non-misleading revision directions, distilled on-policy via OPCD, improve frozen and trained strong models on reasoning and alignment benchmarks.
Self-trained verification trains verifiers to imitate informed versions of themselves using reference solutions, improving test-time V-R loops and training-time self-improvement with reported gains of 2x on hard math and 14x on scientific reasoning.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
CCO aggregates scoring functions into a calibrated penalty using conformal decision theory to enforce target violation rates for AI oversight on benchmarks like modified SWE-bench and MACHIAVELLI.
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
citing papers explorer
No citing papers match the current filters.