Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten · 2025 · cs.LG · arXiv 2509.14234

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

open full Pith review browse 6 citing papers arXiv PDF

abstract

Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels-critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this framework Compute as Teacher (CaT) and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9x less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.

representative citing papers

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · novelty 6.0

Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.

AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

cs.CL · 2025-11-24 · conditional · novelty 6.0

DR Tulu-8B trained with RLER outperforms open deep research agents by 15.6% on average and matches proprietary agents while being 1000x cheaper per query.

On the Generalization Gap in Self-Evolving Language Model Reasoning

cs.CL · 2026-05-31 · unverdicted · novelty 5.0

Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

cs.AI · 2025-11-13 · unverdicted · novelty 5.0

A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research cs.CL · 2025-11-24 · conditional · none · ref 1 · internal anchor
DR Tulu-8B trained with RLER outperforms open deep research agents by 15.6% on average and matches proprietary agents while being 1000x cheaper per query.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

fields

years

verdicts

representative citing papers

citing papers explorer