Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
Pith reviewed 2026-05-18 15:40 UTC · model grok-4.3
The pith
Aggregating model rollouts into pseudo-references enables reference-free RL supervision that rivals expert training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Inference compute itself can serve as supervision: generating parallel rollouts and converting them into reference estimates lets models learn without human labels, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. The Compute as Teacher (CaT) framework does this with two components. Reference estimation aggregates the rollouts into a pseudo-reference answer, and reward derivation converts that pseudo-reference into RL rewards, using self-proposed binary rubrics in non-verifiable domains.
What carries the argument
The Compute as Teacher (CaT) framework, which aggregates parallel rollouts to form pseudo-references and derives self-proposed binary rubrics for LLM-based reward scoring.
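The two stages described above can be sketched as a single reward function. This is a minimal sketch, not the authors' implementation: every name here (`cat_rewards`, the `toy_*` stubs) is hypothetical, the callables stand in for LLM calls, and the choice to score a response by the fraction of satisfied rubric criteria is an assumption made for illustration.

```python
from typing import Callable, List

# Hypothetical interfaces; in a real system each of these would be an LLM call.
Synthesizer = Callable[[str, List[str]], str]      # prompt, rollouts -> pseudo-reference
RubricProposer = Callable[[str, str], List[str]]   # prompt, pseudo-reference -> binary criteria
Judge = Callable[[str, str], bool]                 # criterion, response -> satisfied?


def cat_rewards(prompt: str, rollouts: List[str], synthesize: Synthesizer,
                propose_rubrics: RubricProposer, judge: Judge) -> List[float]:
    """Return one reward per rollout, derived from a pseudo-reference."""
    # Stage 1: reference estimation (aggregate rollouts into a pseudo-reference).
    pseudo_ref = synthesize(prompt, rollouts)
    # Stage 2: reward derivation (binary rubrics proposed from the pseudo-reference).
    rubrics = propose_rubrics(prompt, pseudo_ref)
    if not rubrics:
        return [0.0] * len(rollouts)
    # Assumption: scalar reward = fraction of criteria the judge marks satisfied.
    return [sum(judge(c, r) for c in rubrics) / len(rubrics) for r in rollouts]


# Toy stand-ins so the sketch runs without a model.
def toy_synthesize(prompt: str, rollouts: List[str]) -> str:
    # Union of ';'-separated fragments, in first-seen order.
    seen: List[str] = []
    for rollout in rollouts:
        for fragment in (f.strip() for f in rollout.split(";")):
            if fragment and fragment not in seen:
                seen.append(fragment)
    return "; ".join(seen)


def toy_propose_rubrics(prompt: str, pseudo_ref: str) -> List[str]:
    # One criterion per fragment of the pseudo-reference.
    return [f.strip() for f in pseudo_ref.split(";") if f.strip()]


def toy_judge(criterion: str, response: str) -> bool:
    # Substring check in place of an LLM judge.
    return criterion in response


demo_rollouts = ["drink fluids", "drink fluids; rest", "rest; see a doctor if fever lasts"]
demo_rewards = cat_rewards("How do I manage a mild fever?", demo_rollouts,
                           toy_synthesize, toy_propose_rubrics, toy_judge)
```

In an RL loop these per-rollout rewards would feed the policy update; the toy stubs only show the data flow, not the quality of a real synthesizer or judge.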
If this is right
- Trained models match or exceed the quality of inference-time aggregation on HealthBench while using 9x less test-time compute.
- CaT achieves up to 30% relative improvement over the initial policy and competes with training on expert physician annotations.
- On MATH-500, CaT matches the best existing baselines for test-time RL.
- The framework applies as a versatile drop-in method to both non-verifiable and verifiable domains.
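The "30% relative improvement" figure is easy to misread as an absolute gain. As a reminder of the arithmetic, here is the definition with purely illustrative scores (the values 0.40 and 0.52 are assumptions, not numbers from the paper):

```latex
\[
\Delta_{\mathrm{rel}} = \frac{s_{\mathrm{CaT}} - s_{0}}{s_{0}},
\qquad
s_{0} = 0.40,\; s_{\mathrm{CaT}} = 0.52
\;\Rightarrow\;
\Delta_{\mathrm{rel}} = \frac{0.12}{0.40} = 0.30 .
\]
```

Here $s_{0}$ is the initial policy's benchmark score and $s_{\mathrm{CaT}}$ the trained model's. In this illustration a 30% relative improvement corresponds to a 12-point absolute gain.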
Where Pith is reading between the lines
- This suggests models could iteratively improve by repeatedly applying the process to their own outputs.
- It may enable effective post-training in many specialized fields where obtaining expert labels is costly.
- The technique could be combined with other consistency-based methods to further boost performance in open-ended generation tasks.
- Testing CaT on additional domains like legal reasoning or scientific explanation would show how general the approach is.
Load-bearing premise
The pseudo-references formed by aggregating rollouts are accurate enough that rubrics derived from them produce rewards that genuinely improve the model when used in RL.
What would settle it
Run CaT training, then evaluate the resulting model on HealthBench against both the initial model and inference-time aggregation of rollouts. If the trained model fails to match or exceed the aggregation quality, the main result is disproved.
Original abstract
Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels, and critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this framework Compute as Teacher (CaT), and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation, which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation, which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9x less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.
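For the verifiable case the abstract mentions, a reward can be as simple as checking a rollout's final answer against the pseudo-reference's final answer. The sketch below is an illustration under stated assumptions: the function names are hypothetical, answers are assumed to be wrapped in `\boxed{...}`, and matching is plain string equality after trimming whitespace.

```python
import re
from typing import Optional

# Matches \boxed{...} without nested braces; nested content such as
# \boxed{\frac{1}{2}} is NOT handled by this simple pattern.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, pseudo_reference: str) -> float:
    """1.0 if the response's final answer equals the pseudo-reference's, else 0.0."""
    answer = extract_boxed(response)
    target = extract_boxed(pseudo_reference)
    if answer is None or target is None:
        return 0.0
    return 1.0 if answer == target else 0.0
```

String matching is brittle (for example, `0.5` and `1/2` would not match), so a real setup would likely normalize answers before comparing.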
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Compute as Teacher (CaT), a framework that converts inference-time compute from parallel rollouts into reference-free supervision for RL post-training. It consists of reference estimation via aggregation (e.g., synthesis) into a pseudo-reference and reward derivation, including self-proposed binary rubrics generated from the pseudo-reference and scored by an LLM judge for non-verifiable domains. The central empirical claims are that on HealthBench, CaT-trained models match or exceed the quality of inference-time aggregation while using 9x less test-time compute, compete with learning from expert physician annotations (up to +30% relative improvement over the initial policy), and that the approach extends to verifiable domains by matching strong baselines on MATH-500.
Significance. If the results hold under rigorous controls, the work is significant for providing a practical way to generate supervision signals without human labels or programmatic verifiers, especially in high-stakes non-verifiable domains like healthcare guidance. The framework's drop-in versatility across domain types and explicit use of inference compute as a teacher are notable strengths; the introduction of auditable self-proposed rubrics is a concrete technical contribution that could reduce annotation costs in RL settings.
major comments (2)
- §4.1 (HealthBench results): The claims of matching inference-time aggregation quality with 9x less compute and up to +30% relative gains require explicit reporting of experimental controls, including the number of independent training runs, statistical significance tests, baseline implementation details for both the initial policy and inference-time aggregation, and any selection criteria for rollouts; without these, the support for the central claim that CaT yields reliable improvements remains preliminary.
- §3.2 (Reward derivation via self-proposed rubrics): The load-bearing assumption that aggregated pseudo-references yield rubrics whose LLM-judge scores provide a useful RL signal without systematic misalignment to true response quality is not directly tested; the manuscript should include at least a small-scale human validation (e.g., correlation between LLM rubric scores and expert physician ratings on a held-out set) to address the risk that base-policy biases propagate into the reward model.
minor comments (2)
- Figure 2: The flowchart illustrating the CaT pipeline would be clearer with explicit arrows distinguishing the reference-estimation stage from the reward-derivation stage and with a legend for the LLM-judge component.
- Table 1: Add standard deviations or confidence intervals to the reported metrics on both HealthBench and MATH-500 to allow readers to assess variability.
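One way to produce the intervals requested for Table 1 is a percentile bootstrap over per-example scores. This is a generic sketch rather than anything the manuscript specifies; the function name, the resample count, and the fixed seed are all assumptions chosen for illustration.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(scores), lower, upper
```

This captures variability across evaluation examples only. Variability across independent training runs, which the first major comment asks about, would need scores from several seeds.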
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript introducing Compute as Teacher (CaT). The comments help clarify how to strengthen the empirical rigor and validation of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: §4.1 (HealthBench results): The claims of matching inference-time aggregation quality with 9x less compute and up to +30% relative gains require explicit reporting of experimental controls, including the number of independent training runs, statistical significance tests, baseline implementation details for both the initial policy and inference-time aggregation, and any selection criteria for rollouts; without these, the support for the central claim that CaT yields reliable improvements remains preliminary.
  Authors: We agree that additional explicit reporting of experimental controls is necessary to make the reliability of the reported improvements fully transparent. In the revised manuscript we will expand §4.1 (and the appendix) to state the number of independent training runs performed, include statistical significance tests comparing CaT against the reported baselines, provide fuller implementation details for the initial policy and the inference-time aggregation baseline, and clarify the rollout selection criteria used for aggregation. These additions will directly address the concern that the central claims currently rest on preliminary evidence.
  Revision: yes
- Referee: §3.2 (Reward derivation via self-proposed rubrics): The load-bearing assumption that aggregated pseudo-references yield rubrics whose LLM-judge scores provide a useful RL signal without systematic misalignment to true response quality is not directly tested; the manuscript should include at least a small-scale human validation (e.g., correlation between LLM rubric scores and expert physician ratings on a held-out set) to address the risk that base-policy biases propagate into the reward model.
  Authors: We acknowledge that a direct test of alignment between the LLM-judge rubric scores and expert human ratings would strengthen confidence in the reward signal and help rule out systematic bias propagation. We will add a small-scale human validation study to the revised manuscript: on a held-out set of responses we will collect expert physician ratings and report the correlation with the LLM rubric scores derived from the pseudo-references. This addition will be placed in §3.2 or a new subsection of the experiments.
  Revision: yes
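The validation study discussed here comes down to a correlation between two score lists. The rebuttal does not say which correlation would be reported; the sketch below assumes Spearman rank correlation, a common choice when one side is an ordinal expert rating, and all function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(rubric_scores: Sequence[float], expert_ratings: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(rubric_scores) != len(expert_ratings) or len(rubric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(rubric_scores), average_ranks(expert_ratings))
```

A high correlation on a held-out set would support the reward signal. A low one would point to the bias-propagation risk the referee raises.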
Circularity Check
No circularity: empirical framework validated on external benchmarks
The paper introduces Compute as Teacher (CaT) as a practical method that aggregates parallel inference rollouts into pseudo-references and derives binary rubrics for RL rewards in non-verifiable domains. All reported gains on HealthBench (matching inference-time aggregation with 9x less compute, up to +30% over initial policy) and MATH-500 are presented as direct empirical comparisons against baselines and expert annotations. No equations, first-principles derivations, or self-referential definitions appear in the provided text that would make any claimed result equivalent to its inputs by construction. The central components (synthesis aggregator and self-proposed rubrics) are described as design choices whose value is assessed externally rather than presupposed.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of parallel rollouts
- rubric generation hyperparameters
axioms (2)
- domain assumption: Aggregated parallel rollouts yield a higher-quality reference than a single rollout
- domain assumption: An LLM judge can produce reliable binary scores on self-proposed rubrics for non-verifiable tasks
invented entities (1)
- self-proposed rubrics (no independent evidence)
Forward citations
Cited by 3 Pith papers
- What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
  SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
- AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
  AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
- Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
  A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reaso...