pith. machine review for the scientific record.

arxiv: 2605.13695 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Recognition: no theorem link

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM-as-a-judge · prompting paradigm · Feynman learning technique · self-critique · JudgeBench · pairwise accuracy · no fine-tuning

The pith

A three-stage prompting method lifts LLM judge accuracy from 65% to 79% on hard pairwise comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RTLC, a prompting strategy with three stages (Research, Teach-to-Learn, and Critique) drawn from the Feynman learning technique. It turns a single language model into a more reliable judge of AI outputs without any additional training or tools. On a set of 350 challenging items from JudgeBench, the method raises accuracy by 14 points over a basic prompt. The improvement comes from scaffolding the model to explain the relevant concepts and then review its own pool of candidate answers. Better automated judging matters because it scales the assessment of open-ended model generations.

Core claim

RTLC is a three-stage prompting paradigm that applies the Feynman Learning Technique to LLM judging by first wrapping the input in a pedagogical scaffold for research and teaching, then sampling ten candidate verdicts, and finally having the model critique and select one final verdict, resulting in 78.6% pairwise accuracy on JudgeBench-GPT compared to 64.6% with vanilla prompting.
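
A minimal sketch of that flow, assuming a generic chat-completion client. The `complete` callable, the scaffold wording, and the prompt layout are hypothetical stand-ins; only the stage order, the N=10 sampling at temperature 0.4, and the temperature-0 critique come from the paper.

```python
from typing import Callable, List

# Hypothetical completion function: (prompt, temperature) -> model text.
# Any chat-completion client can stand in; the paper used Claude 3.7 Sonnet.
CompleteFn = Callable[[str, float], str]

# Illustrative scaffold only: the paper's verbatim template is not given,
# but it ports the Feynman loop (study -> teach -> find gaps -> simplify).
FEYNMAN_SCAFFOLD = (
    "Study the question and both candidate answers. "
    "Teach the key concepts as if to a novice. "
    "Identify gaps in your explanation and simplify. "
    "Then state which answer is correct: A or B."
)

def rtlc_judge(question: str, answer_a: str, answer_b: str,
               complete: CompleteFn, n: int = 10) -> str:
    # Stages 1-2 (Research / Teach-to-Learn): wrap the item in the
    # pedagogical scaffold, then sample n independent candidate verdicts
    # at temperature 0.4 for diversity.
    prompt = (f"{FEYNMAN_SCAFFOLD}\n\nQuestion: {question}\n"
              f"Answer A: {answer_a}\nAnswer B: {answer_b}")
    candidates: List[str] = [complete(prompt, 0.4) for _ in range(n)]

    # Stage 3 (Critique): cross-compare the candidate set against the
    # original question and emit one final verdict at temperature 0.
    critique = (prompt + f"\n\nHere are {n} candidate verdicts:\n"
                + "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
                + "\nCritique them and output the single best final verdict (A or B).")
    return complete(critique, 0.0)
```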

What carries the argument

The RTLC three-stage prompting recipe that ports the Feynman Learning Technique into a fixed scaffold for generating and self-critiquing multiple verdicts.

If this is right

  • RTLC achieves higher accuracy than sampling ten responses and taking a majority vote (a minimal version of that baseline is sketched after this list).
  • The largest gain comes from the Teach-to-Learn stage.
  • Performance improves across knowledge, reasoning, math, and coding evaluation categories.
  • RTLC can be combined with score calibration to multiply the accuracy gains.
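
For comparison, the N=10 self-consistency baseline the first bullet refers to (77.7% in the paper) reduces to a vote over the same stage-2 candidates. A minimal sketch; note that tie-breaking (here, `Counter`'s first-seen order) is unspecified in the source.

```python
from collections import Counter
from typing import List

def majority_vote(verdicts: List[str]) -> str:
    # Self-consistency: keep the most frequent of the N sampled verdicts
    # instead of critiquing the set. Ties fall to the first-seen verdict.
    return Counter(verdicts).most_common(1)[0][0]

# majority_vote(["A", "B", "A", "A"]) -> "A"
```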

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Structured prompting like this may allow smaller models to serve as effective judges in place of larger ones.
  • The approach could be tested on other evaluation benchmarks to see if similar gains appear.
  • By reducing reliance on human raters, RTLC points toward more scalable AI development cycles.

Load-bearing premise

That the reported gains are specifically due to the RTLC stages and not simply from increased token usage or the particular benchmark setup.

What would settle it

Replicating the JudgeBench evaluation with the same model but using a different prompting method that consumes the same number of tokens and finding comparable accuracy to RTLC.
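
One way that control could be harnessed, assuming hypothetical judge callables that return a verdict plus their output-token count; only the equal-token constraint and the 350-item set come from the text.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a judge maps one benchmark item (a dict with a
# "gold" label) to its verdict and the number of output tokens consumed.
Judge = Callable[[dict], Tuple[str, int]]

def accuracy_and_cost(judge: Judge, items: List[dict]) -> Tuple[float, float]:
    correct, tokens = 0, 0
    for item in items:
        verdict, n_tok = judge(item)
        correct += int(verdict == item["gold"])
        tokens += n_tok
    return correct / len(items), tokens / len(items)

# Run RTLC and a non-RTLC method (e.g. a long chain-of-thought prompt)
# tuned until their mean output tokens match, then compare accuracies on
# the same 350 items: comparable accuracy would undercut the claim that
# the RTLC stages, rather than the token budget, drive the gain.
```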

Figures

Figures reproduced from arXiv: 2605.13695 by Andrea Morandi.

Figure 1: The RTLC pipeline. Stage 1 (Research) wraps the input prompt … (image not reproduced)
Figure 2: Headline accuracy on the 350-item JudgeBench-GPT split. The vanilla … (image not reproduced)
Figure 3: Cost-accuracy frontier plotted in relative output-token units (vanilla = … (image not reproduced)
Figure 4: Per-category accuracy on JudgeBench (knowledge, reasoning, math, … (image not reproduced)
original abstract

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RTLC, a three-stage prompting paradigm (Research, Teach-to-Learn, Critique) inspired by the Feynman Learning Technique to improve LLM-as-judge accuracy on JudgeBench without fine-tuning, retrieval, or tools. It reports that Claude 3.7 Sonnet's pairwise accuracy on 350 hard JudgeBench-GPT items rises from 64.6% (vanilla single-shot) to 78.6% (RTLC with N=10 critique), outperforming N=10 self-consistency (77.7%) and zero-shot (74.0%), with a three-step ablation attributing +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalization, and +0.9 pp to explicit critique.
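
One check worth making explicit: the three ablation deltas account exactly for the headline gain,

$64.6 + \underbrace{9.4}_{\text{Teach-to-Learn}} + \underbrace{3.7}_{N=10\ \text{marginalisation}} + \underbrace{0.9}_{\text{critique}} = 78.6,$

so the reported decomposition is exhaustive (it sums to the full 14.0 pp) rather than a partial attribution.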

Significance. If the accuracy gains prove reproducible and driven by the specific RTLC structure, the work offers a practical, training-free method to boost LLM judging performance on objective-correctness tasks, with added value from the ablation breakdown, cost-accuracy frontier analysis, and discussion of orthogonal composition with post-hoc calibration. The empirical focus on a public benchmark and stage-wise contributions strengthens its utility for evaluation pipelines in generative NLP.

major comments (2)
  1. [Method section (stage descriptions)] The manuscript supplies only high-level summaries of the three RTLC stages (fixed pedagogical scaffold, N=10 candidates at temperature 0.4, self-critique at temperature 0) without verbatim prompt templates, system instructions, or formatting details. This prevents isolating whether the 14.0 pp gain (64.6% to 78.6%) arises from the intended Feynman-inspired mechanism or from unstated factors such as increased token budget or prompt wording.
  2. [Results (JudgeBench-GPT evaluation)] The reported accuracies and ablation deltas lack error bars, statistical significance tests, or details on run-to-run variance, which are required to substantiate that the gains over self-consistency (77.7%) are reliable rather than benchmark-specific or model-sensitive artifacts.
minor comments (2)
  1. [Abstract and §4] The abstract and results refer to 'JudgeBench-GPT (350 hard pairwise items)' without clarifying its exact relation to the public JudgeBench split or selection criteria; adding this detail would aid reproducibility.
  2. [Appendix] Consider moving the full prompt templates and any additional error-budget tables to an appendix to support the central reproducibility claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that strengthen reproducibility and statistical reporting while preserving the core contributions of the work.

point-by-point responses
  1. Referee: Method section (stage descriptions): The manuscript supplies only high-level summaries of the three RTLC stages (fixed pedagogical scaffold, N=10 candidates at temperature 0.4, self-critique at temperature 0) without verbatim prompt templates, system instructions, or formatting details. This prevents isolating whether the 14.0 pp gain (64.6% to 78.6%) arises from the intended Feynman-inspired mechanism or from unstated factors such as increased token budget or prompt wording.

    Authors: We agree that the absence of verbatim prompts limits the ability to attribute gains precisely to the RTLC structure. In the revised manuscript we will add a new appendix containing the complete prompt templates for all three stages, including system instructions, formatting constraints, and temperature settings. This will allow readers to replicate the exact experimental conditions and better isolate the contribution of the pedagogical scaffold versus token budget or wording effects. revision: yes

  2. Referee: Results (JudgeBench-GPT evaluation): The reported accuracies and ablation deltas lack error bars, statistical significance tests, or details on run-to-run variance, which are required to substantiate that the gains over self-consistency (77.7%) are reliable rather than benchmark-specific or model-sensitive artifacts.

    Authors: We acknowledge that reporting variance and significance tests would strengthen the claims. Because each RTLC configuration incurs substantial API cost, we conducted single runs per setting. In revision we will (i) add a limitations paragraph explaining the single-run design, (ii) report McNemar’s test p-values for the key pairwise comparisons against self-consistency and zero-shot baselines, and (iii) include a small-scale multi-run experiment (N=3) on a 100-item subset to provide indicative error bars. Full multi-run variance on the entire 350-item set remains computationally prohibitive without additional resources. revision: partial
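
For concreteness, the exact McNemar test promised in (ii) needs only per-item correctness for the two judges on the same 350 items. A minimal sketch using scipy's binomial test; variable names are illustrative, and at least one discordant pair is assumed.

```python
from typing import Sequence
from scipy.stats import binomtest

def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    # Exact McNemar test on paired binary outcomes: only the items where
    # the two judges disagree (discordant pairs) carry information.
    n01 = sum(a and not b for a, b in zip(correct_a, correct_b))  # A right, B wrong
    n10 = sum(b and not a for a, b in zip(correct_a, correct_b))  # B right, A wrong
    # Under H0 the discordant pairs split 50/50; two-sided exact binomial.
    return binomtest(n01, n01 + n10, 0.5).pvalue
```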

Circularity Check

0 steps flagged

No circularity: purely empirical prompting study with independent benchmark measurements

full rationale

The manuscript is an empirical evaluation of a prompting recipe on JudgeBench-GPT. It reports direct accuracy measurements (64.6% baseline to 78.6% with RTLC) and ablations (+9.4 pp Teach-to-Learn, +3.7 pp N=10, +0.9 pp critique) without any equations, derivations, fitted parameters, or self-referential definitions. No load-bearing step reduces a claimed result to its own inputs by construction, and no self-citations justify uniqueness theorems or ansatzes. The evaluation is falsifiable via benchmark reproduction and remains self-contained against external measurements.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The approach relies on standard LLM prompting capabilities with no new postulated entities. Free parameters are limited to standard hyperparameters for sampling and temperature; no parameters are fitted to the target accuracy results in a circular manner.

free parameters (3)
  • N (number of candidates) = 10
    Number of independent verdicts generated in stage 2 for marginalisation
  • Candidate generation temperature = 0.4
    Temperature setting for diversity in stage 2
  • Critique temperature = 0
    Temperature for deterministic final output in stage 3
axioms (1)
  • domain assumption: LLMs can follow complex multi-stage pedagogical instructions to improve judgment quality
    The method assumes the model will properly execute the Research, Teach-to-Learn, and Critique stages when prompted

pith-pipeline@v0.9.0 · 5651 in / 1528 out tokens · 65094 ms · 2026-05-14T19:33:16.648397+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    JudgeBench: A benchmark for evaluating LLM-based judges

    S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang, R. A. Popa, and I. Stoica, “JudgeBench: A benchmark for evaluating LLM-based judges,” in International Conference on Learning Representations (ICLR), 2025, arXiv:2410.12784.

  2. [2]

    Two ways to de-bias an LLM-as-a-Judge: A continuous-score comparison of hierarchical Bayesian calibration and Neural-ODE score transport

    A. Morandi, “Two ways to de-bias an LLM-as-a-Judge: A continuous-score comparison of hierarchical Bayesian calibration and Neural-ODE score transport,” 2026. [Online]. Available: https://arxiv.org/abs/2605.09227.

  3. [3]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” in Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.

  4. [4]

    UltraFeedback: Boosting language models with scaled AI feedback

    G. Cui, L. Yuan, W. Zhu, Z. Liu, and M. Sun, “UltraFeedback: Boosting language models with scaled AI feedback,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 9722–9744. [Online]. Available: https://proceedings.mlr.press/v235/cui24f.html.

  5. [5]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in International Conference on Learning Representations (ICLR), 2023, arXiv:2203.11171.

  6. [6]

    Self-Refine: Iterative Refinement with Self-Feedback

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, and P. Clark, “Self-refine: Iterative refinement with self-feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.17651.

  7. [7]

    Constitutional AI: Harmlessness from AI feedback

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...

  8. [8]

    Improving factuality and reasoning in language models through multiagent debate

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” in Proceedings of the 41st International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 235, 2024, pp. 11733–11763.

  9. [9]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2022, arXiv:2201.11903.

  10. [10]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2305.10601.