arxiv: 2605.27088 · v1 · pith:UMS4CLDAnew · submitted 2026-05-26 · 💻 cs.CL · cs.LG

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

Pith reviewed 2026-06-29 17:47 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords prompt optimization · math tutoring · large language models · training-free methods · pedagogical alignment · system prompts · educational codebook · out-of-distribution evaluation

The pith

Optimizing system prompts alone produces math-tutoring LLMs that beat RL-trained baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether evolving only the system prompt through API calls can replace reinforcement learning for aligning LLMs as math tutors. It adapts seven published prompt methods and adds five education-specific ones, then evaluates all twelve under five conditions on two out-of-distribution benchmark suites. Every best-per-method configuration exceeds the strongest RL baseline's total score of 0.633, and the proposed ParetoGrad method achieves the best balance across post-test solve rate, leak control, and helpfulness. Behavioral coding with an 82-code educational codebook shows training-free methods apply teaching-knowledge patterns at two to three times the rate of RL models, with a compensating drop in intent-level scaffolding. A task-dependent reasoning mode effect appears consistently in both paradigms.

Core claim

Training-free prompt optimization of system prompts alone can produce LLM math tutors that surpass RL-trained models on a composite score, with the ParetoGrad method delivering the strongest Pareto balance across solve rate, leak control, and helpfulness. Training-free approaches exhibit 2-3x higher rates of teaching-knowledge patterns and a ~10 percentage-point reduction in intent-level scaffolding, while both paradigms display the same task-dependent reasoning mode effect.

What carries the argument

Training-free evolution of system prompts via API calls, paired with an 82-code educational codebook that quantifies teaching-knowledge patterns and scaffolding in responses.
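
As a rough illustration of what "evolving only the system prompt" can look like, here is a minimal sketch of a greedy search loop over prompts. It is not ParetoGrad or any of the twelve methods in the paper. The names `evolve_prompt`, `toy_propose`, `toy_score`, and `HINTS` are hypothetical, and the proposer and scorer are toy stand-ins for what would be LLM API calls in practice.

```python
import random
from typing import Callable, List, Tuple


def evolve_prompt(
    seed_prompt: str,
    propose: Callable[[str, random.Random], List[str]],
    score: Callable[[str], float],
    iterations: int = 5,
    rng_seed: int = 0,
) -> Tuple[str, float]:
    """Greedy hill climbing over system prompts; model weights are never touched."""
    rng = random.Random(rng_seed)
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(iterations):
        for candidate in propose(best_prompt, rng):
            candidate_score = score(candidate)
            # Keep a candidate only if it strictly improves the score.
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins (hypothetical): in practice both would be LLM API calls.
HINTS = [
    "Ask a guiding question before explaining.",
    "Never state the final answer.",
    "Check the student's reasoning step by step.",
]


def toy_propose(prompt: str, rng: random.Random) -> List[str]:
    """Append one unused hint; a real proposer would ask an LLM to rewrite the prompt."""
    unused = [h for h in HINTS if h not in prompt]
    return [prompt + " " + rng.choice(unused)] if unused else []


def toy_score(prompt: str) -> float:
    """Fraction of hints present; a real scorer would run simulated tutoring dialogues."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


best, value = evolve_prompt("You are a math tutor.", toy_propose, toy_score)
```

The only state that changes between iterations is the prompt string, which is why this style of optimization needs API access rather than GPUs.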

If this is right

  • All twelve best-per-method configurations surpass the RL baseline total score of 0.633.
  • ParetoGrad achieves the best overall balance rather than leading on any single metric.
  • Training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL models.
  • A task-dependent reasoning mode effect holds across both training-free and RL paradigms.
  • Pedagogically aligned tutors can be developed with prompts and minimal compute.
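
To make the "balance rather than leading on any single metric" point concrete, the sketch below computes a Pareto front over three higher-is-better objectives. It assumes that "Pareto balance" refers to ordinary non-dominance, which the summary does not spell out. The method names and scores are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

# (solve rate, 1 - leak rate, helpfulness); higher is better on every axis.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(methods: Dict[str, Scores]) -> List[str]:
    """Return the names of methods that no other method dominates, sorted by name."""
    return sorted(
        name
        for name, s in methods.items()
        if not any(
            dominates(other, s)
            for other_name, other in methods.items()
            if other_name != name
        )
    )


# Hypothetical scores, for illustration only.
toy = {
    "method_a": (0.70, 0.50, 0.60),
    "method_b": (0.60, 0.60, 0.60),
    "method_c": (0.55, 0.45, 0.55),  # dominated by method_a and method_b
    "method_d": (0.65, 0.55, 0.70),
}
front = pareto_front(toy)
```

A method can sit on the front without topping any single column, which is the sense in which a balanced method can be preferred over one that wins a single metric.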

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-only approach could be applied to tutoring in other subjects without retraining.
  • Lower compute requirements may allow smaller teams or schools to build custom tutors.
  • Live classroom trials would test whether the measured patterns produce better student outcomes.
  • The observed reasoning mode effect suggests a general property of LLM tutoring rather than a training artifact.

Load-bearing premise

The two OOD benchmark suites and the 82-code educational codebook sufficiently capture real pedagogical quality and generalization in math tutoring scenarios.

What would settle it

A controlled study with actual students using the optimized prompts versus the RL baseline, measuring real learning gains on post-tests; equal or lower gains would undermine the claim.

Figures

Figures reproduced from arXiv: 2605.27088 by Eunjoo Lee, Hoilym Kwon, Jeongsu Moon, Kyungtae Joo, Minchul Shin, Sookbun Lee, Unggi Lee, Yeil Jeong.

Figure 1: Post-test solve rate (Rsol, K=8) vs. leak rate trade-off across the 12 listed training-free methods; bubble size reflects helpfulness. Our proposed ParetoGrad (⋆) achieves the best Pareto balance across the three objectives without updating model weights.

Figure 2: Left shows RL-based tutor alignment and center shows our training-free prompt optimization, illustrated with ParetoGrad as a representative instance; right lists method highlights. RL updates tutor weights θ via GRPO over reward components (solve, non-leak, helpfulness, thinking), requiring multi-GPU training over 10K+ problems. In contrast, our approach keeps the tutor and student models frozen and evolve…

Figure 3: Left compares NoThink and Think (avg of 4 think conditions) on in-domain metrics (0-1 scale). Thinking modestly degrades leak control (1−Leak drops from 0.55 to 0.43). Right shows OOD benchmark averages. Think improves MathTutorBench (+0.87) but hurts OpenLearnLM (−0.44), revealing a task-dependent reasoning mode effect.

Figure 4: Left plots in-domain Rtotal vs. OOD MTB-Avg across the 12 listed methods. Reward maximization is essentially uncorrelated with OOD MathTutorBench performance (ρ = 0.01, p = 0.96) and only weakly correlated with OpenLearnLM (ρ = 0.25, p = 0.41). Center compares sentence-multilabel code frequency (%) between RL-trained models (Dinucu-Jianu et al., 2025; Lee et al., 2026a) and training-free methods (CondBridge,…

Figure 5: Left shows optimization convergence for 6 methods under a shared 500-evaluation budget; gradient methods converge within 10 iterations while dual-objective methods improve gradually. Center displays code-to-code transition probability differences (high − low). High-performance methods chain content delivery codes (Step-by-step, Information provision), while low-performance methods chain question codes (Exp…

Figure 6: Hierarchical clustering (Ward linkage on Eu…

Figure 9: Distribution of 7-category code-instance % …

Figure 8: Per-method effect of the pedagogical seed …
Original abstract

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization (evolving only the system prompt via API calls) can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that training-free prompt optimization—evolving only the system prompt via API calls—can serve as a practical alternative to RL-based training for aligning LLMs as math tutors. It adapts 7 published methods and introduces 5 education-specialized methods, evaluates the resulting 12 methods under 5 conditions on 2 OOD benchmark suites, and reports that all 12 best-per-method configurations surpass the strongest RL baseline (R_total=0.633), with the proposed ParetoGrad achieving the best multi-objective Pareto balance across post-test solve rate, leak control, and helpfulness. Behavioral analysis via an 82-code educational codebook shows training-free methods using teaching-knowledge patterns at 2-3x the rate of RL models (with a compensating drop in intent-level scaffolding) and identifies a task-dependent reasoning-mode effect consistent across paradigms.

Significance. If the chosen benchmarks and codebook are accepted as valid proxies, the result would indicate that prompt optimization alone can yield pedagogically stronger tutors than RL at far lower compute cost, while also surfacing interpretable differences in teaching behavior. The multi-method comparison and Pareto analysis provide a concrete demonstration that training-free approaches need not trade off across the three axes.

major comments (1)
  1. [Evaluation and Behavioral Analysis sections] The headline superiority claim (all 12 configs > R_total=0.633 and ParetoGrad best on the three-way front) and the behavioral conclusion (2-3x teaching-knowledge rate) rest on the 2 OOD suites and 82-code codebook being faithful proxies for pedagogical quality and generalization. The manuscript supplies no validation of the codebook (inter-annotator agreement, correlation with learning gains, or coverage of common student misconceptions), nor any argument that the chosen OOD suites match the distribution of scaffolding needs that would arise in live tutoring; this is load-bearing for the pedagogical interpretation.
minor comments (1)
  1. Abstract and §3: R_total is reported as 0.633 without an explicit decomposition into its constituent metrics in the opening summary; a one-sentence definition would improve readability.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the detailed feedback emphasizing the need to validate our evaluation proxies. We respond point-by-point below and outline planned revisions.

  1. Referee: [Evaluation and Behavioral Analysis sections] The headline superiority claim (all 12 configs > R_total=0.633 and ParetoGrad best on the three-way front) and the behavioral conclusion (2-3x teaching-knowledge rate) rest on the 2 OOD suites and 82-code codebook being faithful proxies for pedagogical quality and generalization. The manuscript supplies no validation of the codebook (inter-annotator agreement, correlation with learning gains, or coverage of common student misconceptions), nor any argument that the chosen OOD suites match the distribution of scaffolding needs that would arise in live tutoring; this is load-bearing for the pedagogical interpretation.

    Authors: We agree that stronger validation of the proxies would reinforce the claims. In the revised manuscript we will add inter-annotator agreement statistics for the 82-code codebook (computed during annotation) and expand the Methods and Discussion sections with explicit arguments for the OOD suites, citing their coverage of diverse math topics, error types, and scaffolding scenarios drawn from prior educational datasets. A direct empirical correlation between the codebook and measured learning gains would require a separate human-subject study that lies outside the present scope; we will therefore note this explicitly as a limitation while retaining the comparative behavioral analysis as an interpretable signal across paradigms. revision: partial
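
For readers unfamiliar with inter-annotator agreement, the sketch below computes Cohen's kappa for two annotators. The source does not say which agreement statistic the authors plan to report, so the choice of kappa is an assumption. The example also simplifies to one label per sentence, whereas the figure captions describe sentence-multilabel coding, where a statistic like this would more plausibly be computed per code. The label names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled independently at their own base rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four tutor sentences.
annotator_1 = ["hint", "hint", "question", "question"]
annotator_2 = ["hint", "question", "question", "question"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5.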

standing simulated objections not resolved
  • Empirical correlation between the 82-code codebook and actual student learning gains from live tutoring sessions

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper is an empirical evaluation comparing 12 prompt-optimization configurations (7 adapted + 5 proposed) against an RL baseline on two external OOD benchmark suites, using post-test solve rate, leak control, helpfulness, and behavioral rates from an 82-code educational codebook. No equations, parameter fits, self-definitional reductions, or load-bearing self-citations are present that would make any reported superiority equivalent to its inputs by construction. All claims rest on direct measurement against independent benchmarks and codebook analysis, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The evaluation implicitly assumes the chosen benchmarks and codebook are valid proxies for pedagogical quality.

pith-pipeline@v0.9.1-grok · 5740 in / 1144 out tokens · 32283 ms · 2026-06-29T17:47:50.582099+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. 2026. GEPA: Reflective prompt evolution can outperform reinforcement learning. In Proceedings...

  2. [2]

    Alon Albalak, Daman Agarwal, Pratyush Maini, Jon Saad-Falcon, and Tatsunori Hashimoto. 2025. BigMath: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387

  3. [3]

    Deborah Loewenberg Ball, Mark Hoover Thames, and Geoffrey Phelps. 2008. Content knowledge for teaching: What makes it special? Journal of Teacher Education, 59(5):389–407

  4. [4]

    Michelene T.H. Chi and Ruth Wylie. 2014. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4):219–243

  5. [5]

    David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, and Mrinmaya Sachan. 2025. From problem-solving to teaching problem-solving: Aligning LLMs with pedagogy using reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  6. [6]

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR)

  7. [7]

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling declarative language model calls into self-improving pipelines. In International Conference on Learning Representations

  8. [8]

    LearnLM Team, Google. 2025. LearnLM: Improving Gemini for learning. arXiv preprint arXiv:2412.16429

  9. [9]

    Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park, Younghoon Jeon, Sungmin Cho, Junbo Koh, Yeil Jeong, and Gyeonggeon Lee. 2026a. Rewarding how models think pedagogically: Integrating pedagogical reasoning and thinking rewards for LLMs in education. arXiv preprint arXiv:2601.14560

  10. [10]

    Unggi Lee, Sookbun Lee, Heungsoo Choi, Jinseo Lee, Haeun Park, Younghoon Jeon, Sungmin Cho, Minju Kang, Junbo Koh, Jiyeong Bae, Minwoo Nam, Juyeon Eun, Yeonji Jung, and Yeil Jeong. 2026b. OpenLearnLM benchmark: A unified framework for evaluating knowledge, skill, and attitude in educational large language models. arXiv preprint arXiv:2601.13882

  11. [11]

    Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. In Findings of the Association for Computational Linguistics: EMNLP 2023

  12. [12]

    Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2025. MathTutorBench: A benchmark for measuring open-ended pedagogical capabilities of LLM tutors. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  13. [13]

    Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. arXiv preprint arXiv:2406.11695

  14. [14]

    George Pólya. 1945. How to Solve It: A New Aspect of Mathematical Method. Princeton University Press

  15. [15]

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  16. [16]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  17. [17]

    Lee S. Shulman. 1986. Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2):4–14

  18. [18]

    Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. In Proceedings of the International Conference on Artificial Intelligence in Education

  19. [19]

    David Wood, Jerome S. Bruner, and Gail Ross. 1976. The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2):89–100

  20. [20]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024. Large language models as optimizers. arXiv preprint arXiv:2309.03409

  21. [21]

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496

  22. [22]

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. 2026. Agentic context engineering: Evolving contexts for self-improving language models. In Proceedings of the Fourteenth International Conference on Learning Representatio...

  23. [23]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others. 2023. Judging LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural Information Processing Systems, 36

  24. [24]

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. In International Conference on Learning Representations
