TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

arXiv: 2505.15692 · v5 · submitted 2025-05-21 · cs.CL · cs.LG

Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao

keywords: template, templaterl, structured, training, enhancing, explicit, grpo, guidance

Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via Monte Carlo Tree Search (MCTS) on a small seed set, then integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable and editable, and it supports online updates, so it can be refined continuously during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.
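
The abstract describes the method only at a high level, so the following is a minimal sketch of how template guidance might plug into a GRPO-style update, not the authors' implementation. The group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation; everything else is a hypothetical illustration: the `Template` class, the prompt layout in `build_guided_prompt`, the smoothed hit-rate rule in `select_template`, and the counters updated by `record_outcome` are assumptions, and the MCTS-based library construction is not shown.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Template:
    """Hypothetical template record: a named sequence of high-level steps."""
    name: str
    steps: list[str]
    hits: int = 0  # guided rollouts that earned a positive reward
    uses: int = 0  # total guided rollouts


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its rollout group.

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_guided_prompt(problem, template):
    """Prepend the template steps to the problem (assumed prompt layout)."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(template.steps, 1))
    return f"Problem: {problem}\nFollow this strategy:\n{steps}\nSolution:"


def select_template(library):
    """Pick the template with the best smoothed hit rate (assumed heuristic)."""
    return max(library.values(), key=lambda t: (t.hits + 1) / (t.uses + 2))


def record_outcome(library, name, reward):
    """Online update of the usage statistics after one guided rollout."""
    t = library[name]
    t.uses += 1
    t.hits += int(reward > 0)


if __name__ == "__main__":
    library = {
        "decompose": Template("decompose", ["Split into subgoals", "Solve each", "Combine"]),
        "backward": Template("backward", ["Start from the target", "Work backward"]),
    }
    record_outcome(library, "backward", 1.0)
    chosen = select_template(library)
    print(build_guided_prompt("Find x such that 2x + 3 = 11.", chosen))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the library in this sketch is a plain dictionary of records, it can be inspected or edited by hand between updates, which mirrors the interpretability and online-update properties the abstract claims for the template library.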

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
cs.LG 2026-05 conditional novelty 6.0

DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
cs.LG 2026-04 conditional novelty 6.0

Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning
cs.LG 2025-08 unverdicted novelty 6.0

EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.