ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Canonical reference
Beyond two-stage training: Cooperative sft and rl for llm reasoning
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to integrate SFT and RLVR in a single stage by reweighting or scheduling their objectives. However, such coupling can be counterproductive because supervised updates are not uniformly beneficial for reward optimization. To address this, we propose BRIDGE, a scalable framework in which SFT learns to supervise RL by selectively transferring knowledge that improves reward optimization. Specifically, BRIDGE alternates two updates at each meta-training step: a base-model update that fuses the SFT and RL gradients, and an update to a lightweight low-rank adapter (LoRA) that coordinates the two objectives by maximizing a cooperative-gain signal, defined as the reward of joint SFT-RL training over an RL-only baseline. Across five mathematical reasoning benchmarks, BRIDGE consistently outperforms two-stage cold start, naive mixing, and representative single-stage integration baselines, yielding over three points average absolute improvement and more stable training dynamics. We further show that BRIDGE extends to logical reasoning and generalizes out-of-distribution to code and science without additional training, while staying robust under noisy rewards.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning benchmarks under strong/weak initialization.
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning benchmarks under strong/weak initialization.
-
Enhancing LLM Metacognition via Cognitive Pairwise Training
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.