TreeRL: LLM Reinforcement Learning with On-Policy Tree Search

URL https://arxiv · 2025 · arXiv 2506.11902

6 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 3 · use method: 1

representative citing papers

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Reflective Prompted Policy Optimization uses a Critic-LLM to inspect full trajectories and propose grounded revisions, yielding higher mean best rewards, faster near-optimal performance, and greater stability than scalar-reward baselines across ten environments.

A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain (IG) within groups of turns at the same depth, rescaling cumulative advantages by the square root of the term count, and modulating each turn's clipping range according to its normalized IG.

Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

cs.LG · 2025-11-01 · unverdicted · novelty 6.0

Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Your Model Diversity, Not Method, Determines Reasoning Strategy

cs.AI · 2026-04-12 · unverdicted · novelty 5.0

The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

citing papers explorer

Showing 6 of 6 citing papers.

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias · cs.LG · 2026-05-08 · unverdicted · none · ref 5
A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping · cs.CL · 2026-05-07 · unverdicted · none · ref 17
Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse · cs.LG · 2025-11-01 · unverdicted · none · ref 4
Mind DeepResearch Technical Report · cs.AI · 2026-04-16 · unverdicted · none · ref 12
Your Model Diversity, Not Method, Determines Reasoning Strategy · cs.AI · 2026-04-12 · unverdicted · none · ref 5
A Survey of Reinforcement Learning for Large Reasoning Models · cs.CL · 2025-09-10 · accept · none · ref 198