arXiv preprint arXiv:2509.09284

Tree-OPO: Off-Policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning · 2026 · arXiv 2509.09284

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

cs.AI · 2026-04-08 · unverdicted · novelty 6.0

T-STAR consolidates multi-turn trajectories into a Cognitive Tree for variance-reduced step-level advantages and surgical policy optimization via thought grafting at critical points.

citing papers explorer

Showing 3 of 3 citing papers.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · none · ref 40

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

cs.AI · 2026-05-04 · unverdicted · none · ref 34 · 2 links

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

cs.AI · 2026-04-08 · unverdicted · none · ref 2
