MIT Press, 2018

Richard S Sutton, Andrew G Barto · 2018

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

cs.LG · 2025-10-17 · unverdicted · novelty 7.0

Establishes last-iterate convergence rates for on-policy Q-learning under minimal irreducibility assumptions, with sample complexity O(1/ξ²) matching off-policy up to exploration factors.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

DiPRL trains nearly discrete programmatic policies in RL by adding architecture entropy regularization to gradient-based optimization, avoiding performance collapse from post-hoc discretization.

citing papers explorer

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
stat.ML · 2026-05-06 · unverdicted · none · ref 18
A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
cs.LG · 2025-10-17 · unverdicted · none · ref 1
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG · 2025-05-16 · unverdicted · none · ref 13
DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization
cs.LG · 2026-05-18 · unverdicted · none · ref 29
