Momentum-based federated reinforcement learning with interaction and communication efficiency

Zhang, D · 2024 · arXiv 2412.11006

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

cs.AI · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · none · ref 28

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

cs.AI · 2026-05-02 · unverdicted · none · ref 16

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · none · ref 44

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

cs.AI · 2026-05-09 · unverdicted · none · ref 25 · 2 links
