hub Mixed citations

Proximal policy optimization algorithms

Mixed citation behavior. Most common role is background (60%).

29 Pith papers citing it
Background 60% of classified citations

citation-role summary

background 3 · baseline 1 · method 1

representative citing papers

Distributional Reinforcement Learning via the Cramér Distance

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.

COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

COSAC enables scalable per-agent policy gradients in sequential cooperative teams via ridge regression on additive reward decomposition and counterfactual advantages from fictitious policy continuations, extending aristocrat utility with controlled bias-variance bounds.

RTMC: Step-Level Credit Assignment via Rollout Trees

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

RTMC aggregates returns across rollout trees to produce step-level Q-values and advantages, improving pass@1 by 3.2 points over GRPO on SWE-bench Verified.

RAGEN-2: Reasoning Collapse in Agentic RL

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

Efficient 3D Content Reconstruction and Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.

PokeRL: Reinforcement Learning for Pokemon Red

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

PokeRL trains PPO agents to finish early Pokemon Red tasks using a loop-aware environment wrapper, multi-layer anti-loop mechanisms, and dense hierarchical rewards.
