Latent reward: LLM-empowered credit assignment in episodic reinforcement learning

Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji · 2025 · DOI 10.1609/aaai.v39i19.34213

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

cs.LG · 2026-05-07

citing papers explorer

Showing 3 of 3 citing papers.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture cs.SE · 2026-05-02 · unverdicted · none · ref 28
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping cs.LG · 2026-04-13 · unverdicted · none · ref 29
Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs cs.LG · 2026-05-07 · unreviewed · ref 27
