SubMAPG uses a new Partition Multilinear Extension to derive unbiased policy gradients from submodular difference rewards, delivering 1/2-approximation and sublinear dynamic regret for online distributed task allocation in open multi-agent systems.
Journal of Machine Learning Research , volume=
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
method 2polarities
use method 2representative citing papers
Delightful Policy Gradient removes exponential corner trapping in softmax policy optimization for bandits and tabular MDPs, achieving logarithmic escape times and global O(1/t) convergence.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.
MARS replaces additive clipping and soft penalties in multi-agent trust-region methods with a symmetric geometric barrier, matching or exceeding MAPPO and MASPO performance across 47 tasks in eight environments.
A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical outperformance in risk-averse domains.
citing papers explorer
-
Submodular Multi-Agent Policy Learning for Online Distributed Task Allocation in Open Multi-Agent Systems
SubMAPG uses a new Partition Multilinear Extension to derive unbiased policy gradients from submodular difference rewards, delivering 1/2-approximation and sublinear dynamic regret for online distributed task allocation in open multi-agent systems.
-
Delightful Gradients Accelerate Corner Escape
Delightful Policy Gradient removes exponential corner trapping in softmax policy optimization for bandits and tabular MDPs, achieving logarithmic escape times and global O(1/t) convergence.
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Policy Gradient Methods for Non-Markovian Reinforcement Learning
Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.
-
Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning
MARS replaces additive clipping and soft penalties in multi-agent trust-region methods with a symmetric geometric barrier, matching or exceeding MAPPO and MASPO performance across 47 tasks in eight environments.
-
Actor-Critic Algorithm for Dynamic Expectile and CVaR
A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical outperformance in risk-averse domains.