Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

Aditya Mahajan; Amit Sinha; Matthieu Geist

arxiv: 2604.09495 · v1 · submitted 2026-04-10 · 💻 cs.MA

Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

Amit Sinha , Matthieu Geist , Aditya Mahajan This is my paper

Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3

classification 💻 cs.MA

keywords Dec-POMDPagent-state policiespolicy iterationiterated best responserisk-seekingmonotonic improvementpolynomial runtimelimited memory

0 comments

The pith

An iterated best-response algorithm with risk-seeking conservative updates finds locally optimal limited-memory policies for Dec-POMDPs in polynomial time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dec-POMDPs model decentralized decision making under uncertainty but are intractable to solve optimally because policies can depend on full observation histories. The paper shows that restricting agents to policies with a small number of internal states still allows an iterated best-response procedure to improve policies monotonically until a local optimum is reached. By modifying the objective to encourage risk-seeking behavior during conservative updates, the method reaches better local optima than standard approaches. This matters because many practical multi-agent systems have limited onboard memory and cannot store or compute full histories, yet still need reliable performance guarantees. The approach runs in time polynomial in the size of the Dec-POMDP model rather than exponential in the horizon.

Core claim

The central claim is that an iterated best response algorithm using risk-seeking conservative policy iteration on agent-state policies guarantees monotonic improvements and converges to a local optimum in polynomial time relative to the Dec-POMDP model size. Empirical evaluation on benchmark problems demonstrates that the resulting policies achieve performance comparable to state-of-the-art methods while respecting memory limits, and that solution quality increases when more agent states are permitted.

What carries the argument

Agent-state based policies, which map from a fixed finite set of internal memory states to actions, inside an iterated best-response loop that applies risk-seeking modifications to a conservative policy iteration step.

If this is right

The algorithm terminates after a number of iterations that is polynomial in the size of the Dec-POMDP model.
Each best-response step produces a strictly higher value for the updating agent.
Increasing the number of permitted agent states produces strictly better policies on the tested benchmarks.
The final policy matches or approaches the performance of existing methods that lack runtime guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same risk-seeking modification could be applied to other compact policy representations, such as finite-state controllers learned from data.
In deployed multi-robot teams the polynomial runtime would let designers trade memory size against computation time in a predictable way.
The approach hints that local search over memory-constrained policies may remain useful even when the global optimum lies outside the class.
Extending the risk-seeking term to continuous observation spaces might preserve the convergence property if the agent-state discretization is refined appropriately.

Load-bearing premise

That policies using only a small fixed number of agent states can achieve performance close to optimal on the applications of interest, even if the optimal policy requires unbounded memory.

What would settle it

On a small Dec-POMDP where the best possible agent-state policy can be found by exhaustive search, the algorithm's converged value falls more than a few percent below that benchmark value.

Figures

Figures reproduced from arXiv: 2604.09495 by Aditya Mahajan, Amit Sinha, Matthieu Geist.

**Figure 2.** Figure 2: Cooperative Box Pushing (Image take from Seuken & Zilberstein (2012) In this Dec-POMDP, there are two agents, one in the left corner and the other in the right corner. They are initially facing the center of the arena (indicated by the red lines in [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 1.** Figure 1: Scaling of memory and runtime with horizon T and agent-state size Z across three benchmarks: Dec-Tiger, Cooperative Box Pushing, and Mars Rovers. Rows correspond to different metrics, columns correspond to Dec-POMDPs. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_1.png] view at source ↗

read the original abstract

Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a convergent polynomial-time algorithm for limited-memory agent-state policies in Dec-POMDPs by mixing conservative policy iteration with a risk-seeking objective tweak.

read the letter

The main point is a new synthesis for Dec-POMDPs: restrict policies to a small fixed set of agent states, then run iterated best-response updates that are conservative and augmented with a risk-seeking term. This is claimed to deliver monotonic improvement and convergence to a local optimum in time polynomial in the model size, while still matching state-of-the-art performance on benchmarks. The restriction to agent states is acknowledged as not containing the global optimum, but the paper shows that adding more states improves results, which is a useful practical observation. The empirical side reports competitive numbers without exponential blow-up, which is the real selling point for anyone who needs to deploy these policies under compute limits. The risk-seeking addition is the part that stands out as novel relative to plain conservative policy iteration in this setting. That said, the stress-test concern lands: if the monotonicity guarantee is proved only for the modified objective, it is not automatic that each step improves (or even preserves) the original expected return. The abstract gives no proof sketch, so the full paper must show explicitly that the risk-seeking term does not undermine the original-value improvement property; otherwise the convergence claim applies to a different problem than the one the experiments evaluate. The assumption that a small memory budget suffices is stated plainly, which is honest, but it remains an approximation whose quality depends on the domain. This work is aimed at researchers who already work on Dec-POMDPs and want scalable methods with some theory attached. It is coherent on its own terms and deserves a serious referee to check the proof details around the risk-seeking modification and the exact empirical setup. I would send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper presents an iterated best-response algorithm for Dec-POMDPs that restricts policies to a fixed number of agent states (limited memory). It claims that a conservative policy iteration update guarantees monotonic improvement and convergence to a local optimum in time polynomial in the Dec-POMDP description size. A risk-seeking term is added to the objective to reach better local optima. Empirical results on several (unspecified) benchmark Dec-POMDPs are reported to match state-of-the-art performance while remaining polynomial-time, with further gains from increasing the number of agent states.

Significance. A sound polynomial-time algorithm with monotonicity guarantees for bounded-memory policies in Dec-POMDPs would be a useful practical contribution, given the NEXP-completeness of the unrestricted problem. The combination of conservative updates with a risk-seeking modification to escape poor local optima is a reasonable heuristic idea. The manuscript does not, however, supply machine-checked proofs, reproducible code, or parameter-free derivations that would strengthen the assessment.

major comments (2)

[Abstract] Abstract: the claim of 'monotonic improvements' is stated with respect to the standard Dec-POMDP objective, yet the algorithm optimizes a modified objective that includes an explicit risk-seeking term. It is therefore unclear whether each conservative update is guaranteed to improve (or at least not degrade) the true expected return; if monotonicity holds only for the augmented objective, the convergence guarantee does not directly support the performance claims made for the original objective.
[Algorithm and Theoretical Analysis] Algorithm description and convergence argument: no derivation sketch, proof outline, or explicit statement of the precise objective being improved is supplied. Because the risk-seeking modification is load-bearing for the 'better local optimum' claim, the interaction between the conservative update and the added term must be shown not to violate improvement on the original value function.

minor comments (2)

[Experiments] Empirical section: benchmark environments are not named, and reported results lack error bars, number of runs, or statistical significance tests.
[Preliminaries] Notation: the precise definition of the agent-state policy class and the form of the risk-seeking augmentation should be stated formally before the algorithm is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback on the clarity of our claims and theoretical presentation. We address each major comment below and have made revisions to improve the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'monotonic improvements' is stated with respect to the standard Dec-POMDP objective, yet the algorithm optimizes a modified objective that includes an explicit risk-seeking term. It is therefore unclear whether each conservative update is guaranteed to improve (or at least not degrade) the true expected return; if monotonicity holds only for the augmented objective, the convergence guarantee does not directly support the performance claims made for the original objective.

Authors: We appreciate this observation on the potential ambiguity. The monotonic improvement and polynomial-time convergence to a local optimum are guaranteed with respect to the modified objective that incorporates the risk-seeking term. This term is added to bias the search toward policies that can escape poor local optima in the agent-state policy space. In the revised manuscript, we have updated the abstract to explicitly state that the guarantees apply to the augmented objective. We have also added a paragraph in the introduction and theoretical analysis section clarifying that while per-step improvement is not guaranteed on the original expected return, the overall procedure yields competitive or superior performance on the standard objective, as validated empirically on the benchmarks. revision: yes
Referee: [Algorithm and Theoretical Analysis] Algorithm description and convergence argument: no derivation sketch, proof outline, or explicit statement of the precise objective being improved is supplied. Because the risk-seeking modification is load-bearing for the 'better local optimum' claim, the interaction between the conservative update and the added term must be shown not to violate improvement on the original value function.

Authors: We agree that an explicit statement of the objective and a derivation sketch would strengthen the presentation. The conservative update is applied to the modified objective (original value plus risk-seeking bonus), preserving monotonicity on that objective by construction, similar to standard conservative policy iteration. The risk-seeking term is formulated so that it does not interfere with the improvement property on the modified objective. In the revised version, we have inserted a proof outline (with key steps) in the main text and a full derivation in the appendix that states the precise objective and demonstrates the monotonicity property. We note that the update is not guaranteed to improve the original value function at every iteration (only the modified one), but the risk-seeking component is intended to reach better local optima overall; this is supported by our experimental results rather than a theoretical guarantee on the original objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an iterated best-response algorithm based on conservative policy iteration over a restricted agent-state policy class for Dec-POMDPs. The claimed monotonic improvements, convergence to a local optimum, and polynomial runtime are tied directly to the structure of the algorithm and the finite model size, without any reduction by construction to fitted parameters, self-definitions, or unverified self-citations. The risk-seeking modification is introduced as an auxiliary objective term to improve the quality of the attained local optimum; the core guarantee statements do not collapse into equivalence with this modification or with any input data. No load-bearing uniqueness theorems, ansatz smuggling, or renaming of known results are present. The derivation chain remains self-contained against the Dec-POMDP formalism.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unshown proof of monotonicity and the assumption that the risk-seeking objective improves local optima without introducing new instability.

pith-pipeline@v0.9.0 · 5517 in / 1086 out tokens · 28091 ms · 2026-05-10T16:13:55.646421+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Since it is linear in the value function V , it satisfies concavity and convexity

Classical MDP risk neutral: This is the standard risk-neutral reward function used in MDPs. Since it is linear in the value function V , it satisfies concavity and convexity. Since it also satisfies positive homogeneity, it is coherent as well. R(V|s, a) :=E P [V|s, a] = X s′ P(s ′ |s, a)V(s ′)

work page
[2]

The parameter λ∈R controls the risk-preference of R

Entropic map: This is based on the logsumexp function. The parameter λ∈R controls the risk-preference of R. If λ >0 , then R is convex everywhere and therefore risk-seeking everywhere. If λ <0 , R is concave everywhere and therefore risk-averse everywhere. It can also be shown that the limit as λ→0 is the same as the classical MDP risk neutral reward. R(V...

work page
[3]

Next”: which controls how partial policies are expanded and also returns their heuristic values; and “Select

Mean-semideviation trade-off: This involves modifying risk based on deviation from the expected values of rewards. λ∈Rworks similar to the entropic map. The notationx + denotesmax(x,0)andk≥1. R(V|s, a) :=E P [V|s, a] +λE P (V−E P [V|s, a]) k + |s, a 1 k . C. Related work C.1. Monotonic improvement dynamic programs / policy iteration schemes for POMDPs A s...

work page 2025
[4]

A2 seesEmptyand choosesForward

A1 seesEmptyand choosesForward. A2 seesEmptyand choosesForward

work page
[5]

A2 seesOtherAand choosesTurnR

A1 seesOtherAand choosesTurnL. A2 seesOtherAand choosesTurnR. (Both agents are now facing theLargeB) 15 Multi-agent RS-CPI

work page
[6]

A2 seesLargeBand choosesForward

A1 seesLargeBand choosesForward. A2 seesLargeBand choosesForward. (Both agents move forward and get large reward for pushing theLargeB)

work page
[7]

The environment then gets reset back to the same initial state in this step, the actions do not matter in a reset step

work page
[8]

It might also be instructive to observe what happens in case an action fails

It is interesting to note that same cycle from steps 1-3 continue in a periodic fashion. It might also be instructive to observe what happens in case an action fails. For example let’s assume that when both agents do Forward , then A1 fails. We mark the actions in blue. Then:

work page
[9]

A2 seesEmptyand chooses Forward (success)

A1 seesEmptyand chooses Forward (but fails). A2 seesEmptyand chooses Forward (success)

work page
[10]

A2 seesEmptyand chooses Forward

A1 seesEmptyand chooses TurnL. A2 seesEmptyand chooses Forward

work page
[11]

A2 seesOtherAand chooses TurnR

A1 seesSmallBand chooses Forward. A2 seesOtherAand chooses TurnR. (A small reward is received for pushing the small box successfully)

work page
[12]

This step is the rest step again

work page
[13]

In this manner many different contingencies can be worked out, where the resulting actions are just the outcome of a process that overall has a tendency to maximize reward

And then we get back into the cycle! So even when the best outcome doesn’t work out, there is some contingency for the agents to finish the episode and restart it to try again. In this manner many different contingencies can be worked out, where the resulting actions are just the outcome of a process that overall has a tendency to maximize reward. E.2. Co...

work page

[1] [1]

Since it is linear in the value function V , it satisfies concavity and convexity

Classical MDP risk neutral: This is the standard risk-neutral reward function used in MDPs. Since it is linear in the value function V , it satisfies concavity and convexity. Since it also satisfies positive homogeneity, it is coherent as well. R(V|s, a) :=E P [V|s, a] = X s′ P(s ′ |s, a)V(s ′)

work page

[2] [2]

The parameter λ∈R controls the risk-preference of R

Entropic map: This is based on the logsumexp function. The parameter λ∈R controls the risk-preference of R. If λ >0 , then R is convex everywhere and therefore risk-seeking everywhere. If λ <0 , R is concave everywhere and therefore risk-averse everywhere. It can also be shown that the limit as λ→0 is the same as the classical MDP risk neutral reward. R(V...

work page

[3] [3]

Next”: which controls how partial policies are expanded and also returns their heuristic values; and “Select

Mean-semideviation trade-off: This involves modifying risk based on deviation from the expected values of rewards. λ∈Rworks similar to the entropic map. The notationx + denotesmax(x,0)andk≥1. R(V|s, a) :=E P [V|s, a] +λE P (V−E P [V|s, a]) k + |s, a 1 k . C. Related work C.1. Monotonic improvement dynamic programs / policy iteration schemes for POMDPs A s...

work page 2025

[4] [4]

A2 seesEmptyand choosesForward

A1 seesEmptyand choosesForward. A2 seesEmptyand choosesForward

work page

[5] [5]

A2 seesOtherAand choosesTurnR

A1 seesOtherAand choosesTurnL. A2 seesOtherAand choosesTurnR. (Both agents are now facing theLargeB) 15 Multi-agent RS-CPI

work page

[6] [6]

A2 seesLargeBand choosesForward

A1 seesLargeBand choosesForward. A2 seesLargeBand choosesForward. (Both agents move forward and get large reward for pushing theLargeB)

work page

[7] [7]

The environment then gets reset back to the same initial state in this step, the actions do not matter in a reset step

work page

[8] [8]

It might also be instructive to observe what happens in case an action fails

It is interesting to note that same cycle from steps 1-3 continue in a periodic fashion. It might also be instructive to observe what happens in case an action fails. For example let’s assume that when both agents do Forward , then A1 fails. We mark the actions in blue. Then:

work page

[9] [9]

A2 seesEmptyand chooses Forward (success)

A1 seesEmptyand chooses Forward (but fails). A2 seesEmptyand chooses Forward (success)

work page

[10] [10]

A2 seesEmptyand chooses Forward

A1 seesEmptyand chooses TurnL. A2 seesEmptyand chooses Forward

work page

[11] [11]

A2 seesOtherAand chooses TurnR

A1 seesSmallBand chooses Forward. A2 seesOtherAand chooses TurnR. (A small reward is received for pushing the small box successfully)

work page

[12] [12]

This step is the rest step again

work page

[13] [13]

In this manner many different contingencies can be worked out, where the resulting actions are just the outcome of a process that overall has a tendency to maximize reward

And then we get back into the cycle! So even when the best outcome doesn’t work out, there is some contingency for the agents to finish the episode and restart it to try again. In this manner many different contingencies can be worked out, where the resulting actions are just the outcome of a process that overall has a tendency to maximize reward. E.2. Co...

work page