Approximations and Learning for Decentralized Stochastic Control and Near Optimal Finite Window Policies

Omar Mrani-Zentar; Serdar Yuksel

arxiv: 2605.00160 · v1 · submitted 2026-04-30 · 🧮 math.OC · math.PR

Approximations and Learning for Decentralized Stochastic Control and Near Optimal Finite Window Policies

Omar Mrani-Zentar , Serdar Yuksel This is my paper

Pith reviewed 2026-05-09 20:06 UTC · model grok-4.3

classification 🧮 math.OC math.PR

keywords decentralized stochastic controlfinite window policiesinformation sharing patternspredictor stabilityQ-learningnear optimalityapproximation boundsstochastic control

0 comments

The pith

Finite sliding windows of common information yield near-optimal policies for decentralized stochastic control under a stability condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses decentralized stochastic control problems with general Borel spaces under one-step delayed and K-step periodic information sharing. It demonstrates that policies using only a finite sliding window of the common information can achieve performance arbitrarily close to the optimal centralized solution. The key to this approximation is a predictor stability condition measured in expected total variation. Adapted Q-learning algorithms with quantization or finite memory are shown to converge to these near-optimal policies. This supplies the first explicit conditions and rigorous bounds for approximations and learning in such general decentralized settings.

Core claim

For the OSDISP and KSPISP problems, finite-window policies obtained by truncating the common information to a sliding window are near-optimal provided a predictor stability condition in expected total variation holds. The performance difference can be bounded and made small. Q-learning algorithms in quantized or finite-memory forms converge asymptotically to near-optimal solutions for these problems with general state and action spaces.

What carries the argument

Finite sliding window replacement for full common information, whose effectiveness is ensured by the predictor stability condition in expected total variation.

If this is right

The approximation error for finite-window policies can be bounded using the stability condition.
Q-learning converges to the near-optimal finite-window policies.
These results provide explicit conditions for learning in decentralized control with general spaces.
Near-optimality holds for both KSPISP and OSDISP information structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practical decentralized systems could use limited history to reduce data transmission while maintaining good performance.
The stability condition offers a testable criterion for when finite memory suffices in specific applications.
This framework may extend to other information patterns or multi-agent reinforcement learning problems.

Load-bearing premise

A predictor stability condition in expected total variation exists so that distant past information has negligible effect on future predictions.

What would settle it

Construct a counterexample decentralized control problem with the given information structures where the predictor stability condition fails and the cost of finite-window policies does not approach the optimal cost.

Figures

Figures reproduced from arXiv: 2605.00160 by Omar Mrani-Zentar, Serdar Yuksel.

**Figure 3.** Figure 3: Convergence of the quantized Q-learning algorithm for [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 2.** Figure 2: Convergence of the quantized Q-learning algorithm for [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

read the original abstract

Decentralized stochastic control problems are difficult to study due to information structure dependent subtleties, which prevent many classical methods in stochastic control from being applicable. In this paper we consider such problems with general standard Borel spaces under two related information structures. (a) the one-step delayed information sharing pattern (OSDISP) where agents share their information with one-step delay, and (b) the $K$-step periodic information sharing pattern (KSPISP), where information is shared periodically. It is known that OSDISP and KSPISP problems admit a centralized reduction where the agents view the problem from the perspective of a centralized controller that uses the common information to prescribe function valued actions (local policies) which map each agent's private information to an optimal action in the original problem. We provide rigorous approximation results and performance bounds for the KSPISP and OSDISP problems, which results from replacing the full common information by a finite sliding window of information and we establish near optimality of such policies. The latter depends on a predictor stability condition in expected total variation. As a further contribution, we show that under the information structures provided, corresponding Q-learning algorithms (in quantized or finite memory forms) converge asymptotically to near optimal solutions. While restrictive and hypothetical conditions have been presented in the literature, our contributions are thus to provide, to our knowledge, the first explicit conditions and rigorous approximation and learning results for such decentralized problems with general spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first explicit finite-window approximation bounds and Q-learning convergence for decentralized stochastic control under delayed sharing in general Borel spaces, but the results stand or fall on an unverified predictor stability condition.

read the letter

The core advance here is replacing full common information with a finite sliding window in the OSDISP and KSPISP settings, then proving performance bounds and near-optimality for the resulting policies in general spaces. They also show that quantized or finite-memory Q-learning versions converge to those near-optimal solutions. That moves the literature past the restrictive hypothetical conditions that earlier papers left hanging.

Referee Report

2 major / 1 minor

Summary. The manuscript considers decentralized stochastic control problems on general Borel spaces under the one-step delayed information sharing pattern (OSDISP) and K-step periodic information sharing pattern (KSPISP). It establishes a centralized reduction via common information, then derives approximation results and performance bounds by replacing the full common information with a finite sliding window of information. Near-optimality of the resulting policies is claimed under a predictor stability condition in expected total variation. The paper further asserts that corresponding Q-learning algorithms (in quantized or finite-memory forms) converge asymptotically to near-optimal solutions, providing what it describes as the first explicit conditions for such problems.

Significance. If the predictor stability condition can be made explicit, verified under verifiable assumptions on the kernels, and shown to be preserved under the quantization and finite-memory approximations, the results would constitute a meaningful advance by supplying rigorous approximation bounds and learning guarantees for decentralized problems with general spaces, moving beyond the restrictive hypothetical conditions noted in prior literature.

major comments (2)

[Abstract] Abstract: The near-optimality of finite-window policies and the asymptotic convergence of the Q-learning algorithms both rest on the existence of a 'predictor stability condition in expected total variation,' yet the abstract provides neither an explicit definition of this condition (including constants), nor the assumptions on transition kernels and observation channels under which it holds for general Borel spaces, nor verification that it is preserved under the quantization/finite-memory approximations used for learning. This condition is load-bearing for the central claims.
[Abstract] Abstract: The manuscript asserts 'rigorous approximation results and performance bounds' and 'asymptotic convergence' but supplies no proof sketches, error bounds, or stability verification in the provided summary; without these, the support for the reduction from OSDISP/KSPISP to the sliding-window centralized problem cannot be assessed.

minor comments (1)

The acronyms OSDISP and KSPISP are introduced without an accompanying diagram or table clarifying the timing of information sharing; this would improve readability for readers unfamiliar with the patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and will revise the manuscript to improve clarity in the abstract while preserving the technical content already present in the body of the paper.

read point-by-point responses

Referee: [Abstract] Abstract: The near-optimality of finite-window policies and the asymptotic convergence of the Q-learning algorithms both rest on the existence of a 'predictor stability condition in expected total variation,' yet the abstract provides neither an explicit definition of this condition (including constants), nor the assumptions on transition kernels and observation channels under which it holds for general Borel spaces, nor verification that it is preserved under the quantization/finite-memory approximations used for learning. This condition is load-bearing for the central claims.

Authors: We agree that the abstract, as a concise overview, omits the explicit definition, constants, and assumptions. These are provided in the full manuscript: Definition 3.1 defines the predictor stability condition in expected total variation as the existence of C > 0 and ρ ∈ (0,1) such that the expected TV distance between successive predictors decays at rate ρ^t. Assumption 3.1 imposes uniform continuity of the transition kernels and observation channels in total variation, which is verifiable for standard models (e.g., finite or Lipschitz dynamics on Borel spaces). Preservation under quantization and finite-memory approximations is established in Lemma 4.3 and Proposition 4.5. We will revise the abstract to include a brief statement of the condition, the constants, and a reference to the assumptions and their preservation. revision: yes
Referee: [Abstract] Abstract: The manuscript asserts 'rigorous approximation results and performance bounds' and 'asymptotic convergence' but supplies no proof sketches, error bounds, or stability verification in the provided summary; without these, the support for the reduction from OSDISP/KSPISP to the sliding-window centralized problem cannot be assessed.

Authors: The abstract is a high-level summary and does not include proof sketches or explicit bounds, consistent with standard practice. The full paper contains the details: the centralized reduction via common information is recalled in Section 2. Explicit approximation bounds and near-optimality (with error O(ρ^W) for window length W) appear in Theorems 4.1–4.2 (OSDISP) and 4.4 (KSPISP). Q-learning convergence to near-optimal policies is proven in Theorem 5.3 via contraction arguments under the stability condition. Stability preservation for the quantized/finite-memory versions is verified in Section 4.3. We will add a sentence to the abstract referencing these theorems and the explicit nature of the bounds. revision: partial

Circularity Check

0 steps flagged

No circularity; results conditional on explicitly introduced stability assumption with independent content

full rationale

The paper cites the centralized reduction for OSDISP/KSPISP as known from prior literature and then derives new approximation bounds and Q-learning convergence results by replacing common information with a finite sliding window, with all performance guarantees explicitly conditioned on a predictor stability condition in expected total variation. This condition is presented as an assumption under which the near-optimality holds, rather than being defined in terms of the approximation results or fitted from them. No equation reduces a claimed prediction or bound back to an input parameter by construction, no ansatz is smuggled via self-citation, and the stability hypothesis is not shown to be equivalent to the target claims. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the predictor stability condition in expected total variation being satisfied; this is postulated rather than derived from more primitive assumptions.

axioms (1)

domain assumption Predictor stability condition in expected total variation
Invoked to guarantee that finite-window policies are near-optimal and that Q-learning converges.

pith-pipeline@v0.9.0 · 5560 in / 1275 out tokens · 38559 ms · 2026-05-09T20:06:25.134568+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1]

On the role of information structure in rein- forcement learning for partially-observable sequential teams and games

A.Altabaa and Z.Yang. On the role of information structure in rein- forcement learning for partially-observable sequential teams and games. arxiv preprint: arXiv:2403.00993, 2024

work page arXiv 2024
[2]

E. J. Balder. On ws-convergence of product measures.Mathematics of Operations Research, 26(3):494–518, 2001

work page 2001
[3]

D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of markov decision processes. Mathematics of operations research, 27(4):819–840, 2002

work page 2002
[4]

Bic ¸er, A.D

O. Bic ¸er, A.D. Kara, and S. Y ¨uksel. Quantizer design for finite model approximations, model learning, and quantized q-learning for mdps with unbounded spaces.arXiv preprint arXiv:2510.04355, 2025

work page arXiv 2025
[5]

Billingsley.Probability and Measure

P. Billingsley.Probability and Measure. Wiley, 3rd edition, 1995

work page 1995
[6]

V . S. Borkar. On extremal solutions to stochastic control problems.Appl. Math. Optim., 24(1):317–330, 1991

work page 1991
[7]

V . S. Borkar. White-noise representations in stochastic realization theory. SIAM J. on Control and Optimization, 31:1093–1102, 1993

work page 1993
[8]

Cayci, N

S. Cayci, N. He, and R. Srikant. Finite-time analysis of natural actor- critic for pomdps.SIAM Journal on Mathematics of Data Science, 6(4):869–896, 2024

work page 2024
[9]

Robustness to incorrect priors and controlled filter stability in partially observed stochastic control.SIAM J

C.McDonald and S.Y ¨uksel. Robustness to incorrect priors and controlled filter stability in partially observed stochastic control.SIAM J. on Control and Optimization, 2022

work page 2022
[10]

Cregg, T

L. Cregg, T. Linder, and S. Y ¨uksel. Reinforcement learning for near-optimal design of zero-delay codes for markov sources.IEEE Transactions on Information Theory, arXiv:2311.12609, 2024

work page arXiv 2024
[11]

Demirci, A.D

Y .E. Demirci, A.D. Kara, and S. Y ¨uksel. Refined bounds on near optimality finite window policies in pomdps and their reinforcement learning.arXiv, 2024

work page 2024
[12]

Dobrushin

R.L. Dobrushin. Central limit theorem for nonstationary Markov chains. i.Theory of Probability & Its Applications, 1(1):65–80, 1956

work page 1956
[13]

R. M. Dudley.Real Analysis and Probability. Cambridge University Press, Cambridge, 2nd edition, 2002

work page 2002
[14]

E. B. Dynkin and A. A. Yushkevich.Controlled Markov Processes, volume 235. Springer, Berlin, 1979

work page 1979
[15]

Learning rates for q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

E.Even-Dar and Y .Mansour. Learning rates for q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

work page 2003
[16]

Le Gland and N

F. Le Gland and N. Oudjane. Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters.The Annals of Applied Probability, 14(1):144–187, 2004

work page 2004
[17]

Golowich, A

N. Golowich, A. Moitra, and D. Rohatgi. Planning in observable pomdps in quasipolynomial time.arXiv preprint arXiv:2201.04735, 2022

work page arXiv 2022
[18]

Hern ´andez-Lerma and J

O. Hern ´andez-Lerma and J. B. Lasserre.Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996

work page 1996
[19]

Simplified action decoder for deep multi-agent reinforcement learning.arXiv preprint arXiv:1912.02288, 2019

H.Hu and J.N.Foerster. Simplified action decoder for deep multi-agent reinforcement learning.arXiv preprint arXiv:1912.02288, 2019

work page arXiv 1912
[20]

A unified approach to dynamic decision problems with asymmetric information: Non-strategic agents.IEEE Transactions on Automatic Control, 67(3):1105–1119, 2021

H.Tavafoghi, Y .Ouyang, and D.Teneketzis. A unified approach to dynamic decision problems with asymmetric information: Non-strategic agents.IEEE Transactions on Automatic Control, 67(3):1105–1119, 2021

work page 2021
[21]

J.K.Gupta, M.Egorov, and M.Kochenderfer. Cooperative multi-agent control using deep reinforcement learning.Proccedings of the sixteenth International conference on Autonomous Agents and Multiagent Sys- tems, pages 66–83, 2017

work page 2017
[22]

Saldi, and S

A.D Kara, N. Saldi, and S. Y ¨uksel. Q-learning for MDPs with general spaces: Convergence and near optimality via quantization under weak continuity.Journal of Machine Learning Research, pages 1–34, 2023

work page 2023
[23]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Convergence and near optimality of q-learning with finite memory for partially observed models. In2021 60th IEEE Conference on Decision and Control (CDC), pages 1603–1608. IEEE, 2021

work page 2021
[24]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Near optimality of finite memory feedback policies in partially observed markov decision processes.Journal of Machine Learning Research, 23(11):1–46, 2022

work page 2022
[25]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. Mathematics of Operations Research, 48(4):2066–2093, 2023

work page 2066
[26]

Kara and S

A.D. Kara and S. Y ¨uksel. Q-learning for stochastic control under general information structures and non-markovian environments.Transactions on Machine Learning Research, 2024. Featured Certification

work page 2024
[27]

Multi-agent reinforcement learning as a rehearsal for decentralized planning.Neurocomputing, 190:82–94, 2016

Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning.Neurocomputing, 190:82–94, 2016

work page 2016
[28]

Kurtaran

B. Kurtaran. Corrections and extensions to decentralized control with delayed sharing information pattern.IEEE Transactions on Automatic Control, 24:656–657, August 1979

work page 1979
[29]

Stick-breaking policy learning in dec-pomdps.arXiv preprint arXiv:1505.00274, 2015

Miao Liu, Christopher Amato, Xuejun Liao, Lawrence Carin, and Jonathan P How. Stick-breaking policy learning in dec-pomdps.arXiv preprint arXiv:1505.00274, 2015

work page arXiv 2015
[30]

Principled learning-to- communicate in cooperative marl: An information-structure perspective

Xiangyu Liu, Haoyi You, and Kaiqing Zhang. Principled learning-to- communicate in cooperative marl: An information-structure perspective. InCoCoMARL 2025 Workshop, June 2025. Poster at CoCoMARL 2025

work page 2025
[31]

Partially observable multi-agent reinforcement learning with information sharing,

Xiangyu Liu and Kaiqing Zhang. Partially observable multi-agent reinforcement learning with information sharing.arXiv preprint arXiv:2308.08705, 2023

work page arXiv 2023
[32]

W. Mao, K. Zhang, Z. Yang, and T. Bas ¸ar. Decentralized learn- ing of finite-memory policies in dec-pomdps.IFAC-PapersOnLine, 56(2):2601–2607, 2023

work page 2023
[33]

Di Masi and L

G. Di Masi and L. Stettner. Ergodicity of hidden markov models. Mathematics of Control, Signals and Systems, 17(4):269–296, 2005

work page 2005
[34]

McDonald and S

C. McDonald and S. Y ¨uksel. Exponential filter stability via Dobrushin’s coefficient.Electronic Communications in Probability, 25, 2020

work page 2020
[35]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. Optimal control strategies in delayed sharing information structures.IEEE Transactions on Automatic Control, 56:1606–1620, 2011

work page 2011
[36]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. The common-information approach to decentralized stochastic control. In Information and Control in Networks, Editors: G. Como, B. Bernhardsson, A. Rantzer. Springer, 2013

work page 2013
[37]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58:1644–1658, 2013

work page 2013
[38]

Asymptotic optimality of finite model approximations for partially observed markov decision processes with discounted cost.IEEE transactions on automatic control, 2020

N.Saldi, S.Y ¨uksel, and Linder T. Asymptotic optimality of finite model approximations for partially observed markov decision processes with discounted cost.IEEE transactions on automatic control, 2020

work page 2020
[39]

Centralized reduction of decen- tralized stochastic control models and their weak-feller regularity

O.Mrani-Zentar and S.Y ¨uksel. Centralized reduction of decen- tralized stochastic control models and their weak-feller regularity. arXiv:2408.13828, 2024

work page arXiv 2024
[40]

J. Ooi, S. Verbout, J. Ludwig, and G. Wornell. A separation theorem for periodic sharing information patterns in decentralized control.IEEE Transactions on Automatic Control, 42:1546–1550, November 1997

work page 1997
[41]

An improved grid-based approximation algo- rithm for pomdps.Proceeding of the 17th international joint conference on Artificial intelligence, 2001

R.Zhou and E.A.Hansen. An improved grid-based approximation algo- rithm for pomdps.Proceeding of the 17th international joint conference on Artificial intelligence, 2001

work page 2001
[42]

N. Saldi. Common information approach for static team problems with polish spaces and existence of optimal policies.arXiv preprint arXiv:2309.07571, 2023

work page arXiv 2023
[43]

Saldi, T

N. Saldi, T. Linder, and S. Y ¨uksel.Finite Approximations in Discrete- Time Stochastic Control: Quantized Models and Asymptotic Optimality. Springer, Cham, 2018

work page 2018
[44]

M. Sch ¨al. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal.Z. Wahrscheinlichkeitsth, 32:179–296, 1975

work page 1975
[45]

R. Serfozo. Convergence of Lebesgue integrals with varying measures. Sankhy¯a: The Indian Journal of Statistics, Series A, pages 380–402, 1982

work page 1982
[46]

Subramanian, A

J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems.The Journal of Machine Learning Research, 23(1):483–565, 2022

work page 2022
[47]

Varaiya and J

P. Varaiya and J. Walrand. On delayed-sharing patterns.IEEE Transactions on Automatic Control, 23:443–445, June 1978

work page 1978
[48]

V .Subramanian and H.Kao. Common information based approximate state representations in multi-agent reinforcement learning.Proccedings of the 25 th International Conference on Artificial Intelligence and Statistics, 2022

work page 2022
[49]

C. J. C. H. Watkins and P. Dayan. Q-learning.Machine Learning, 8:279–292, 1992. 16

work page 1992
[50]

H. S. Witsenhausen. A counterexample in stochastic optimal control. SIAM J. Contr., 6:131–147, 1968

work page 1968
[51]

H. S. Witsenhausen. A standard form for sequential stochastic control. Mathematical Systems Theory, 7:5–11, 1973

work page 1973
[52]

H. S. Witsenhausen. The intrinsic model for discrete stochastic control: Some open problems.Lecture Notes in Econ. and Math. Syst., Springer- Verlag, 107:322–335, 1975

work page 1975
[53]

H. S. Witsenhausen. On the structure of real-time source coders.Bell Syst. Tech. J, 58:1437–1451, July/August 1979

work page 1979
[54]

Monte-carlo expectation maximization for decentralized pomdps.Proccedings of the International joint conference on artificial intelligence, 2013

Feng Wu, Shlomo Zilberstein, and Nicholas R Jennings. Monte-carlo expectation maximization for decentralized pomdps.Proccedings of the International joint conference on artificial intelligence, 2013

work page 2013
[55]

Yoshikawa

T. Yoshikawa. Dynamic programming approach to decentralized control problems.IEEE Transactions on Automatic Control, 20:796–797, 1975

work page 1975
[56]

Y ¨uksel

S. Y ¨uksel. Stochastic nestedness and the belief sharing information pattern.IEEE Transactions on Automatic Control, 54:2773–2786, December 2009

work page 2009
[57]

Y ¨uksel

S. Y ¨uksel. A universal dynamic program and refined existence results for decentralized stochastic control.SIAM Journal on Control and Optimization, 58(5):2711–2739, 2020

work page 2020
[58]

Y ¨uksel

S. Y ¨uksel. On Borkar and Young relaxed control topologies and continuous dependence of invariant measures on control policy.SIAM Journal on Control and Optimization, 62(4):2367–2386, 2024

work page 2024
[59]

Y ¨uksel and T

S. Y ¨uksel and T. Bas ¸ar.Stochastic Teams, Games, and Control under Information Constraints. Springer, Cham, 2024

work page 2024
[60]

Robustness to model approximation, learning, and sample complexity in wasserstein regular mdps.arXiv preprint arXiv:2410.14116, 2024

Yichen Zhou, Yanglei Song, and Serdar Y ¨uksel. Robustness to model approximation, learning, and sample complexity in wasserstein regular mdps.arXiv preprint arXiv:2410.14116, 2024. VII. APPENDIX A. Proof of Proposition III.7 Consider a sequence (πn q−M , IM q,n, aq,n)→(π q−M , IM q , aq) asn→ ∞and letfbe a continuous and bounded function. Then, it is suf...

work page arXiv 2024

[1] [1]

On the role of information structure in rein- forcement learning for partially-observable sequential teams and games

A.Altabaa and Z.Yang. On the role of information structure in rein- forcement learning for partially-observable sequential teams and games. arxiv preprint: arXiv:2403.00993, 2024

work page arXiv 2024

[2] [2]

E. J. Balder. On ws-convergence of product measures.Mathematics of Operations Research, 26(3):494–518, 2001

work page 2001

[3] [3]

D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of markov decision processes. Mathematics of operations research, 27(4):819–840, 2002

work page 2002

[4] [4]

Bic ¸er, A.D

O. Bic ¸er, A.D. Kara, and S. Y ¨uksel. Quantizer design for finite model approximations, model learning, and quantized q-learning for mdps with unbounded spaces.arXiv preprint arXiv:2510.04355, 2025

work page arXiv 2025

[5] [5]

Billingsley.Probability and Measure

P. Billingsley.Probability and Measure. Wiley, 3rd edition, 1995

work page 1995

[6] [6]

V . S. Borkar. On extremal solutions to stochastic control problems.Appl. Math. Optim., 24(1):317–330, 1991

work page 1991

[7] [7]

V . S. Borkar. White-noise representations in stochastic realization theory. SIAM J. on Control and Optimization, 31:1093–1102, 1993

work page 1993

[8] [8]

Cayci, N

S. Cayci, N. He, and R. Srikant. Finite-time analysis of natural actor- critic for pomdps.SIAM Journal on Mathematics of Data Science, 6(4):869–896, 2024

work page 2024

[9] [9]

Robustness to incorrect priors and controlled filter stability in partially observed stochastic control.SIAM J

C.McDonald and S.Y ¨uksel. Robustness to incorrect priors and controlled filter stability in partially observed stochastic control.SIAM J. on Control and Optimization, 2022

work page 2022

[10] [10]

Cregg, T

L. Cregg, T. Linder, and S. Y ¨uksel. Reinforcement learning for near-optimal design of zero-delay codes for markov sources.IEEE Transactions on Information Theory, arXiv:2311.12609, 2024

work page arXiv 2024

[11] [11]

Demirci, A.D

Y .E. Demirci, A.D. Kara, and S. Y ¨uksel. Refined bounds on near optimality finite window policies in pomdps and their reinforcement learning.arXiv, 2024

work page 2024

[12] [12]

Dobrushin

R.L. Dobrushin. Central limit theorem for nonstationary Markov chains. i.Theory of Probability & Its Applications, 1(1):65–80, 1956

work page 1956

[13] [13]

R. M. Dudley.Real Analysis and Probability. Cambridge University Press, Cambridge, 2nd edition, 2002

work page 2002

[14] [14]

E. B. Dynkin and A. A. Yushkevich.Controlled Markov Processes, volume 235. Springer, Berlin, 1979

work page 1979

[15] [15]

Learning rates for q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

E.Even-Dar and Y .Mansour. Learning rates for q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

work page 2003

[16] [16]

Le Gland and N

F. Le Gland and N. Oudjane. Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters.The Annals of Applied Probability, 14(1):144–187, 2004

work page 2004

[17] [17]

Golowich, A

N. Golowich, A. Moitra, and D. Rohatgi. Planning in observable pomdps in quasipolynomial time.arXiv preprint arXiv:2201.04735, 2022

work page arXiv 2022

[18] [18]

Hern ´andez-Lerma and J

O. Hern ´andez-Lerma and J. B. Lasserre.Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996

work page 1996

[19] [19]

Simplified action decoder for deep multi-agent reinforcement learning.arXiv preprint arXiv:1912.02288, 2019

H.Hu and J.N.Foerster. Simplified action decoder for deep multi-agent reinforcement learning.arXiv preprint arXiv:1912.02288, 2019

work page arXiv 1912

[20] [20]

A unified approach to dynamic decision problems with asymmetric information: Non-strategic agents.IEEE Transactions on Automatic Control, 67(3):1105–1119, 2021

H.Tavafoghi, Y .Ouyang, and D.Teneketzis. A unified approach to dynamic decision problems with asymmetric information: Non-strategic agents.IEEE Transactions on Automatic Control, 67(3):1105–1119, 2021

work page 2021

[21] [21]

J.K.Gupta, M.Egorov, and M.Kochenderfer. Cooperative multi-agent control using deep reinforcement learning.Proccedings of the sixteenth International conference on Autonomous Agents and Multiagent Sys- tems, pages 66–83, 2017

work page 2017

[22] [22]

Saldi, and S

A.D Kara, N. Saldi, and S. Y ¨uksel. Q-learning for MDPs with general spaces: Convergence and near optimality via quantization under weak continuity.Journal of Machine Learning Research, pages 1–34, 2023

work page 2023

[23] [23]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Convergence and near optimality of q-learning with finite memory for partially observed models. In2021 60th IEEE Conference on Decision and Control (CDC), pages 1603–1608. IEEE, 2021

work page 2021

[24] [24]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Near optimality of finite memory feedback policies in partially observed markov decision processes.Journal of Machine Learning Research, 23(11):1–46, 2022

work page 2022

[25] [25]

Y ¨uksel

A.D Kara and S. Y ¨uksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. Mathematics of Operations Research, 48(4):2066–2093, 2023

work page 2066

[26] [26]

Kara and S

A.D. Kara and S. Y ¨uksel. Q-learning for stochastic control under general information structures and non-markovian environments.Transactions on Machine Learning Research, 2024. Featured Certification

work page 2024

[27] [27]

Multi-agent reinforcement learning as a rehearsal for decentralized planning.Neurocomputing, 190:82–94, 2016

Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning.Neurocomputing, 190:82–94, 2016

work page 2016

[28] [28]

Kurtaran

B. Kurtaran. Corrections and extensions to decentralized control with delayed sharing information pattern.IEEE Transactions on Automatic Control, 24:656–657, August 1979

work page 1979

[29] [29]

Stick-breaking policy learning in dec-pomdps.arXiv preprint arXiv:1505.00274, 2015

Miao Liu, Christopher Amato, Xuejun Liao, Lawrence Carin, and Jonathan P How. Stick-breaking policy learning in dec-pomdps.arXiv preprint arXiv:1505.00274, 2015

work page arXiv 2015

[30] [30]

Principled learning-to- communicate in cooperative marl: An information-structure perspective

Xiangyu Liu, Haoyi You, and Kaiqing Zhang. Principled learning-to- communicate in cooperative marl: An information-structure perspective. InCoCoMARL 2025 Workshop, June 2025. Poster at CoCoMARL 2025

work page 2025

[31] [31]

Partially observable multi-agent reinforcement learning with information sharing,

Xiangyu Liu and Kaiqing Zhang. Partially observable multi-agent reinforcement learning with information sharing.arXiv preprint arXiv:2308.08705, 2023

work page arXiv 2023

[32] [32]

W. Mao, K. Zhang, Z. Yang, and T. Bas ¸ar. Decentralized learn- ing of finite-memory policies in dec-pomdps.IFAC-PapersOnLine, 56(2):2601–2607, 2023

work page 2023

[33] [33]

Di Masi and L

G. Di Masi and L. Stettner. Ergodicity of hidden markov models. Mathematics of Control, Signals and Systems, 17(4):269–296, 2005

work page 2005

[34] [34]

McDonald and S

C. McDonald and S. Y ¨uksel. Exponential filter stability via Dobrushin’s coefficient.Electronic Communications in Probability, 25, 2020

work page 2020

[35] [35]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. Optimal control strategies in delayed sharing information structures.IEEE Transactions on Automatic Control, 56:1606–1620, 2011

work page 2011

[36] [36]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. The common-information approach to decentralized stochastic control. In Information and Control in Networks, Editors: G. Como, B. Bernhardsson, A. Rantzer. Springer, 2013

work page 2013

[37] [37]

Nayyar, A

A. Nayyar, A. Mahajan, and D. Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58:1644–1658, 2013

work page 2013

[38] [38]

Asymptotic optimality of finite model approximations for partially observed markov decision processes with discounted cost.IEEE transactions on automatic control, 2020

N.Saldi, S.Y ¨uksel, and Linder T. Asymptotic optimality of finite model approximations for partially observed markov decision processes with discounted cost.IEEE transactions on automatic control, 2020

work page 2020

[39] [39]

Centralized reduction of decen- tralized stochastic control models and their weak-feller regularity

O.Mrani-Zentar and S.Y ¨uksel. Centralized reduction of decen- tralized stochastic control models and their weak-feller regularity. arXiv:2408.13828, 2024

work page arXiv 2024

[40] [40]

J. Ooi, S. Verbout, J. Ludwig, and G. Wornell. A separation theorem for periodic sharing information patterns in decentralized control.IEEE Transactions on Automatic Control, 42:1546–1550, November 1997

work page 1997

[41] [41]

An improved grid-based approximation algo- rithm for pomdps.Proceeding of the 17th international joint conference on Artificial intelligence, 2001

R.Zhou and E.A.Hansen. An improved grid-based approximation algo- rithm for pomdps.Proceeding of the 17th international joint conference on Artificial intelligence, 2001

work page 2001

[42] [42]

N. Saldi. Common information approach for static team problems with polish spaces and existence of optimal policies.arXiv preprint arXiv:2309.07571, 2023

work page arXiv 2023

[43] [43]

Saldi, T

N. Saldi, T. Linder, and S. Y ¨uksel.Finite Approximations in Discrete- Time Stochastic Control: Quantized Models and Asymptotic Optimality. Springer, Cham, 2018

work page 2018

[44] [44]

M. Sch ¨al. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal.Z. Wahrscheinlichkeitsth, 32:179–296, 1975

work page 1975

[45] [45]

R. Serfozo. Convergence of Lebesgue integrals with varying measures. Sankhy¯a: The Indian Journal of Statistics, Series A, pages 380–402, 1982

work page 1982

[46] [46]

Subramanian, A

J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems.The Journal of Machine Learning Research, 23(1):483–565, 2022

work page 2022

[47] [47]

Varaiya and J

P. Varaiya and J. Walrand. On delayed-sharing patterns.IEEE Transactions on Automatic Control, 23:443–445, June 1978

work page 1978

[48] [48]

V .Subramanian and H.Kao. Common information based approximate state representations in multi-agent reinforcement learning.Proccedings of the 25 th International Conference on Artificial Intelligence and Statistics, 2022

work page 2022

[49] [49]

C. J. C. H. Watkins and P. Dayan. Q-learning.Machine Learning, 8:279–292, 1992. 16

work page 1992

[50] [50]

H. S. Witsenhausen. A counterexample in stochastic optimal control. SIAM J. Contr., 6:131–147, 1968

work page 1968

[51] [51]

H. S. Witsenhausen. A standard form for sequential stochastic control. Mathematical Systems Theory, 7:5–11, 1973

work page 1973

[52] [52]

H. S. Witsenhausen. The intrinsic model for discrete stochastic control: Some open problems.Lecture Notes in Econ. and Math. Syst., Springer- Verlag, 107:322–335, 1975

work page 1975

[53] [53]

H. S. Witsenhausen. On the structure of real-time source coders.Bell Syst. Tech. J, 58:1437–1451, July/August 1979

work page 1979

[54] [54]

Monte-carlo expectation maximization for decentralized pomdps.Proccedings of the International joint conference on artificial intelligence, 2013

Feng Wu, Shlomo Zilberstein, and Nicholas R Jennings. Monte-carlo expectation maximization for decentralized pomdps.Proccedings of the International joint conference on artificial intelligence, 2013

work page 2013

[55] [55]

Yoshikawa

T. Yoshikawa. Dynamic programming approach to decentralized control problems.IEEE Transactions on Automatic Control, 20:796–797, 1975

work page 1975

[56] [56]

Y ¨uksel

S. Y ¨uksel. Stochastic nestedness and the belief sharing information pattern.IEEE Transactions on Automatic Control, 54:2773–2786, December 2009

work page 2009

[57] [57]

Y ¨uksel

S. Y ¨uksel. A universal dynamic program and refined existence results for decentralized stochastic control.SIAM Journal on Control and Optimization, 58(5):2711–2739, 2020

work page 2020

[58] [58]

Y ¨uksel

S. Y ¨uksel. On Borkar and Young relaxed control topologies and continuous dependence of invariant measures on control policy.SIAM Journal on Control and Optimization, 62(4):2367–2386, 2024

work page 2024

[59] [59]

Y ¨uksel and T

S. Y ¨uksel and T. Bas ¸ar.Stochastic Teams, Games, and Control under Information Constraints. Springer, Cham, 2024

work page 2024

[60] [60]

Robustness to model approximation, learning, and sample complexity in wasserstein regular mdps.arXiv preprint arXiv:2410.14116, 2024

Yichen Zhou, Yanglei Song, and Serdar Y ¨uksel. Robustness to model approximation, learning, and sample complexity in wasserstein regular mdps.arXiv preprint arXiv:2410.14116, 2024. VII. APPENDIX A. Proof of Proposition III.7 Consider a sequence (πn q−M , IM q,n, aq,n)→(π q−M , IM q , aq) asn→ ∞and letfbe a continuous and bounded function. Then, it is suf...

work page arXiv 2024