Generating Local Shields for Decentralised Partially Observable Markov Decision Processes

Haoran Yang (University of Oxford); Nobuko Yoshida (University of Oxford)

arxiv: 2604.06873 · v1 · submitted 2026-04-08 · 💻 cs.MA

Generating Local Shields for Decentralised Partially Observable Markov Decision Processes

Haoran Yang (University of Oxford) , Nobuko Yoshida (University of Oxford) This is my paper

Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3

classification 💻 cs.MA

keywords decentralized POMDPshieldingprocess algebraMealy machinemulti-agent safetypartial observabilitypath findingsafety filter

0 comments

The pith

A shield process algebra compiles to local Mealy machines that filter unsafe joint actions for each agent in Dec-POMDPs using only local observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent systems under partial observation often cannot ensure safety because individual agents lack knowledge of others' states and cannot communicate. The paper introduces a shield process algebra with guarded choice and recursion to specify safe global behavior in these communication-free Dec-POMDP settings. From any such specification, the approach builds a process automaton, then a global Mealy machine that accepts only safe joint actions, and finally projects that machine onto local Mealy machines. Each local machine tracks the subset of global states consistent with one agent's observations and outputs the safe action set for that agent. This pipeline is implemented in Rust with PRISM integration to compute safety probabilities, and a path-finding case study shows substantial collision reductions compared to the unshielded case.

Core claim

We define a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication-free Dec-POMDP settings. From a shield process we compile a process automaton, then a global Mealy machine as a safe joint-action filter, and finally project it to local Mealy machines whose states are belief-style subsets of the global Mealy machine states consistent with each agent's observations, and which output per-agent safe action sets.

What carries the argument

The projection from a global Mealy machine onto per-agent local Mealy machines whose states are observation-consistent subsets of global states and whose outputs are restricted safe action sets.

If this is right

Agents can select only safe actions using solely their local observations and the precomputed local Mealy machines.
Best- and worst-case safety probabilities can be computed independently of the agents' policies via integration with probabilistic model checkers.
Different shield process specifications produce varying trade-offs between expressiveness and conservatism while still reducing unsafe events such as collisions.
The method supplies a formal compilation pipeline from high-level process specifications to runtime local filters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compilation steps could be reused with other specification languages if they also admit a global Mealy-machine representation.
Belief-subset states in the local machines open a route to memory-bounded implementations that scale with observation alphabet size rather than full state space.
The separation between specification, global filter, and local projections allows independent verification of the shield before deployment.

Load-bearing premise

The projection from global Mealy machine to local belief subsets preserves the original safety guarantees without requiring communication or full state knowledge.

What would settle it

An execution trace in which every agent chooses an action permitted by its local Mealy machine yet the resulting joint action violates a safety condition enforced by the global Mealy machine.

Figures

Figures reproduced from arXiv: 2604.06873 by Haoran Yang (University of Oxford), Nobuko Yoshida (University of Oxford).

**Figure 2.** Figure 2: Initial state for the “Blind Agents” (R = 0). Since each agent’s local observation remains constant over time, it is impossible to construct a shield that depends solely on local observations [7]. Likewise, DFA-based constructions that attempt to avoid unsafe states [3] are ineffective in this setting, because no action available to any agent can be guaranteed to lead, in combination, to a safe joint actio… view at source ↗

**Figure 3.** Figure 3: Shield process automaton for the “Blind Agents”. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Global shield Mealy machine for the “Blind Agents”. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Local shield Mealy machines in the "Blind Agents". [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Evaluation curves under training for learned policies with shields under 3 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Multi-agent systems under partial observation often struggle to maintain safety because each agent's locally chosen action does not, in general, determine the resulting joint action. Shielding addresses this by filtering actions based on the current state, but most existing techniques either assume access to a shared centralised global state or employ memoryless local filters that cannot consider interaction history. We introduce a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication-free Dec-POMDP settings. From a shield process, we compile a process automaton, then a global Mealy machine as a safe joint-action filter, and finally project it to local Mealy machines whose states are belief-style subsets of the global Mealy machine states consistent with each agent's observations, and which output per-agent safe action sets. We implement the pipeline in Rust and integrate PRISM, the Probabilistic Symbolic Model Checker, to compute best- and worst-case safety probabilities independently of the agents' policies. A multi-agent path-finding case study demonstrates how different shield processes substantially reduce collisions compared to the unshielded baseline while exhibiting varying levels of expressiveness and conservatism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a shield process algebra with guarded choice and recursion to specify safe global joint-action behavior in communication-free Dec-POMDPs. From a shield process it compiles a process automaton, then a global Mealy machine that filters joint actions, and finally projects the global machine to per-agent local Mealy machines whose states are observation-consistent subsets of global states and whose outputs are safe local action sets. The pipeline is implemented in Rust, integrated with PRISM to compute best- and worst-case safety probabilities independently of agent policies, and evaluated on a multi-agent path-finding case study in which shielded agents exhibit fewer collisions than an unshielded baseline.

Significance. If the projection step is shown to preserve global safety, the work would supply a formal, policy-independent route to decentralized shielding under partial observability, combining process algebra, automata compilation, and symbolic model checking. The explicit separation of shield specification from agent policies and the use of PRISM for quantitative safety bounds are methodological strengths that could be reused in other Dec-POMDP domains. The path-finding case study provides an existence proof of practical utility, although the absence of numerical collision rates and PRISM error bounds currently limits assessment of the magnitude of improvement.

major comments (2)

[§4] §4 (Projection from global to local Mealy machines): The manuscript asserts that the local belief-style Mealy machines preserve the safety guarantees of the global shield without inter-agent communication. However, no theorem, inductive argument, or formal statement is supplied showing that every tuple of actions chosen independently from the local safe sets is accepted by the global machine in the underlying true state. Because the belief construction over-approximates possible worlds, it is possible for locally permitted combinations to be globally unsafe; this gap is load-bearing for the central decentralised-safety claim.
[§5] §5 (Case study and PRISM integration): The path-finding experiments are described only qualitatively (“substantially reduce collisions” and “varying levels of expressiveness and conservatism”). No numerical collision rates, PRISM-computed safety probabilities with error bounds, or statistical comparison against the unshielded baseline are reported. Without these data it is impossible to verify the empirical support for the pipeline’s effectiveness.

minor comments (2)

[§3] The abstract and §3 refer to “belief-style subsets” without a precise definition or pseudocode for the subset construction; a small clarifying paragraph or algorithm box would improve readability.
The Rust implementation and PRISM interface are mentioned but no repository link, version numbers, or reproduction instructions are supplied; adding these would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each of the major comments below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [§4] §4 (Projection from global to local Mealy machines): The manuscript asserts that the local belief-style Mealy machines preserve the safety guarantees of the global shield without inter-agent communication. However, no theorem, inductive argument, or formal statement is supplied showing that every tuple of actions chosen independently from the local safe sets is accepted by the global machine in the underlying true state. Because the belief construction over-approximates possible worlds, it is possible for locally permitted combinations to be globally unsafe; this gap is load-bearing for the central decentralised-safety claim.

Authors: We acknowledge that the current manuscript does not contain an explicit theorem or inductive proof for the safety preservation property under projection. The local Mealy machines are derived by projecting the global machine onto per-agent observation-consistent state subsets (beliefs). These subsets over-approximate the possible global states from each agent's perspective. The local safe action sets are computed as the largest sets S_i for each agent such that every possible joint action formed by taking one action from each S_i is accepted by the global Mealy machine in every state belonging to the agent's belief subset. Because the true state is contained in each agent's belief, any independently chosen tuple from the local safe sets will be safe in the true state. This construction ensures the decentralised safety property, albeit potentially conservatively. To address the referee's concern, we will insert a formal theorem statement together with a proof sketch in Section 4 of the revised manuscript. revision: yes
Referee: [§5] §5 (Case study and PRISM integration): The path-finding experiments are described only qualitatively (“substantially reduce collisions” and “varying levels of expressiveness and conservatism”). No numerical collision rates, PRISM-computed safety probabilities with error bounds, or statistical comparison against the unshielded baseline are reported. Without these data it is impossible to verify the empirical support for the pipeline’s effectiveness.

Authors: We agree that the experimental results in the case study are presented qualitatively in the current version. The manuscript does report that shielded agents exhibit fewer collisions, but without specific numbers or PRISM error bounds. In the revised manuscript, we will augment Section 5 with quantitative data, including numerical collision rates for shielded and unshielded agents, the best- and worst-case safety probabilities computed by PRISM along with their error bounds, and a statistical comparison to the baseline. revision: yes

Circularity Check

0 steps flagged

No circularity: constructive compilation from shield process to local Mealy machines via standard automata projection.

full rationale

The paper describes a pipeline that compiles a shield process algebra into a process automaton, then a global Mealy machine, and finally projects it to local belief-style Mealy machines using observation-consistent subsets. This is a standard constructive translation and projection operation in automata theory, integrated with PRISM for probability computation. No equations or steps reduce the safety claims to fitted parameters, self-definitions, or load-bearing self-citations; the method relies on external model-checking tools and does not rename known results or smuggle ansatzes. The derivation chain is self-contained as a methodological construction without reducing outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the correctness of the compilation and projection steps for preserving safety, which rest on domain assumptions about Dec-POMDP belief states and the expressive power of the new process algebra; no free parameters or independent evidence for the invented constructs are mentioned.

axioms (2)

domain assumption Dec-POMDPs admit belief-style state subsets consistent with local observations that can be used for local decision making
Standard modeling assumption in partially observable multi-agent systems invoked to justify the local Mealy machine construction
ad hoc to paper A process algebra with guarded choice and recursion can specify safe global joint-action behavior
Core new construct introduced to define the shield process

invented entities (2)

Shield process algebra no independent evidence
purpose: To specify safe global behaviour in communication-free Dec-POMDP settings
New language introduced for the shielding specification
Local Mealy machines no independent evidence
purpose: To output per-agent safe action sets from observation-consistent subsets of global states
Derived construct for decentralized enforcement

pith-pipeline@v0.9.0 · 5496 in / 1640 out tokens · 70991 ms · 2026-05-10T17:36:56.709954+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

In: AAAI

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum & Ufuk Topcu (2018): Safe Reinforcement Learning via Shielding . Proceedings of the AAAI Conference on Artificial In- telligence 32(1), doi:10.1609/aaai.v32i1.11797. Available at https://ojs.aaai.org/index.php/AAAI/ article/view/11797

work page doi:10.1609/aaai.v32i1.11797 2018
[2]

arXiv:1501.02573

Roderick Bloem, Bettina Koenighofer, Robert Koenighofer & Chao Wang (2015): Shield Synthesis: Runtime Enforcement for Reactive Systems, doi:10.48550/arXiv.1501.02573. arXiv:1501.02573

work page doi:10.48550/arxiv.1501.02573 2015
[3]

Safe multi-agent reinforcement learning via shielding,

Ingy Elsayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu & Lu Feng (2021):Safe Multi-Agent Reinforcement Learning via Shielding, doi:10.48550/arXiv.2101.11196. arXiv:2101.11196

work page doi:10.48550/arxiv.2101.11196 2021
[4]

In: CONCUR

Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban & Roderick Bloem (2020): Safe Reinforce- ment Learning Using Probabilistic Shields . In Igor Konnov & Laura Kovacs, editors: 31st International Conference on Concurrency Theory, CONCUR 2020 , Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Germany, pp. 31–316, doi:10.4230/LIPIcs.CONCUR.2020...

work page doi:10.4230/lipics.concur.2020.3 2020
[5]

Mealy (1955): A method for synthesizing sequential circuits

George H. Mealy (1955): A method for synthesizing sequential circuits. The Bell System Technical Journal 34(5), pp. 1045–1079, doi:10.1002/j.1538-7305.1955.tb03788.x

work page doi:10.1002/j.1538-7305.1955.tb03788.x 1955
[6]

Daniel Melcer, Christopher Amato & Stavros Tripakis (2022): Shield Decentralization for Safe Multi- Agent Reinforcement Learning . In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho & A. Oh, editors: Advances in Neural Information Processing Systems , 35, Curran Associates, Inc., pp. 13367–13379. Available at https://proceedings.neurips.cc/paper_fi...

work page 2022
[7]

Daniel Melcer, Christopher Amato & Stavros Tripakis (2024): Shield Decentralization for Safe Re- inforcement Learning in General Partially Observable Multi-Agent Environments . pp. 2384–2386, doi:10.65109/V AAR1124

work page doi:10.65109/v 2024
[8]

In: The Seventeenth Workshop on Adaptive and Learning Agents

Daniel Melcer, Stavros Tripakis & Christopher Amato (2025): Learned Shields for Multi-Agent Rein- forcement Learning . In: The Seventeenth Workshop on Adaptive and Learning Agents . Available at https://openreview.net/forum?id=DXHxmyq5kO

work page 2025
[9]

León (2021): An Abstraction-based Method to Verify Multi-Agent Deep Reinforcement-Learning Behaviours

Pierre El Mqirmi, Francesco Belardinelli & Borja G. León (2021): An Abstraction-based Method to Verify Multi-Agent Deep Reinforcement-Learning Behaviours . CoRR abs/2102.01434, doi:10.48550/arXiv.2102.01434. arXiv:2102.01434

work page doi:10.48550/arxiv.2102.01434 2021
[10]

A concise introduction to decentralized POMDPs, volume 1

Frans Oliehoek & Christopher Amato (2016): A Concise Introduction to Decentralized POMDPs . doi:10.1007/978-3-319-28929-8

work page doi:10.1007/978-3-319-28929-8 2016
[11]

Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne T

Roni Stern, Nathan R. Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne T. Walker, Jiaoyang Li, Dor Atzmon, Liron Cohen, T. K. Satish Kumar, Eli Boyarski & Roman Barták (2019): Multi-Agent Pathfind- ing: Definitions, Variants, and Benchmarks . CoRR abs/1906.08291, doi:10.48550/arXiv.1906.08291. arXiv:1906.08291

work page doi:10.48550/arxiv.1906.08291 2019
[12]

Christopher J. C. H. Watkins & Peter Dayan (1992): Technical Note: Q -Learning. Mach. Learn. 8(3–4), p. 279–292, doi:10.1007/BF00992698

work page doi:10.1007/bf00992698 1992

[1] [1]

In: AAAI

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum & Ufuk Topcu (2018): Safe Reinforcement Learning via Shielding . Proceedings of the AAAI Conference on Artificial In- telligence 32(1), doi:10.1609/aaai.v32i1.11797. Available at https://ojs.aaai.org/index.php/AAAI/ article/view/11797

work page doi:10.1609/aaai.v32i1.11797 2018

[2] [2]

arXiv:1501.02573

Roderick Bloem, Bettina Koenighofer, Robert Koenighofer & Chao Wang (2015): Shield Synthesis: Runtime Enforcement for Reactive Systems, doi:10.48550/arXiv.1501.02573. arXiv:1501.02573

work page doi:10.48550/arxiv.1501.02573 2015

[3] [3]

Safe multi-agent reinforcement learning via shielding,

Ingy Elsayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu & Lu Feng (2021):Safe Multi-Agent Reinforcement Learning via Shielding, doi:10.48550/arXiv.2101.11196. arXiv:2101.11196

work page doi:10.48550/arxiv.2101.11196 2021

[4] [4]

In: CONCUR

Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban & Roderick Bloem (2020): Safe Reinforce- ment Learning Using Probabilistic Shields . In Igor Konnov & Laura Kovacs, editors: 31st International Conference on Concurrency Theory, CONCUR 2020 , Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Germany, pp. 31–316, doi:10.4230/LIPIcs.CONCUR.2020...

work page doi:10.4230/lipics.concur.2020.3 2020

[5] [5]

Mealy (1955): A method for synthesizing sequential circuits

George H. Mealy (1955): A method for synthesizing sequential circuits. The Bell System Technical Journal 34(5), pp. 1045–1079, doi:10.1002/j.1538-7305.1955.tb03788.x

work page doi:10.1002/j.1538-7305.1955.tb03788.x 1955

[6] [6]

Daniel Melcer, Christopher Amato & Stavros Tripakis (2022): Shield Decentralization for Safe Multi- Agent Reinforcement Learning . In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho & A. Oh, editors: Advances in Neural Information Processing Systems , 35, Curran Associates, Inc., pp. 13367–13379. Available at https://proceedings.neurips.cc/paper_fi...

work page 2022

[7] [7]

Daniel Melcer, Christopher Amato & Stavros Tripakis (2024): Shield Decentralization for Safe Re- inforcement Learning in General Partially Observable Multi-Agent Environments . pp. 2384–2386, doi:10.65109/V AAR1124

work page doi:10.65109/v 2024

[8] [8]

In: The Seventeenth Workshop on Adaptive and Learning Agents

Daniel Melcer, Stavros Tripakis & Christopher Amato (2025): Learned Shields for Multi-Agent Rein- forcement Learning . In: The Seventeenth Workshop on Adaptive and Learning Agents . Available at https://openreview.net/forum?id=DXHxmyq5kO

work page 2025

[9] [9]

León (2021): An Abstraction-based Method to Verify Multi-Agent Deep Reinforcement-Learning Behaviours

Pierre El Mqirmi, Francesco Belardinelli & Borja G. León (2021): An Abstraction-based Method to Verify Multi-Agent Deep Reinforcement-Learning Behaviours . CoRR abs/2102.01434, doi:10.48550/arXiv.2102.01434. arXiv:2102.01434

work page doi:10.48550/arxiv.2102.01434 2021

[10] [10]

A concise introduction to decentralized POMDPs, volume 1

Frans Oliehoek & Christopher Amato (2016): A Concise Introduction to Decentralized POMDPs . doi:10.1007/978-3-319-28929-8

work page doi:10.1007/978-3-319-28929-8 2016

[11] [11]

Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne T

Roni Stern, Nathan R. Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne T. Walker, Jiaoyang Li, Dor Atzmon, Liron Cohen, T. K. Satish Kumar, Eli Boyarski & Roman Barták (2019): Multi-Agent Pathfind- ing: Definitions, Variants, and Benchmarks . CoRR abs/1906.08291, doi:10.48550/arXiv.1906.08291. arXiv:1906.08291

work page doi:10.48550/arxiv.1906.08291 2019

[12] [12]

Christopher J. C. H. Watkins & Peter Dayan (1992): Technical Note: Q -Learning. Mach. Learn. 8(3–4), p. 279–292, doi:10.1007/BF00992698

work page doi:10.1007/bf00992698 1992