Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Enrico Marchesini; Luca Marzari

arxiv: 2605.14758 · v1 · pith:OFFOCYIDnew · submitted 2026-05-14 · 💻 cs.AI

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Luca Marzari , Enrico Marchesini This is my paper

Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords RNN verificationprobabilistic verificationreinforcement learningmulti-agent RLhidden state approximationpolicy samplingstatistical boundspartially observable RL

0 comments

The pith

RNN-ProVe estimates the likelihood of undesired behaviors in RNN policies by sampling feasible hidden states and applying statistical error bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a probabilistic verification framework called RNN-ProVe for recurrent neural network policies in reinforcement learning. It approximates the set of hidden states reachable under a trained policy through targeted sampling and derives statistical bounds to produce high-confidence estimates of behavioral violations. This addresses limitations in existing tools that rely on restrictive assumptions or coarse over-approximations, which often yield conservative or inconclusive results. A sympathetic reader would care because the method targets partially observable single-agent and multi-agent settings where history-dependent policies make standard verification difficult.

Core claim

RNN-ProVe estimates the likelihood of undesired behaviors in RNN-based policies. It uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.

What carries the argument

Policy-driven sampling to approximate feasible hidden states under the trained policy, paired with statistical error bounds on violation likelihood estimates.

If this is right

Produces bounded-error high-confidence estimates of behavioral violations instead of binary or overly conservative outcomes.
Applies directly to recurrent policies in partially observable environments without requiring restrictive modeling assumptions.
Extends verification coverage to cooperative multi-agent reinforcement learning tasks.
Yields feasibility-aware probabilistic guarantees that reflect what states are actually reachable under the policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sampling-plus-bounds approach could be combined with online monitoring to flag high-risk states during policy execution.
If the statistical bounds prove tight in practice, they might serve as a lightweight alternative to exhaustive reachability analysis for continuous hidden-state spaces.
The method opens a path to verifying policies whose hidden-state dynamics are too complex for symbolic or abstraction-based tools.

Load-bearing premise

That sampling trajectories from the trained policy produces a representative approximation of all hidden states the policy can actually reach.

What would settle it

An experiment that computes the exact violation probability on a small RNN policy instance and shows that the sampled estimate with its reported bounds fails to contain the true value.

Figures

Figures reproduced from arXiv: 2605.14758 by Enrico Marchesini, Luca Marzari.

**Figure 2.** Figure 2: An explanatory RNN transition [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: RNN-VERIFICATION for RL navigation: (a) The agent selects a safe action that avoids a collision. (b) The verification problem asks whether there exists a feasible history h (e.g., h ′ t , h′′ t ) such that the agent selects an undesired action, i.e., f(h, o) ≤ 0. As in robustness verification, we assume that the behavior of interest can be expressed using a single scalar output whose sign indicates wheth… view at source ↗

**Figure 4.** Figure 4: RNN-ProVe overview [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: (Top) Single-agent navigation with red obstacles, where [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Violation heatmap for the 8×8 navigation task. Each grid cell (state) is colored according to the % of violations computed via RNN-ProVe. Red cells indicate higher violations, revealing regions of the environment where the learned policy is more sensitive. portion of the feasible history space. When used within the RNN-ProVe pipeline, this data enables learning a feasibility classifier that closely approxi… view at source ↗

**Figure 7.** Figure 7: An explanatory RNN transition. Example 1. Let the nominal input be x = [1, 2]⊤ (i.e., an input sequence of length 2) and perturbation radius ε = 1, yielding intervals x1 ∈ [0, 2] and x2 ∈ [1, 3]. We initialize H0 = {0}. The verification proceeds by propagating intervals through time. For t=1, we obtain h1 = ReLU(h0 + x1) = ReLU(0 + [0, 2]) = [0, 2]. Similarly for t = 2, we have h2 = ReLU(h1 + x2) = ReLU([1… view at source ↗

**Figure 8.** Figure 8: Environments and training performance. (Left) Single-agent navigation task with obstacles in red and the corresponding training [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

read the original abstract

History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose $\textbf{RNN}$ $\textbf{Pro}$babilistic $\textbf{Ve}$rification ($\texttt{RNN-ProVe}$), a probabilistic framework that $\textit{estimates the likelihood}$ of undesired behaviors in RNN-based policies. $\texttt{RNN-ProVe}$ uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that $\texttt{RNN-ProVe}$ yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RNN-ProVe samples trajectories under the policy to bound violation probabilities for RNN agents, but the statistical guarantees stand or fall on whether those samples actually represent the reachable hidden-state distribution.

read the letter

The paper's core move is to replace over-approximations of RNN hidden states with policy-driven sampling plus concentration bounds, then report high-confidence estimates of bad behavior in both single-agent POMDPs and cooperative multi-agent settings. That is the main novelty: a practical, feasibility-aware alternative to the conservative tools that currently dominate RNN verification.

What works is the problem framing and the experimental scope. History-dependent policies are a real deployment obstacle in partially observable RL, and the authors show results on tasks where existing methods either time out or return vacuous bounds. The multi-agent experiments are a plus; most prior verification work stops at single agents.

The soft spot is exactly the one flagged in the stress test. The statistical bounds are only valid if the sampled trajectories give a representative picture of the hidden-state distribution induced by the trained policy. The abstract does not supply a proof or even a clear argument that policy rollouts achieve this; correlated samples, incomplete exploration of the RNN state manifold, or coordination artifacts in the multi-agent case could all bias the estimate and invalidate the error bounds. Without that justification, the quantitative guarantees are conditional on an assumption that is not shown to hold.

The paper is aimed at the RL verification community rather than a broad audience. Readers already working on safety properties of learned controllers will find the direction useful and the experiments worth examining. It is coherent on its own terms and engages the literature, so it deserves a serious referee even though the sampling representativeness claim will need close scrutiny in review.

Referee Report

1 major / 0 minor

Summary. The paper proposes RNN-ProVe, a probabilistic verification framework for RNN-based policies in partially observable single-agent and multi-agent RL. It approximates the feasible hidden-state set under a trained policy via policy-driven sampling, then derives statistical error bounds to produce bounded-error, high-confidence estimates of the probability of undesired behaviors. Experiments on partially observable tasks are claimed to show that RNN-ProVe provides more quantitative and feasibility-aware guarantees than existing tools while scaling to recurrent and multi-agent settings.

Significance. If the sampling procedure can be shown to produce a sufficiently representative approximation of the reachable hidden-state distribution and the statistical bounds are rigorously derived, the method could provide a practical alternative to conservative over-approximations for verifying history-dependent policies where exact verification is intractable. This would be particularly valuable in multi-agent cooperative settings.

major comments (1)

[Abstract] Abstract: The claim that RNN-ProVe 'derives statistical error bounds' to produce 'bounded-error, high-confidence estimates' is load-bearing on the unstated assumption that policy-driven sampling yields a representative (unbiased or sufficiently covering) approximation of the reachable hidden-state set under the policy. No derivation, sampling algorithm, or argument establishing this representativeness is supplied in the provided text; without it, the concentration inequalities cannot be guaranteed to apply, directly undermining the central probabilistic guarantee.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for highlighting an important point about the foundations of our probabilistic claims. We respond to the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that RNN-ProVe 'derives statistical error bounds' to produce 'bounded-error, high-confidence estimates' is load-bearing on the unstated assumption that policy-driven sampling yields a representative (unbiased or sufficiently covering) approximation of the reachable hidden-state set under the policy. No derivation, sampling algorithm, or argument establishing this representativeness is supplied in the provided text; without it, the concentration inequalities cannot be guaranteed to apply, directly undermining the central probabilistic guarantee.

Authors: Policy-driven sampling generates trajectories by executing the trained RNN policy within the POMDP, so the observed hidden states are drawn exactly from the distribution induced by the policy and the environment dynamics. This construction ensures the samples are representative of the reachable hidden-state distribution under the policy, allowing standard concentration inequalities to apply directly to the empirical estimates of violation probabilities. The sampling algorithm, its unbiasedness properties, and the derivation of the resulting error bounds are presented in Section 3 of the manuscript. We agree that the abstract would benefit from an explicit reference to this property and will revise it accordingly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent statistical estimation

full rationale

The provided abstract and description present RNN-ProVe as using policy-driven sampling to approximate reachable hidden states followed by standard statistical error bounds for violation probabilities. This is a conventional Monte Carlo-style estimation procedure whose validity depends on sampling assumptions but does not reduce any claimed prediction or bound to the inputs by definition or self-citation. No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The method is self-contained against external benchmarks for its statistical component.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5696 in / 1034 out tokens · 25075 ms · 2026-06-30T20:50:12.725412+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 4 canonical work pages · 2 internal anchors

[1]

An introduction to centralized training for decentralized execution in cooperative multi-agent re- inforcement learning.arXiv preprint arXiv:2409.03052,

Christopher Amato. An introduction to centralized training for decentralized execution in cooperative multi-agent re- inforcement learning.arXiv preprint arXiv:2409.03052,

work page arXiv
[2]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Kyunghyun Cho, Bart Van Merri¨enboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches.arXiv preprint arXiv:1409.1259,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Deep Recurrent Q-Learning for Partially Observable MDPs

Matthew Hausknecht and Peter Stone. Deep recurrent q- learning for partially observable mdps.arXiv preprint arXiv:1507.06527,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Safe deep reinforcement learning by verifying task-level properties

Enrico Marchesini, Luca Marzari, Alessandro Farinelli, and Christopher Amato. Safe deep reinforcement learning by verifying task-level properties. InProceedings of the 2023 International Conference on Autonomous Agents and Mul- tiagent Systems, AAMAS 2023, pages 1466–1475. ACM,

2023
[5]

Rl2grid: Benchmarking reinforcement learning in power grid oper- ations.arXiv preprint arXiv:2503.23101,

Enrico Marchesini, Benjamin Donnot, Constance Crozier, Ian Dytham, Christian Merz, Lars Schewe, Nico Westerbeck, Cathy Wu, Antoine Marot, and Priya L Donti. Rl2grid: Benchmarking reinforcement learning in power grid oper- ations.arXiv preprint arXiv:2503.23101,

work page arXiv
[6]

Donti, Changliu Liu, and En- rico Marchesini

Luca Marzari, Priya L. Donti, Changliu Liu, and En- rico Marchesini. Improving policy optimization viaϵ- retrain. InProceedings of the 24th International Con- ference on Autonomous Agents and Multiagent Systems, AAMAS 2025, pages 1464–1472. International Foundation for Autonomous Agents and Multiagent Systems / ACM,

2025
[7]

Advancing neural network verification through hierarchi- cal safety abstract interpretation

Luca Marzari, Isabella Mastroeni, and Alessandro Farinelli. Advancing neural network verification through hierarchi- cal safety abstract interpretation. InECAI 2025 - 28th Eu- ropean Conference on Artificial Intelligence, pages 1736– 1743,

2025
[8]

Yun, Peizhi Niu, Xusheng Luo, and Changliu Liu

Tianhao Wei, Hanjiang Hu, Luca Marzari, Kai S. Yun, Peizhi Niu, Xusheng Luo, and Changliu Liu. Modelverification.jl: A comprehensive toolbox for formally verifying deep neu- ral networks. InComputer Aided Verification - 37th In- ternational Conference, CAV 2025, volume 15932 ofLec- ture Notes in Computer Science, pages 395–408. Springer,

2025
[9]

Finally,Y T =y T = 1·h 2 = [1,5]

= ReLU([1,5]) = [1,5]. Finally,Y T =y T = 1·h 2 = [1,5]. Sincemin(Y T ) = 1>0, the RNN is verified as robust. 7 On the Hardness of the #RNN-VERIFICATIONProblem. We now analyze the computational complexity of the #RNN-VERIFICATIONproblem. Theorem 1.The#RNN-VERIFICATIONproblem is #P-hard. Proof Sketch.The hardness result relies on the established relationsh...

2017
[10]

The problem requires characterizing the set of histories (hidden states)Γ(T) :={h∈ H |f(h, x)≤0}. If we fix the observationxand considerhas the input variable, determining the volumeV ol(Γ(T))is equivalent to determining the preimage volume of a DNNfthat maps to the unsafe region(−∞,0]. Hence, from the #P-hardness of the #DNN-VERIFICATIONit follows that #...

2020
[11]

Best performing parameters used for training are highlighted in bold

Table 2: Hyper-parameters candidate for initial grid search tuning. Best performing parameters used for training are highlighted in bold. Learning rate 3e-3,3e-4, 3e-5 γ0.9, 0.95, 0.99 Buffer size (episodes)1000, 2500, 5000 Batch size32, 64, 128 Sampling trajectory size 10, 25, 50,T Polyak averagingω0.995, 0.9998 N° hidden layers2, 3 Hidden layers size32,...

2000

[1] [1]

An introduction to centralized training for decentralized execution in cooperative multi-agent re- inforcement learning.arXiv preprint arXiv:2409.03052,

Christopher Amato. An introduction to centralized training for decentralized execution in cooperative multi-agent re- inforcement learning.arXiv preprint arXiv:2409.03052,

work page arXiv

[2] [2]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Kyunghyun Cho, Bart Van Merri¨enboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches.arXiv preprint arXiv:1409.1259,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Deep Recurrent Q-Learning for Partially Observable MDPs

Matthew Hausknecht and Peter Stone. Deep recurrent q- learning for partially observable mdps.arXiv preprint arXiv:1507.06527,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Safe deep reinforcement learning by verifying task-level properties

Enrico Marchesini, Luca Marzari, Alessandro Farinelli, and Christopher Amato. Safe deep reinforcement learning by verifying task-level properties. InProceedings of the 2023 International Conference on Autonomous Agents and Mul- tiagent Systems, AAMAS 2023, pages 1466–1475. ACM,

2023

[5] [5]

Rl2grid: Benchmarking reinforcement learning in power grid oper- ations.arXiv preprint arXiv:2503.23101,

Enrico Marchesini, Benjamin Donnot, Constance Crozier, Ian Dytham, Christian Merz, Lars Schewe, Nico Westerbeck, Cathy Wu, Antoine Marot, and Priya L Donti. Rl2grid: Benchmarking reinforcement learning in power grid oper- ations.arXiv preprint arXiv:2503.23101,

work page arXiv

[6] [6]

Donti, Changliu Liu, and En- rico Marchesini

Luca Marzari, Priya L. Donti, Changliu Liu, and En- rico Marchesini. Improving policy optimization viaϵ- retrain. InProceedings of the 24th International Con- ference on Autonomous Agents and Multiagent Systems, AAMAS 2025, pages 1464–1472. International Foundation for Autonomous Agents and Multiagent Systems / ACM,

2025

[7] [7]

Advancing neural network verification through hierarchi- cal safety abstract interpretation

Luca Marzari, Isabella Mastroeni, and Alessandro Farinelli. Advancing neural network verification through hierarchi- cal safety abstract interpretation. InECAI 2025 - 28th Eu- ropean Conference on Artificial Intelligence, pages 1736– 1743,

2025

[8] [8]

Yun, Peizhi Niu, Xusheng Luo, and Changliu Liu

Tianhao Wei, Hanjiang Hu, Luca Marzari, Kai S. Yun, Peizhi Niu, Xusheng Luo, and Changliu Liu. Modelverification.jl: A comprehensive toolbox for formally verifying deep neu- ral networks. InComputer Aided Verification - 37th In- ternational Conference, CAV 2025, volume 15932 ofLec- ture Notes in Computer Science, pages 395–408. Springer,

2025

[9] [9]

Finally,Y T =y T = 1·h 2 = [1,5]

= ReLU([1,5]) = [1,5]. Finally,Y T =y T = 1·h 2 = [1,5]. Sincemin(Y T ) = 1>0, the RNN is verified as robust. 7 On the Hardness of the #RNN-VERIFICATIONProblem. We now analyze the computational complexity of the #RNN-VERIFICATIONproblem. Theorem 1.The#RNN-VERIFICATIONproblem is #P-hard. Proof Sketch.The hardness result relies on the established relationsh...

2017

[10] [10]

The problem requires characterizing the set of histories (hidden states)Γ(T) :={h∈ H |f(h, x)≤0}. If we fix the observationxand considerhas the input variable, determining the volumeV ol(Γ(T))is equivalent to determining the preimage volume of a DNNfthat maps to the unsafe region(−∞,0]. Hence, from the #P-hardness of the #DNN-VERIFICATIONit follows that #...

2020

[11] [11]

Best performing parameters used for training are highlighted in bold

Table 2: Hyper-parameters candidate for initial grid search tuning. Best performing parameters used for training are highlighted in bold. Learning rate 3e-3,3e-4, 3e-5 γ0.9, 0.95, 0.99 Buffer size (episodes)1000, 2500, 5000 Batch size32, 64, 128 Sampling trajectory size 10, 25, 50,T Polyak averagingω0.995, 0.9998 N° hidden layers2, 3 Hidden layers size32,...

2000